From d.phillips at partner.samsung.com Wed Mar 5 23:47:59 2014
From: d.phillips at partner.samsung.com (Daniel Phillips)
Date: Wed, 05 Mar 2014 23:47:59 -0800
Subject: Design note: Allocation Heuristics
Message-ID: <5318282F.6070102@partner.samsung.com>

Until now, Tux3 has relied on a simple, linear allocation strategy where we always search for free blocks just after the last one allocated. This works wonderfully well with an empty volume, but is prone to fragmentation over time as deletes open up holes that tend to be filled in by unrelated blocks. A new patch set is now entering testing that attempts to provide good long term fragmentation resistance by implementing techniques analogous to those that have worked well for the Ext filesystems over the years. As with Ext, we hope to control fragmentation to the point where the services of a defragmentation utility are seldom if ever required.

Static versus adaptive allocation policy

I expect this work to proceed in two major stages, the first of which is the current patch set. It implements a basic "static" allocation policy: we establish an allocation goal for each inode at the time the inode is created, and that goal persists over the life of the object. Later, we will introduce "adaptive" behavior, where inode allocation goals may change over time in response to observed allocation patterns.

Hopefully, a simple static allocation policy will already work well enough for Tux3 to be usable as a general purpose filesystem. After all, the Ext filesystems get by very well with something not much different. But clearly, even with good initial guesses, we may want to change our minds at some point about the allocation goal for a given inode in response to changing volume congestion patterns. To support adaptive allocation, we will introduce a "goal" inode attribute that overrides the default "inode number equals allocation goal" rule when congestion or other undesirable allocation conditions are detected. For forward compatibility, we will introduce the goal attribute before freezing the Tux3 layout definition, whether or not we are ready to use it.

The rest of this note concerns the new, static allocation heuristics, which I hope will perform fairly well.

Inode number to block address correspondence

The big simplifying idea of the current "static" allocation policy is that the block allocation goal corresponds roughly to the inode number. This accomplishes two things: 1) it gives a stable goal for each inode, so that if a file is rewritten, the new blocks will tend to be allocated fairly close to the original ones, and 2) it makes the ordering of allocated blocks similar to the ordering of the inodes that own them, so a linear walk through the inode table is also a roughly linear walk through the allocated blocks. This should reduce seeking on spinning media. For flash, it should help with erase block behavior.

Imposing the inode number to block address correspondence reduces the volume layout problem to making good choices for inode numbers. However, it is impossible in general to make a perfect inode number choice, because at creation time we do not know very much about how big a file will be, how many files there will be in a directory, or which parts of the volume will be congested or sparsely used in the future. Furthermore, all of these factors can vary through large ranges over time. So we content ourselves with some assumptions: most files will be small, and most directories will have a modest number of files.
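In rough code, with hypothetical names, the rule is simply this (folding is reduced to a plain modulo here; the real power of two scheme appears under "Folding inode numbers to block addresses" below):

    /* Minimal sketch of the static policy: an inode's default block
     * allocation goal is derived from its inode number alone, so the
     * goal stays stable across rewrites and inode table order roughly
     * matches block order.  Folding is reduced to a plain modulo here;
     * the real power of two scheme is described later in this note. */
    typedef unsigned long long u64;

    u64 default_goal(u64 inum, u64 volblocks)
    {
        return inum % volblocks;    /* stands in for the real fold() */
    }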
Some immediate weaknesses are apparent. Two large files with nearby inode numbers will have similar block allocation goals. If these files grow slowly by appending, like log files, it is likely that the two will end up intertwined, potentially causing extra seeking. If there are many large files in the same directory, or a very large number of files in a directory, then there will be congestion in the neighborhood of that directory, and lengthy linear searches might be needed for new allocations. For the time being, we will just accept these weaknesses and consider what to do about them if we observe issues in practice. Eventually, adaptive allocation and other techniques will help.

Block goal extrapolation

For each file write, we extrapolate an allocation goal, which is simply the logical offset of the write plus the inode number, with "folding" as described below. Thus, a write to the same logical address always gives the same physical goal, whether the write is sequential or random.

Tux3 never overwrites data, but always writes new data into free space. How does this "copy-on-write" interact with the goal extrapolation rule? If no free space is available nearby (common when rewriting a large file) then the new blocks end up far away from the original. However, if the file is rewritten again, then chances are the out of place blocks will return to their "natural" position. If a very large file is rewritten so that the write is broken across multiple deltas, the behavior becomes more interesting. The first delta will be allocated far out of line; however, it leaves a gap into which the next delta can be written. After that process covers the entire file, the entire file has moved lower, with just one region allocated far out of line. It seems as though copy-on-write effects should cause only a modest increase in fragmentation, as long as some "slack space" is distributed throughout the volume. The actual effect remains to be determined experimentally. This is one case where the ability to overwrite in place, as Ext4 can, could be an efficiency advantage.

Low level segment allocator

The purpose of allocation policy is only to provide a starting point (goal) for a free space search. The low level segment allocator then provides its own, important semantics: it guarantees to search the entire volume if necessary, so that all free space on the volume can actually be used. The low level allocator searches in two passes, first considering only allocation groups with enough space to satisfy all or a large part of the request, then searching all remaining allocation groups with any free space at all. Thus, there is some fragmentation avoidance built into the low level allocator. The low level allocator uses the count map table to skip over full or nearly full allocation groups efficiently. So if allocation policy does fail to provide a good allocation goal due to congestion, the result may be a long but fairly efficient search. A secondary result is potentially increased fragmentation. Our objective for this new allocation work can therefore be understood entirely in terms of reducing the CPU cost of searching and reducing fragmentation.
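In rough code, with hypothetical names and the count map reduced to a per group free block count ("a large part" is arbitrarily taken as half the request here), the two pass search looks something like this:

    /* Hypothetical sketch of the two pass group search.  groupfree[]
     * stands in for the count map: free blocks per allocation group.
     * Pass 0 considers only groups that could satisfy all or a large
     * part of the request (arbitrarily, half); pass 1 accepts any
     * group with free space at all, so every free block on the volume
     * remains reachable. */
    typedef unsigned long long u64;

    long find_group(const u64 *groupfree, long groups, long goalgroup, u64 want)
    {
        for (int pass = 0; pass < 2; pass++) {
            u64 enough = pass ? 1 : (want + 1) / 2;
            for (long i = 0; i < groups; i++) {
                long group = (goalgroup + i) % groups;
                if (groupfree[group] >= enough)
                    return group;    /* run the extent search in this group */
            }
        }
        return -1;    /* no free space anywhere: volume is full */
    }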
Directory entry position to inode number correspondence

Mass file operations will often be performed in directory entry order, so it is helpful to establish a correspondence between directory entry order and inode number. With linear allocation, this happens naturally when creating files in an empty volume, but breaks down when creating new files in an existing directory, especially if holes exist in the directory where files have been deleted. To improve behavior for such randomly created files, we let the position of a new directory entry guide the choice of inode number. We use a simple extrapolation of the directory entry's logical address, with the inode number of the parent directory as the base. This should produce good results whether the directory is extended by a new entry or a previously deleted entry in the interior of the directory is reused.

There are some limitations. If an inode is hard linked or moved to a new directory, its inode number will be out of line with the others in the destination directory. We also have an unnatural dependence of the inode number choice on the average length of names in a directory: longer names decrease the density of the inode table. This is only a minor annoyance, because Tux3 inodes are variable sized and leaving gaps between them does not cost much. In theory, we could improve this behavior by computing a statistic for average entry name length to use in the extrapolation calculation.

Directory positioning

As for all inodes, the allocation goal for a directory is simply its (possibly folded) inode number, which is also the base position for extrapolating all the inodes contained within the directory. Directories are extrapolated more widely than files, with the intention of leaving space between directories for the files they contain. In a nearly empty volume, we spread directories out more widely still, on the assumption that they will contain relatively more files and subdirectories.

We also try to create directories in relatively empty allocation groups. The threshold for emptiness is determined by the fullness of the volume and by how deep the directory is in the filesystem tree. A directory created at root level in a nearly empty volume will be created only in a completely empty group, if one is available. The threshold is relaxed as the volume becomes more full, and for directories deeper in the tree. The search for a relatively empty group is fairly short. If no relatively empty group is found, then a linear search for a free inode number is performed, starting at the extrapolated directory inode number. For the time being, the directory group search is strictly linear; it is probable that a pseudorandom search using the extrapolated goal as a key would produce better results, a possible future improvement.

Directory extrapolation is clearly a flawed mechanism. It produces congestion when the extrapolated positions of directories with different parents overlap. It also assumes a limited number of files per directory - files beyond that will overflow into the regions of other directories. For the moment, our narrow goal is to improve on the situation where directories are not spread out at all, and therefore maximally vulnerable to age related fragmentation.

Use linear allocation as much as possible

Simple linear allocation turns out to be the best allocation policy in a surprisingly wide variety of cases, and the new patch set does not abandon it entirely. If files are being written in sequence within a delta, then we use linear allocation instead of extrapolating from the inode number and write offset. However, if the linear goal diverges too much from the extrapolated goal, then we switch to the extrapolated goal and continue linearly from there.
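In rough code, with hypothetical names (the "too far" threshold is defined next; the constant below is only a placeholder for it):

    /* Hypothetical sketch: prefer the linear goal, but snap back to
     * the extrapolated goal once the two diverge "too far".  TOO_FAR
     * is a placeholder constant; its intended magnitude is discussed
     * just below. */
    typedef unsigned long long u64;

    #define TOO_FAR 2048    /* blocks; placeholder, roughly cylinder sized */

    u64 choose_goal(u64 linear_goal, u64 extrap_goal)
    {
        u64 gap = linear_goal > extrap_goal ?
            linear_goal - extrap_goal : extrap_goal - linear_goal;
        return gap > TOO_FAR ? extrap_goal : linear_goal;
    }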
We define "too far" as something on the order of a hard disk cylinder, in an effort to avoid introducing gaps of more than a track. (Within a track, gaps and out of order allocation matter relatively little because of the track cache.) The "too far" rule should also help with multitasking loads: if two tasks are writing in different directories simultaneously, it should prevent those writes from mixing their allocated blocks together.

With linear allocation, if we untar a big directory tree, subdirectories and their contents will always be allocated entirely inside their parent directories. This is optimal if we read in the same order, and reasonably good even for breadth first access. However, this arrangement will not age well, because there is no slack space for creating new files close to other files in the same directory. The new patch set departs from this "embedded child" behavior and always creates directories outside their parent directories. This adds two seeks per directory to a depth first traversal. What we hope to gain in return is much better aging behavior, a result that remains to be confirmed experimentally.

Folding inode numbers to block addresses

Both inode numbers and block addresses are 48 bits, so at the theoretical maximum volume size of one exabyte, we have one inode per volume block, which should be adequate. For smaller volumes, we generate a volume address for any inode number by "folding" the inode number to lie within the volume address range. The details of the folding algorithm are somewhat interesting. We take the inode number modulo a volume "wrap", which is the smallest power of two greater than the volume size, and fold again to half the wrap size if necessary, yielding a block address strictly inside the volume.

This algorithm gives an inode number to volume address mapping that behaves well even if the volume is resized. If the volume size is reduced, nearby block allocation goals will still be nearby. If the volume size is increased, then an inode may "unfold" to a new allocation goal, and its blocks will eventually migrate to the new goal when rewritten. Even under many volume size changes, the blocks of a given file will only take on a few different allocation goals, so fragmentation is controlled.
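In rough code, with hypothetical names and volume size measured in blocks:

    /* Hypothetical sketch of folding.  wrap is the smallest power of
     * two greater than the volume size in blocks; an address that
     * falls beyond the end of the volume is folded down by half the
     * wrap, which always lands strictly inside the volume. */
    typedef unsigned long long u64;

    u64 fold(u64 inum, u64 volblocks)
    {
        u64 wrap = 1;
        while (wrap <= volblocks)
            wrap <<= 1;                 /* smallest power of two > volblocks */
        u64 addr = inum & (wrap - 1);   /* inum modulo wrap */
        if (addr >= volblocks)
            addr -= wrap >> 1;          /* fold again, to half the wrap */
        return addr;
    }

For example, with a five block volume the wrap is eight, so inode number 6 first maps to address 6, which lies beyond the volume and folds down to 6 - 4 = 2.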
Summary

In summary, the recent patches introduce the following major changes:

 * Inode number implies allocation goal
 * Extrapolate inode numbers from directory entry position
 * Create new directories in empty groups while the volume is young
 * Low level extent search is now driven by group counts

With these changes, we expect a modest regression in the form of increased fragmentation on some loads for which linear allocation is optimal, in return for a large improvement in the form of reduced fragmentation as a volume ages. How much regression, and how much improvement, remain to be determined experimentally. The current set of heuristics might need to be adjusted and can certainly be improved over time. At some point we expect to move on from static allocation goal determination to adaptive. That said, it is possible that the current set of heuristics will already perform competitively, to the point where allocation issues are no longer a barrier to using Tux3.

Regards,

Daniel