Design note: Allocation Heuristics

Daniel Phillips d.phillips at partner.samsung.com
Wed Mar 5 23:47:59 PST 2014


Until now, Tux3 has relied on a simple, linear allocation strategy where we
always search for free blocks just after the last one allocated. This works
wonderfully well with an empty volume but is prone to fragmentation over
time, as deletes open up holes that tend to be filled in by unrelated
blocks.

A new patch set is now entering testing that attempts to provide good long
term fragmentation resistance by implementing techniques analogous to those
that have worked well for the Ext filesystems over the years. Similarly, we
hope to control fragmentation to the point where the services of a
defragmentation utility are seldom if ever required.

Static versus adaptive allocation policy

I expect that this work will proceed in two major stages, the first of
which is the current patch set. This implements a basic "static" allocation
policy where we establish an allocation goal for each inode at the time the
inode is created, and that goal persists over the life of the object.
Later, we will introduce "adaptive" behavior where inode allocation goals
may change over time in response to observed allocation patterns.

Hopefully, a simple static allocation policy will already work well enough
for Tux3 to be usable as a general purpose filesystem. After all, the Ext
filesystems get by very well with something not much different. But
clearly, even with good initial guesses, we may want to change our mind at
some point about the allocation goal for a given inode in response to
changing volume congestion patterns.

To support adaptive allocation, we will introduce a "goal" inode attribute
to override the default "inode number equals allocation goal" rule, to be
used in the case that congestion or other undesirable allocation conditions
are detected. For forward compatibility, we will introduce the goal
attribute before freezing the Tux3 layout definition, whether or not we are
ready to use it.

The rest of this note concerns the new, static allocation heuristics, which
I hope will perform fairly well.

Inode number to block address correspondence

The big simplifying idea for the current "static" allocation policy is that
the block allocation goal corresponds roughly to the inode number. This
accomplishes two things: 1) it gives a stable goal for each inode, so that
if a file is rewritten, the new blocks will tend to be allocated fairly
close to the original, and 2) it makes the ordering of allocated blocks
similar to the ordering of the inodes that own them, so a linear walk
through the inode table will be a roughly linear walk through allocated
blocks. This should reduce seeking on spinning media. For flash, it should
help with erase block behavior.

Imposing the inode number to block address correspondence reduces the
volume layout problem to making good choices for inode numbers. However, it
is impossible in general to make a perfect inode number choice, because at
creation time we do not know very much about how big a file will be, how
many files there will be in a directory, or which parts of the volume will
be congested or sparsely used in the future. Furthermore, all these factors
can vary through large ranges over time. So we content ourselves with some
assumptions: most files will be small and most directories will have a
modest number of files.

Some immediate weaknesses are apparent. Two large files with nearby inode
numbers will have similar block allocation goals. If these files grow
slowly by appending, like log files, it is likely that the two files will
end up intertwined, potentially causing extra seeking. If there are many
large files in the same directory, or a very large number of files in a
directory, then there will be congestion in the neighborhood of that
directory, and lengthy linear searches might be needed for new allocations.
For the time being, we will just accept these weaknesses and consider what
to do about them if we observe issues in practice. Eventually, adaptive
allocation and other techniques will help.

Block goal extrapolation

For each file write, we extrapolate an allocation goal which is simply
the logical offset of the write plus the inode number, with "folding" as
described below. Thus, a write to the same logical address always gives the
same physical goal, whether the write is sequential or random.
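
As a concrete illustration, here is a minimal sketch in C of the
extrapolation step, under the assumption that logical offsets are counted
in blocks. The names write_goal and fold_goal are hypothetical, not the
actual Tux3 functions; folding itself is sketched later in this note.

    #include <stdint.h>

    typedef uint64_t block_t;

    /* Folding into the volume address range is sketched later. */
    block_t fold_goal(uint64_t num, block_t volsize);

    /*
     * A write's goal is its logical block offset plus the inode
     * number, folded into the volume. The same logical address
     * therefore always extrapolates to the same physical goal,
     * whether the write is sequential or random.
     */
    static block_t write_goal(uint64_t inum, block_t logical, block_t volsize)
    {
            return fold_goal(inum + logical, volsize);
    }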

Tux3 never overwrites data, but always writes new data into free space. How
does this "copy-on-write" behavior interact with the goal extrapolation
rule? If no free space is available nearby (common for a rewrite of a large
file), then the new blocks would end up far away from the original.
However, if the file is rewritten again, then chances are the out of place
blocks will return to their "natural" position. If a very large file is
rewritten so that the write is broken across multiple deltas, the behavior
becomes more interesting. The first delta will be allocated far out of
line, but it will leave a gap into which the next delta can be written.
After that process covers the entire file, we would see that the entire
file has moved lower, with just one region allocated far out of line.

It seems as though copy-on-write effects should cause only a modest
increase in fragmentation, as long as some "slack space" is distributed
throughout the volume. The actual effect remains to be determined
experimentally. This is one case where the ability to overwrite in place,
as Ext4 does, could be an efficiency advantage.

Low level segment allocator

The purpose of allocation policy is only to provide a starting point (goal)
for a free space search. The low level segment allocator then provides its
own, important semantics: it guarantees to search the entire volume if
necessary, to be sure that all free space on the volume can actually be
used.

The low level allocator searches in two passes, first considering only
allocation groups with enough space to satisfy all or a large part of the
request, then searching all remaining allocation groups with any free space
at all. Thus, there is some fragmentation avoidance built into the low
level allocator.
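
A rough sketch of the two pass search in C. The pass thresholds and the
helper try_alloc_in_group are assumptions for illustration; "a large part"
of the request is arbitrarily taken as half here.

    #include <stdint.h>
    #include <stdbool.h>

    typedef uint64_t block_t;

    struct group { uint32_t free; };    /* per-group free block count */

    /* Hypothetical per-group probe; true means blocks were found. */
    bool try_alloc_in_group(uint32_t group, uint32_t want, block_t *found);

    static bool alloc_segment(struct group *groups, uint32_t ngroups,
                              uint32_t start, uint32_t want, block_t *found)
    {
            for (int pass = 0; pass < 2; pass++) {
                    /* pass 0: only groups that can satisfy a large part
                     * of the request; pass 1: any group with free space */
                    uint32_t need = pass ? 1 : (want + 1) / 2;
                    for (uint32_t i = 0; i < ngroups; i++) {
                            uint32_t g = (start + i) % ngroups;
                            if (groups[g].free < need)
                                    continue;  /* skip cheaply by free count */
                            if (try_alloc_in_group(g, want, found))
                                    return true;
                    }
            }
            return false;
    }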

The low level allocator uses the count map table to skip over full or
nearly full allocation groups efficiently. So if allocation policy does
fail to provide a good allocation goal due to congestion, the result may be
a long but fairly efficient search. A secondary result is potentially
increased fragmentation. Our objective for this new allocation work can
therefore be understood entirely in terms of reducing the CPU cost of
searching and reducing fragmentation.

Directory entry position to inode number correspondence

Mass file operations will often be performed in directory entry order, so
it is helpful to establish a correspondence between directory entry order
and inode number. With linear allocation, this happens naturally when
creating files in an empty volume, but breaks down when creating new files
in an existing directory, especially if holes exist in the directory where
files have been deleted. To improve behavior for such randomly created
files, we let the position of a new directory entry guide the choice of
inode number, using a simple extrapolation of the directory entry's logical
address with the inode number of the parent directory as the base.
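
The extrapolation might look something like the following C sketch. The
scale factor standing in for the average entry size is an assumption, as
is the function name; the real patch set may compute this differently.

    #include <stdint.h>

    typedef uint64_t inum_t;

    /* Assumed average directory entry size, in bytes. */
    #define ENTRY_BYTES_PER_INODE 32

    /*
     * Guess an inode number for a new file from the byte position
     * of its directory entry, relative to the parent directory's
     * inode number: entries further into the directory map to
     * inode numbers further past the parent's.
     */
    static inum_t extrapolate_inum(inum_t parent, uint64_t entry_pos)
    {
            return parent + 1 + entry_pos / ENTRY_BYTES_PER_INODE;
    }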

This should produce good results whether the directory is extended by a new
entry or a previously deleted entry in the interior of the directory is
reused. There are some limitations. If an inode is hard linked or moved to
a new directory, its inode number will be out of line with others in the
destination directory. We also have an unnatural dependence of inode number
choice on the average length of names in a directory: longer names decrease
the density of the inode table. This is only a minor annoyance because Tux3
inodes are variable sized and leaving gaps between them does not cost much.
In theory, we could improve this behavior by computing a statistic for
average entry name length to use in the extrapolation calculation.

Directory positioning

As for all inodes, the allocation goal for a directory is simply its
(possibly folded) inode number, which is also the base position for
extrapolating all the inodes contained within the directory. Directories
are extrapolated more widely than files, with the intention of leaving
space between directories for contained files. In a nearly empty volume, we
spread directories out more widely, on the assumption that they will
contain relatively more files and subdirectories. We also try to create
directories in relatively empty allocation groups. The threshold for
emptiness is determined by the fullness of the volume and how deep the
directory is in the filesystem tree. A directory created at root level in a
nearly empty volume will be created only in a completely empty group, if
one is available.

The threshold for emptiness is relaxed as the volume becomes more full, and
for directories deeper in the tree. The search for a relatively empty group
is fairly short. If no relatively empty group is found, then a linear
search for a free inode number is performed, starting at the extrapolated
directory inode number.
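
Only the shape of the test is described above, so the following C sketch
is a guess at how such a threshold might be computed; the weighting of
volume fullness against directory depth is entirely an assumption.

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * A group is empty enough for a new directory when its used
     * fraction falls below a threshold that starts near zero for a
     * root level directory in an empty volume and relaxes with
     * volume fullness and directory depth. At root level in a
     * nearly empty volume, only a completely empty group passes.
     */
    static bool group_empty_enough(uint64_t group_used, uint64_t group_size,
                                   unsigned vol_used_pct, unsigned depth)
    {
            unsigned threshold_pct = vol_used_pct + 10 * depth;  /* assumed */

            if (threshold_pct > 100)
                    threshold_pct = 100;
            return group_used * 100 <= group_size * threshold_pct;
    }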

For the time being, the directory group search is strictly linear; however,
it is probable that a pseudo random search using the extrapolated goal as a
key would produce better results. This is a possible future improvement.

Directory extrapolation is clearly a flawed mechanism. It produces
congestion when the extrapolated positions of directories with different
parents overlap. It also assumes a limited number of files per directory;
files beyond that will overflow into the regions of other directories. For
the moment, our narrow goal is to improve on the situation where
directories are not spread out at all, and therefore maximally vulnerable
to age related fragmentation.

Use linear allocation as much as possible

Simple linear allocation turns out to be the best allocation policy in a
surprisingly wide variety of cases, and the new patch set does not abandon
it entirely. If files are being written in sequence within a delta, we use
linear allocation instead of extrapolating from the inode number and file
size. However, if the linear goal diverges too far from the extrapolated
goal, we use the extrapolated goal and continue linearly from there. We
define "too far" as something on the order of hard disk cylinder size, in
an effort to avoid introducing gaps of more than a track. (Within a track,
gaps and out of order allocation are relatively less important because of
the track cache.)
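
In C, the decision might reduce to something like this sketch; the cutoff
constant is a placeholder for "on the order of hard disk cylinder size"
and is not the value used in the patch set.

    #include <stdint.h>

    typedef uint64_t block_t;

    /* Placeholder cutoff: roughly a cylinder, in filesystem blocks. */
    #define TOO_FAR 2048

    /*
     * Prefer the linear goal (just past the last allocation) unless
     * it has drifted too far from the extrapolated goal, in which
     * case fall back to the extrapolated goal and continue linearly
     * from there.
     */
    static block_t choose_goal(block_t linear, block_t extrapolated)
    {
            block_t gap = linear > extrapolated ?
                    linear - extrapolated : extrapolated - linear;

            return gap <= TOO_FAR ? linear : extrapolated;
    }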

The "too far" linear rule should also help with multitasking loads. If two
tasks are writing in different directories simultaneously, this rule should
prevent these writes from mixing their allocated blocks together.

With linear allocation, if we untar a big directory, then subdirectories
and their contents will always be allocated entirely inside their parent
directories. This is optimal if we read in the same order, and reasonably
good even for breadth first access. However, this arrangement will not age
well, because there is no slack space for creating new files close to other
files in the same directory. The new patch set departs from this "embedded
child" behavior to always create directories outside their parent
directories. This adds two additional seeks per directory to a depth first
traversal. What we hope to gain in return is much better aging behavior, a
result that remains to be confirmed experimentally.

Folding inode numbers to block addresses

Both inode numbers and block addresses are 48 bits, so at the theoretical
maximum volume size of one exabyte, we have one inode per volume block,
which should be adequate. For smaller volumes, we may generate a volume
address for any inode number by "folding" inode numbers to lie within the
volume address range. The details of the folding algorithm are somewhat
interesting. We take the inode number modulo a volume "wrap", which is the
smallest power of two greater than the volume size, and fold again to half
the wrap size if necessary, yielding a block address strictly inside the
volume. This algorithm gives an inode number to volume address mapping that
behaves well even if the volume is resized. If volume size is reduced,
nearby block allocation goals will still be nearby. If volume size is
increased, then an inode may "unfold" to a new allocation goal, and its
blocks will eventually migrate to the new goal when rewritten. Under many
volume size changes, blocks of a given file will take on only a few
different allocation goals, so fragmentation is controlled.
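
Here is one possible reading of the folding algorithm as a C sketch, where
"fold again to half the wrap size" is interpreted as subtracting half the
wrap; the function names are illustrative.

    #include <stdint.h>
    #include <assert.h>

    typedef uint64_t block_t;

    /* Smallest power of two strictly greater than the volume size. */
    static block_t wrap_size(block_t volsize)
    {
            block_t wrap = 1;

            while (wrap <= volsize)
                    wrap <<= 1;
            return wrap;
    }

    /* Fold an inode number to a block address inside the volume. */
    static block_t fold_goal(uint64_t inum, block_t volsize)
    {
            block_t wrap = wrap_size(volsize);
            block_t goal = inum & (wrap - 1);   /* inum mod wrap */

            if (goal >= volsize)
                    goal -= wrap >> 1;          /* fold to half the wrap */
            assert(goal < volsize);
            return goal;
    }

For example, on a six block volume the wrap is eight: inode numbers 0
through 5 map to themselves, while 6 and 7 fold down to 2 and 3.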

Summary

In summary, recent patches introduce the following major changes:

   * Inode number implies allocation goal
   * Extrapolate inode numbers from directory entry position
   * Create new directories in empty groups when volume is young
   * Low level extent search now driven by group counts

With these changes, we expect a modest regression in terms of increased
fragmentation on some loads for which linear allocation is optimal, in
return for a large improvement in terms of reduced fragmentation as a
volume ages. How much regression and how much improvement remain to be
determined experimentally. The current set of heuristics might need to be
adjusted and can certainly be improved over time. At some point we expect
to move on from static allocation goal determination to adaptive. That
said, it is possible that the current set of heuristics will already
perform competitively, to the point where allocation issues are no longer a
barrier to using Tux3.

Regards,

Daniel



