Design note: Allocation Heuristics

Daniel Phillips d.phillips at partner.samsung.com
Wed Mar 5 23:59:57 PST 2014


Until now, Tux3 has relied on a simple, linear allocation strategy
where we always search for free blocks just after the last one
allocated. This works wonderfully well with an empty volume but is prone
to fragmentation over time as deletes open up holes that tend to be
filled in by unrelated blocks.

A new patch set is now entering testing that attempts to provide good
long term fragmentation resistance by implementing techniques analogous
to those that have worked well for the Ext filesystems over the years.
Similarly, we hope to control fragmentation to the point where the
services of a defragmentation utility are seldom if ever required.

Static versus adaptive allocation policy

I expect that this work will proceed in two major stages, the first of
which is the current patch set. This implements a basic "static"
allocation policy where we establish an allocation goal for each inode
at the time the inode is created, and that goal persists over the life
of the object. Later, we will introduce "adaptive" behavior where inode
allocation goals may change over time in response to observed allocation
patterns.

Hopefully, a simple static allocation policy will already work well
enough for Tux3 to be usable as a general purpose filesystem. After all,
the Ext filesystems get by very well with something not much different.
But clearly, even with good initial guesses, we may want to change our
mind at some point about the allocation goal for a given inode in
response to changing volume congestion patterns.

To support adaptive allocation, we will introduce a "goal" inode
attribute to override the default "inode number equals allocation goal"
rule, to be used in the case that congestion or other undesirable
allocation conditions are detected. For forward compatibility, we will
introduce the goal attribute before freezing the Tux3 layout definition,
whether or not we are ready to use it.
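
As a rough illustration of the intended semantics (the struct, field,
and function below are hypothetical sketches, not the frozen Tux3
on-disk layout):

    #include <stdint.h>

    /* Hypothetical sketch of the proposed "goal" inode attribute; the
     * actual Tux3 on-disk encoding is not yet defined. */
    struct goal_attr {
        uint64_t goal;  /* 48-bit block address used as allocation goal */
    };

    /* When present, the attribute overrides the default "inode number
     * equals allocation goal" rule. */
    static uint64_t base_goal(uint64_t inum, const struct goal_attr *attr)
    {
        return attr ? attr->goal : inum;  /* folding applied separately */
    }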

The rest of this note concerns the new, static allocation heuristics,
which I hope will perform fairly well.

Inode number to block address correspondence

The big simplifying idea for the current "static" allocation policy is
that the block allocation goal corresponds roughly to the inode number. This
accomplishes two things: 1) it gives a stable goal for each inode so
that if rewritten, the new blocks will tend to be allocated fairly close
to the original and 2) it makes the ordering of allocated blocks similar
to the ordering of inodes that own them, so a linear walk through the
inode table will be a roughly linear walk through allocated blocks. This
should reduce seeking on spinning media. For flash, it should help with
erase block behavior.

Imposing the inode number to block address correspondence reduces the
volume layout problem to making good choices for inode numbers. However,
it is impossible in general to make a perfect inode number choice
because at creation time we do not know very much about how big a file
will be, how many files there will be in a directory, or which parts of
the volume will be congested or sparsely used in the future.
Furthermore, all these factors can vary widely over time.
So we content ourselves with some assumptions: most files will be small
and most directories will have a modest number of files.

Some immediate weaknesses are apparent. Two large files with nearby
inode numbers will have similar block allocation goals. If these files
grow slowly by appending, like log files, it is likely that the two
files will end up intertwined, potentially causing extra seeking. If
there are many large files in the same directory, or a very large number
of files in a directory, then there will be congestion in the
neighborhood of that directory and lengthy linear searches might be
needed for new allocations. For the time being, we will just accept
these weaknesses and consider what to do about them if we observe issues
in practice. Eventually, adaptive allocation and other techniques will
help.

Block goal extrapolation

For each file write, we extrapolate an allocation goal which is simply
the logical offset of the write plus the inode number, with "folding" as
described below. Thus, a write to the same logical address always gives
the same physical goal, whether the write is sequential or random.
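
As a minimal sketch in C (names are illustrative, not actual Tux3 code;
fold_goal is the folding routine described near the end of this note):

    #include <stdint.h>

    uint64_t fold_goal(uint64_t inum, uint64_t volblocks); /* see "Folding" below */

    /* Extrapolated allocation goal for a file write: the logical block
     * offset of the write plus the inode number, folded into the volume
     * address range. The same logical address always yields the same
     * physical goal, whether the write is sequential or random. */
    static uint64_t extrapolate_goal(uint64_t inum, uint64_t logical_block,
                                     uint64_t volblocks)
    {
        return fold_goal(inum + logical_block, volblocks);
    }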

Tux3 never overwrites data, but always writes new data into free space.
How does this "copy-on-write" interact with the goal extrapolation rule?
If no free space is available nearby (common for rewrite of a large
file) then the new blocks would end up far away from the original.
However, if the file is rewritten again, then chances are, the out of
place blocks will return to their "natural" position. If a very large
file is rewritten so that the write is broken across multiple deltas,
the behavior becomes more interesting. The first delta will be allocated
far out of line; however, it will leave a gap into which the next delta
can be written. After that process covers the entire file, we would see
that the entire file has moved lower, with just one region allocated far
out of line.

It seems as though copy-on-write effects should cause only a modest
increase in fragmentation, as long as some "slack space" is distributed
throughout the volume. The actual effect remains to be determined
experimentally. This is one case where the ability to overwrite in place
like Ext4 could be an efficiency advantage.

Low level segment allocator

The purpose of allocation policy is only to provide a starting point
(goal) for a free space search. The low level segment allocator then
provides its own, important semantics. It guarantees to search the
entire volume if necessary, to be sure that all free space on the volume
can actually be used.

The low level allocator searches in two passes, first considering only
allocation groups with enough space to satisfy all or a large part of
the request, then searching all remaining allocation groups with any
free space at all. Thus, there is some fragmentation avoidance built
into the low level allocator.
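
The following sketch shows the shape of such a two-pass search (the
group bookkeeping is simplified; real Tux3 drives this from its count
map table, and the names here are illustrative):

    #include <stdint.h>

    struct group { uint64_t free; };  /* free blocks per allocation group */

    /* Two-pass group search: first consider only groups able to satisfy
     * all or a large part of the request, then fall back to any group
     * with free space at all. The search wraps so the entire volume is
     * covered. Returns a group index, or -1 if the volume is full. */
    static int find_group(struct group *groups, int ngroups,
                          int goal_group, uint64_t want)
    {
        for (int pass = 0; pass < 2; pass++) {
            uint64_t need = pass ? 1 : want;  /* relax threshold on pass two */
            for (int i = 0; i < ngroups; i++) {
                int g = (goal_group + i) % ngroups;
                if (groups[g].free >= need)
                    return g;
            }
        }
        return -1;
    }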

The low level allocator uses the count map table to skip over full or
nearly full allocation groups efficiently. So if allocation policy does
fail to provide a good allocation goal due to congestion, the result may
be a long but fairly efficient search. A secondary result is potentially
increased fragmentation. Our objective for this new allocation work can
therefore be understood entirely in terms of reducing the CPU cost of
searching and reducing fragmentation.

Directory entry position to inode number correspondence

Mass file operations will often be performed in directory entry order,
so it is helpful to establish a correspondence between directory
entry order and inode number. With linear allocation, this happens
naturally when creating files in an empty volume, but breaks down when
creating new files in an existing directory, especially if holes exist
in the directory where files have been deleted. To improve behavior for
such randomly created files, we let the position of a new directory
entry guide the choice of inode number. We use a simple extrapolation of
the directory entry logical address, with the inode number of the
parent directory as the base.
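
A rough sketch of this extrapolation, assuming a fixed average directory
entry size (the constant and the +1 offset are hypothetical; the next
paragraph notes why average entry size is only an estimate):

    #include <stdint.h>

    #define AVG_DIRENT_SIZE 24  /* assumed average entry size in bytes */

    /* Extrapolate an inode number from the byte position of a new
     * directory entry, with the parent directory's inode number as the
     * base: entries near the front of the directory map to inode numbers
     * near the parent, later entries proportionally further away. */
    static uint64_t extrapolate_inum(uint64_t parent_inum, uint64_t dirent_pos)
    {
        return parent_inum + 1 + dirent_pos / AVG_DIRENT_SIZE;
    }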

This should produce good results whether the directory is extended by a
new entry or a previously deleted entry in the interior of the directory
is reused. There are some limitations. If an inode is hard linked or
moved to a new directory, its inode number will be out of line with
others in the destination directory. We also have an unnatural
dependence of inode number choice on the average length of names in a
directory: longer names decrease the density of the inode table. This is
only a minor annoyance because Tux3 inodes are variable sized and
leaving gaps between them does not cost much. In theory, we could
improve this behavior by computing a statistic for average entry name
length to use in the extrapolation calculation.

Directory positioning

As for all inodes, the allocation goal for a directory is simply its
(possibly folded) inode number, which is also the base position for
extrapolating all the inodes contained within the directory. Directories
are extrapolated more widely than files with the intention of leaving
space between directories for contained files. In a nearly empty volume,
we spread directories out more widely, on the assumption that they will
contain relatively more files and subdirectories. We also try to create
directories in relatively empty allocation groups. The threshold for
emptiness is determined by the fullness of the volume and how deep the
directory lies in the filesystem tree. A directory created at root level
in a nearly empty volume will be created only in a completely empty
group, if one is available.

The threshold for emptiness is relaxed as the volume becomes more full,
and for directories deeper in the tree. The search for a relatively
empty group is fairly short. If no relatively empty group is found then
a linear search for a free inode number is performed, starting at the
extrapolated directory inode number.
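
One way such a relaxing threshold could be expressed (the constants and
names below are purely illustrative, not the tuned values):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative emptiness test for directory placement: the permitted
     * used fraction of a group grows with volume fullness and with the
     * depth of the new directory. At root level in a nearly empty volume
     * the threshold is zero, demanding a completely empty group. */
    static bool group_empty_enough(uint64_t used, uint64_t group_size,
                                   unsigned volume_full_pct, unsigned depth)
    {
        unsigned threshold_pct = volume_full_pct + 10 * depth;
        if (threshold_pct == 0)
            return used == 0;
        return used * 100 < (uint64_t)threshold_pct * group_size;
    }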

For the time being, the directory group search is strictly linear.
However, a pseudo-random search using the extrapolated goal as a key
would probably produce better results; this is a possible future
improvement.

Directory extrapolation is clearly a flawed mechanism. It produces
congestion when the extrapolated positions of directories with different
parents overlap. It also assumes a limited number of files per
directory: files beyond that limit will overflow into the regions of
other directories.
For the moment, our narrow goal is to improve on the situation where
directories are not spread out at all, and therefore maximally
vulnerable to age related fragmentation.

Use linear allocation as much as possible

Simple linear allocation turns out to be the best allocation policy in
a surprisingly wide variety of cases. The new patch set does not abandon
it entirely. If files are being written in sequence within a delta then
we use linear allocation instead of extrapolating from the inode number
and file size. However, if the linear goal diverges too far from the
extrapolated goal, then we use the extrapolated goal and continue
linearly from there. We define "too far" as something on the order of
hard disk cylinder size in an effort to avoid introducing gaps of more
than a track. (Within a track, gaps and out of order allocation are
relatively less important because of the track cache.)
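
In code form, the rule amounts to something like this (the TOO_FAR
constant is illustrative; as noted above, the real value is on the
order of a hard disk cylinder):

    #include <stdint.h>

    #define TOO_FAR 1024  /* illustrative cylinder-order distance in blocks */

    /* Prefer the linear goal (just after the last allocated block), but
     * if it drifts more than TOO_FAR from the extrapolated goal, jump to
     * the extrapolated goal and continue linearly from there. */
    static uint64_t choose_goal(uint64_t linear, uint64_t extrapolated)
    {
        uint64_t gap = linear > extrapolated ?
            linear - extrapolated : extrapolated - linear;
        return gap > TOO_FAR ? extrapolated : linear;
    }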

The "too far" linear rule should also help with multitasking loads. If
two tasks are writing in different directories simultaneously, this rule
should prevent these writes from mixing their allocated blocks together.

With linear allocation, if we untar a big directory, then subdirectories
and their contents will always be allocated entirely inside of parent
directories. This is optimal if we read in the same order and reasonably
good even for breadth first access. However, this arrangement will not
age well because there is no slack space for creating new files close to
other files in the same directory. The new patch set departs from this
"embedded child" behavior to always create directories outside their
parent directories. This adds two additional seeks per directory to a
depth first traversal. What we hope to gain in return is much better
aging behavior, a result that remains to be confirmed experimentally.

Folding inode numbers to block addresses

Both inode numbers and block addresses are 48 bits, so at a theoretical
maximum size of one exabyte, we have one inode per volume block, which
should be adequate. For smaller volumes, we may generate a volume
address for any inode number by "folding" inode numbers to lie within
the volume address range. The details of the folding algorithm are
somewhat interesting. We take the inode number modulo a volume "wrap",
which is the smallest power of two greater than the volume size, and
fold again to half the wrap size if necessary, yielding a block address
strictly inside the volume. This algorithm gives an inode number to
volume address mapping that behaves well even if the volume is resized.
If volume size is reduced, nearby block allocation goals will still be
nearby. If volume size is increased, then an inode may "unfold" to a new
allocation goal and its blocks will eventually migrate to the new goal
when rewritten. Under many volume size changes, blocks of a given file
will only take on a few different allocation goals, so fragmentation is
controlled.
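
A minimal sketch of the folding calculation, assuming the volume size
is given in blocks (the function name is illustrative):

    #include <stdint.h>

    /* Fold an inode number into the volume block address range: take it
     * modulo the volume "wrap" (the smallest power of two greater than
     * the volume size), then fold once more to half the wrap if the
     * result still lies beyond the end of the volume. The final goal is
     * strictly inside the volume because wrap/2 never exceeds the
     * volume size. */
    uint64_t fold_goal(uint64_t inum, uint64_t volblocks)
    {
        uint64_t wrap = 1;
        while (wrap <= volblocks)
            wrap <<= 1;                    /* smallest power of two > volblocks */
        uint64_t goal = inum & (wrap - 1); /* inum modulo wrap */
        if (goal >= volblocks)
            goal &= (wrap >> 1) - 1;       /* fold again to half the wrap */
        return goal;
    }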

Summary

In summary, recent patches introduce the following major changes:

  * Inode number implies allocation goal
  * Extrapolate inode numbers from directory entry position
  * Create new directories in empty groups when volume is young
  * Low level extent search now driven by group counts

With these changes, we expect a modest regression in terms of increased
fragmentation on some loads for which linear allocation is optimal, in
return for a large improvement in terms of reduced fragmentation as a
volume ages. How much regression, and how much improvement remain to be
determined experimentally.

The current set of heuristics might need to be adjusted and can
certainly be improved over time. At some point we expect to move on from
static allocation goal determination to adaptive. That said, it is
possible that the current set of heuristics will already perform
competitively, to the point where allocation issues are no longer a
barrier to using Tux3.

Regards,

Daniel



