daniel.raymond.phillips at gmail.com
Mon Jan 21 14:24:19 PST 2013
Here, I have jotted down some obvious facts about allocation strategy and
do not go too deeply into what we will actually implement for Tux3. Partly
because our thinking is still evolving, and partly because I want to keep this
note short. And admittedly, it ended up far from short in spite of my
considerable efforts to contain it. So please bear with me, and whoever
makes it all the way to the end, please let me know.
* Favor algorithms that optimize for spinning media in a way that
also helps solid state media or at least does not harm it.
* Perfect is the enemy of good enough. We want to improve our initial
layout and aging behavior to the point where Tux3 moves to the middle
of the pack, but not spend a lot of time ust now attempting to be the
very best. However, in principle there is no reason why Tux3 cannot
do layout at least as well as any existing filesystem,
Let us define the goodness of a given disk layout in terms of how many seeks
are needed to read all the files of some directory tree. For this purpose, a
seek is defined as any gap at all between the last block of a region that we
read into cache synchronously, and the first block of the next.
This definition of seeking does not quite match reality because a modern disk
will cache read data in units of entire tracks, so that a small gap between
read targets introduces only a small probability of head movement, increasing
as the gap becomes larger. This fact will help us in algorithm design, however
for the moment we will content ourselves with this theoretical definition.
The ideal layout is one where each block read into cache by a recursive
traversal of all the data in a directory tree immediately follows the
previously read block. Assuming that the top level directory inode is already
cached, we proceed to read the first block of the directory. This requires
reading some btree metadata blocks, typically one tree node and one data
extent leaf (dleaf) in this case, followed by reading the directory data
block itself. So the ideal layout places these first three blocks in the
order: node, leaf, data.
The directory block contains entries that reference inodes. For the ideal
layout, we require that the storage order of directory entries corresponds
exactly to the referenced inode numbers. Now we need to read the inode of the
first directory entry in the block. That requires reading an inode table
block (ileaf) into cache, which in turn typically requires reading one more
btree node. So the next two blocks in our ideal layout should be a inode table
index block and the inode table block that contains the directory entry's
inode. The ileaf should be followed immediately by a data tree index node and
a dleaf, then all the data blocks of the file. If the file has so many data
extents that more dleaf blocks are needed, these will be interleaved with file
data extents so that each is encountered during recursive traversal exactly
Continuing the traversal, more directory entry blocks and inode table blocks
are needed from time to time, and as above, the ideal layout places these and
their associated index blocks in position to be read just when needed, with no
It would be wonderful if we could always lay out data on disk in exactly this
ideal pattern, however practical issues get in the way. We are allowed to go
back and rewrite file data at any point, and even if the size is unchanged, we
will not be able to write the new data at the original location because that
would violate Tux3's atomic commit guarantee: the new data must be durably
recorded on media before any old data may be overwritten. So file rewrites
always move data blocks to new locations, and usually move a few metadata
blocks as well. So much for our ideal layout aspirations.
Deletes also leave gaps in our ideal layout, consisting of free data blocks,
freed directory records, and gaps in the inode table, in addition to
relocating a few metadata blocks. This is called "aging".
Our challenge is to define algorithms that tend to maintain a layout similar
in some sense to our ideal layout, as a volume ages. I will now briefly
describe some of the tools I have identified to help us implement some
initial approximations of production quality layout algorithms.
Our primary tool is the "allocation goal", that is, the region of the volume
we will search for free space to store some metadata block or data extent. In
many cases, the best place to search is immediately after the last place we
found. In fact, the ideal layout above can be produced by simple sequential
allocation, provided that each free space search is performed in the order the
blocks or extents will be needed during read traversal.
A nice line of attack is to arrange the steps of our disk update algorithms
appropriately, so that a simple, sequentially allocation goal produces the
desired ideal layout, or an approximation of it diluted mainly by skipping
over previously allocated blocks. To support this, we maintain a "goal"
variable, which points to the next block after the last one allocated.
Pure sequential allocation is generally satisfactory only for initial writes
into an empty region. For rewrites, we must establish a new allocation goal or
we will surely suffer from undesirable behavior where rewritten data migrates
upward endlessly, leaving any former neighbors far away. This suggests a
seemingly workable general strategy: allocate sequentially unless we detect a
good reason to jump the allocation goal to some new position. Hopefully, we
will allocate sequentially most of the time, which would be both efficient and
nearly ideal in many cases.
Three Way Interleaving
In an ideal layout, three different kinds of filesystem object are woven
together: directory blocks, inode table blocks and data blocks, each with
their associated index metadata. It is therefore desirable that inode
allocation order, data block allocation order, and directory entry order
all correspond well, and that associated blocks all lie close to each other.
This is just another way of saying that layout should be close to ideal,
but expressed in terms of the main objects involved, each of which tends to
be handled by its own constellation of code that we now must tie together
in some reasonable way.
The main tool I propose for tying inode allocation order to block allocation
order is simply to assume some fixed, linear distribution of inodes
throughout a volume. For now, we assume one inode per block, Under real world
loads, the actual inode density could easily be less than that, when files
are large. If we implement some optimization to store small files entirely in
the inode table, we could conceivably see more than one inode per volume
block. We do not need to worry much about the former case, because it is easy
to leave gaps in the inode table to limit the inode population. In the
latter case, we might simply allocate some inode numbers away from their ideal
positions, or we can use a technique I call "folding", which I will describe
in a future post.
Inode number corresponds to block allocation goal
In brief, I propose that layout generally be driven by the inode number of
the containing directory, which in turn corresponds to the allocation goal of
the first directory block of the parent directory (and associated index
blocks). The inode number goal of each regular file in the directory is
chosen by linear interpolation based on the offset of the file from the
beginning of the directory. The block allocation goal for each regular file
is just its inode number. In this way, three kinds of filesystem object are
tied together in a rough correspondence.
The accuracy of this correspondence depends on several crude assumptions, the
most significant being that the inode number of the directory is chosen
appropriately. This must necessarily be a guess, which will sometimes be
very wrong, perhaps placing a large directory in a region with insufficient
space. To deal with this and similar issues, we introduce the notion
of allocation "bounces". If our first choice based on the above relationships
fails due to lack of space, we bounce the search to some new place according
to a repeatable formula, so that neighbors that also bounce will tend to
bounce to nearby places.
Bouncing can continue over potentially many iterations if each successive
bounce happens to land in a densely populated region. In the limit, the bounce
algorithm must cover the entire volume, so that a volume may be filled
entirely, at the cost of increased fragmentation as the probability of
bouncing increases. As we bounce, we may decide to satisfy an extent
allocation in multiple fragments, in preference to allocating all of it
further away. The details of such heuristics are less important than the fact
that bouncing guarantees to satisfy any allocation request, provided
that sufficient free space remains on the volume.
Excessive bouncing may become an issue if the initial choice of directory
inode number is unlucky. In that case, we may override the default goal by
adding a goal attribute to the directory inode, to retarget all allocations
within that directory to some region with sufficient free space. New files
added to the directory will be located at the new goal, while rewrites of
existing files will eventually relocate some of the older files to the updated
goal as well.
We might contemplate implementing automatic migration of existing files in the
event an allocation goal is overridden, however this will need to be done with
care, because the additional IO traffic could well do more harm than good.
Goal setting by interpolation based on inode number breaks down in the
presence of subdirectories, which may be arbitrarily large. While storing an
entire subdirectory complete with all its file data and its own
subdirectories inline within the parent directory is necessary to obtain an
ideal layout, we do not lose much be relaxing that requirement and "bumping"
each subdirectory to some region outside its parent directory. On recursive
read, this only introduces two new seeks, one to go to the subdirectory and
one to return to the parent.
Considering the cases: the subdirectory may have many files, in which case the
two additional seeks do not cost much in comparison the traversal, or the
subdirectory has few files, in which case the total read load is not likely to
be very heavy. The pathological case, many tiny directories, might not be
worth worrying about, because in the real world the purpose of creating a
directory is usually to store more than one thing there. Another way of
putting it is, we may be willing to sacrifice a few seeks in the name of
simple layout algorithms. Over time, we may find there is something to be
gained by relaxing this bumping behavior and storing entire subdirectories
inline when convenient. For now, I propose the concept of "fully bumped"
directory layout, with details of how to choose appropriate bumping goals left
as an open question.
In order to interpolate allocation goals effectively and efficiently, we need
to make some assumptions about the average size of directory entries and
files. For this, we may just pick some arbitrary plausible values, such as 20
bytes for an average directory entry and 32K for an average file. So, for any
given new directory entry, we estimate its inode number goal by dividing its
offset within the directory by the average directory entry size, and adding
to the inode number of the directory, giving both an inode number and a block
allocation goal, according to the proposed one-to-one correspondence between
inode number and block number.
Such assumptions might prove to be wildly wrong according to actual activity
on the filesystem. We might contemplate a "learning" algorithm, where we
iteratively improve our estimates according to observed behavior, or we
might simply rely on our bouncing algorithm to resolve collisions, hopefully
degrading layout efficiency only slightly.
Inode Number Clustering
At present, Tux3 employs a simple inode table leaf format that does not index
each inode individually, but instead indexes the leaf block as a whole, with
contained inodes laid out linearly from there. The indexing space thus saved
in the leaf is considerable, perhaps 8-10% reduction in total inode leaf size.
However, this introduces a bias in favor of dense inode number packing, in
the absence of which it is possible for each inode table block to end up with
just one inode in it, or about 2% leaf fullness, which would be a problem. To
combat this, we introduce a technique called inode number clustering, where
inode allocation goals falling within a given, fixed size region, are biased
towards the lower end of that region, helping to control leaf fragmentation.
When it comes to filesystem integrity checking, inode number clustering has
important benefits beyond simply supporting a simple leaf layout scheme and
reducing average leaf size. To verify that every inode has a correct link
count, we can walk all the directories, incrementing link counts for all
referenced inodes, then walk the inode table checking that each link count
matches its respective inode link count attribute. A large table is required
to record the counts. If allocated inodes tend to be densely packed with
large intervening gaps then a radix tree would be a very efficient structure
for this. Otherwise, we might need to consider some less efficient structure
such as a fully associative btree, at the cost of considerably larger memory
With its nondestructive atomic update model, Tux3 faces new and largely
unexplored challenges in terms of optimizing disk layout. Because data and
metadata must migrate to new locations on every rewrite, Tux3 could prove
considerably more sensitive to aging related issues than Ext4, for example.
However, with careful attention to detail, it is likely that such issues can
be controlled to the extent that Tux3 layouts are competitive with Ext4
more often than not. On the positive side, Tux3 is able to place inode table
blocks near their related directory entry and file data blocks, which could
go a long way towards evening the odds.
It is possible that by careful provisioning of free space near allocated
items, the layout degradation cost due to constant repositioning may prove
negligible. (Note that Ext4 also tends to provision a significant amount of
free space, in anticipation of future directory growth.)
If it turns out to matter, Tux3 supports a per file data overwrite mode,
relaxing its normally stringent file data atomicity guarantee. Though so far
we have been unable to detect any performance improvement from that, it is
possible that our test coverage has simply not been sufficient. We may find
later that such a default mode for file data improves aging behavior without
unduly impacting crash consistency. Or ideally, we may even find that aging is
not typically degraded, so that Tux3's stronger than usual integrity
guarantees are gained essentially for free.
 Many layout optimizations for spinning disk either also help solid state
layout or do not harm it. For example, minimizing read seeks also tends to
improve block packing of related data, which is "good" sharing, the opposite
of false sharing. This reduces write multiplication in the SSD flash
translation layer by reducing the number of long lived disk sectors that would
otherwise need to be copied to new locations prior to block erasure.
 A potential reason that Tux3 might not be able to achieve the same
performance level as other hypothetical filesystems under certain loads is
that it does not embed inodes in directories. To ameliorate this, Tux3
requires a relatively more complex allocation strategy to interleave
directory and inode table blocks efficiently. In some cases it might be
impossible to avoid extra disk reads, because each name lookup in a directory
entry block must be followed by an inode lookup in an inode table block. In
the case of an isolated, random name lookup with cold cache, an extra read
is clearly required, though perhaps not an extra seek due to disk track
caching. This can fairly be regarded as a rare case, and measurably slower
only if the random read load is massive. Against this, there are certain
advantages to a separate inode table, including a potentially smaller cache
footprint for some fsck steps, and higher performance for pure namespace mass
operations such as find and ls -R, which benefit from a densely packed
namespace with no inode data intrusions.
More information about the Tux3