[Tux3] Spatial correlation between directory entries, inodes and file data

Daniel Phillips phillips at phunq.net
Wed Aug 27 03:19:32 PDT 2008

Here is a short note to introduce some of my thinking on an important
subject: the extent to which Tux3 can achieve good spatial correlation
between directory entries, inodes and file data is the extent to which
it will be able to go head to head with other filesystems designed to
work with rotating media.

Good spatial correlation is always hard for a general purpose file
system.  Two issues make it especially challenging for Tux3:

  1) Atomic commit.  Sometimes we want to update metadata by writing
     it in a different place than the old version.  If we do that to
     numerous metadata blocks that formerly were well correlated, the
     results after the smoke clears may not be good and may get worse
     each time this process is repeated.

  2) Versioning.  We cannot overwrite data or metadata that belongs to
     any version other than the current version.  So we need to find a
     new place to write changes to such versioned data, which may not
     be well correlated with the original version, part of which may
     still be inherited by the current version.  So we may end up
     having to seek to a number of different places to read the data,
     and we have a similar problem with metadata, as described above.

If this sounds scary, it should.  If it does not, then you need to read
it again carefully and imagine just how bad things can get with a
simple-minded allocation scheme that just hopes for the best.  In short,
one can and does see situations where a 6 ms seek is needed to access
each block of data.  That is six thousand times slower than if the
blocks could all be read linearly.  You read it correctly: nearly four
orders of magnitude of performance can be lost due to poor spatial
correlation of disk data.

Against this formidable problem we have a couple of factors in our
favor: we can allocate data and metadata anywhere we want, and we can
fairly easily move blocks around later as necessary to deal with
situations where, no matter how hard we tried, we failed to place data
on disk in such a way that it can be accessed by common disk operations
with a reasonable number of seeks.  The latter is nothing more than a
fallback: it costs extra bandwidth, and relying on after-the-fact block
relocation may cause the disk to act in an annoying way: continuing to
chatter away to itself long after the user thinks disk activity should
have ended.  This might be particularly annoying with a laptop, whose
owner may want the disk to spin down not long after write activity
ends.  It would be quite disconcerting if it sometimes spun down and
sometimes did not.

So let us consider some techniques for laying down disk data with a
high probability of good spatial correlation, even after the data has
"aged" for some time due to continuous updating mixed with creation and
deletion of snapshots (versions) and, for good measure, with the disk
nearly full, which is the normal operating state for most disks I know
of (though, to be sure, snapshotting is still new for most users, even
me).

To start unravelling this hairball of a problem, we start at the
directory.  Clearly, we want the file data of all files in a given
directory to be well spatially correlated, since anybody can come
along and grep those files, for example, and expect the job to be done
in time proportional to the total data in the directory.  We also need
the blocks of the directory to be close to each other in case they are
all needed at once, which is common, and close to the file data so that
a pattern of opening plus reading a file works well, which is also
common.  And we need the inode table blocks near the directory blocks
because we need to read an inode almost every time we read a dirent.

So we start off by doing a very simple thing: we establish a mapping
between inode numbers and data blocks, and make that a very simple
relationship: the inode number is the data block goal.  We have 2**48
inode numbers and 2**48 data blocks (at most) so crudely speaking, we
have one inode number per data block.  (I will expand later on how we
may get further advantage by mapping these somewhat fanciful limits
onto the sizes of volumes that people actually have.)
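To make the idea concrete, here is a minimal sketch of that mapping.
The names (block_goal_from_inum, volblocks) are invented for
illustration and are not actual Tux3 code; the wrap by the volume size
stands in for the mapping of the fanciful 2**48 limits onto real
volumes, which I will expand on later.

```c
#include <stdint.h>

#define ADDR_BITS 48
#define ADDR_MASK ((1ULL << ADDR_BITS) - 1)

typedef uint64_t inum_t;
typedef uint64_t block_t;

/* One inode number per data block, crudely speaking: the allocation
 * goal for a file's data is simply its inode number, wrapped into the
 * range of blocks the volume actually has. */
static inline block_t block_goal_from_inum(inum_t inum, block_t volblocks)
{
    return (inum & ADDR_MASK) % volblocks;
}
```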

We actually work backwards from the data block addresses to the inode
number.  When it is time to choose an inode number for a new file, we
consider the "center of gravity" of the directory the new file is to
be created in.  Taking the average of the addresses of all the blocks
in the directory is one crude way to come up with this number, but we
can do better than that both in terms of efficiency of computation and
effectiveness of spatial correlation.  Let us just assume that we have
such a center of gravity.  We could randomly pick one for each new
directory, for example, and store it as an attribute in the directory

We use the center of gravity not only for allocating blocks to files,
but for expanding the directory and allocating new inode table blocks
as well.  However, we do not want all three types of blocks mixed
together randomly, so we also incorporate into the allocation goal a
notion of position of a file within the directory, and position of a
block within a file.
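The center-of-gravity and goal computation above might be sketched as
follows.  The structure fields, the running-average update, and the
simple additive bias are all illustrative assumptions, not the actual
Tux3 scheme; the point is only that the goal starts from the
directory's center of gravity and is offset by the file's position in
the directory and the block's position in the file.

```c
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical per-directory state: a stored center of gravity plus
 * the number of block addresses folded into it. */
struct dirinfo {
    block_t cog;       /* center of gravity of the directory's blocks */
    uint64_t nblocks;  /* addresses averaged into cog so far */
};

/* Fold one more block address into the running average, which is
 * cheaper than rescanning the whole directory each time. */
static void cog_update(struct dirinfo *d, block_t addr)
{
    d->cog = (d->cog * d->nblocks + addr) / (d->nblocks + 1);
    d->nblocks++;
}

/* Allocation goal for block `pos` of the file at dirent position
 * `entry`: start from the center of gravity, then bias by the file's
 * position within the directory and the block's position within the
 * file, so directory blocks, inode table blocks and file data are not
 * mixed together randomly. */
static block_t alloc_goal(const struct dirinfo *d, uint64_t entry,
                          uint64_t pos)
{
    return d->cog + entry + pos;
}
```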

Files come in all different sizes and in general we cannot possibly
know in advance the size of a file.  We do know that there are a lot
of big files that grow slowly, and we must take special care that data
blocks for such files do not end up woven together, as is the tendency
with naive allocation schemes.

We may not be able to predict the size of a file, but we can certainly
tell after the fact.  When a small amount of data is appended to the
end of a big file, we can reasonably expect that more data will be
appended to the same file.  We should therefore avoid allocating
metadata or data for other files in the space immediately above the
final block of a large file that has historically grown slowly.  Let
us call that the "principle of purity", and that is enough for today.
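A rough sketch of how an allocator might honor the principle of
purity: when a goal would land just above the tail of a large,
historically slow-growing file, push it past a reserved gap.  The
threshold and gap sizes here are arbitrary illustrative numbers, and
the function names are invented for this sketch.

```c
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t block_t;

#define BIG_FILE_BLOCKS 4096  /* treat files beyond this size as big */
#define PURITY_GAP      1024  /* blocks kept clear above a big tail */

/* Would allocating at `goal` land in the reserved region immediately
 * above the last block `tail` of a file of `filesize` blocks? */
static bool violates_purity(block_t goal, block_t tail, uint64_t filesize)
{
    if (filesize < BIG_FILE_BLOCKS)
        return false;
    return goal > tail && goal <= tail + PURITY_GAP;
}

/* Move a goal past the reserved gap when it would violate purity, so
 * future appends to the big file can stay contiguous. */
static block_t purify_goal(block_t goal, block_t tail, uint64_t filesize)
{
    if (violates_purity(goal, tail, filesize))
        return tail + PURITY_GAP + 1;
    return goal;
}
```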

In the next installment of what should become a respectable treatment
of this subject, I will introduce the concept of generating functions
for allocation.


