[Tux3] Thinking about Syncing

Daniel Phillips phillips at phunq.net
Sun Oct 12 04:49:10 PDT 2008


However much we might want to wish them away, spinning disks are far 
from dead as of this writing and are likely to continue to annoy us for 
another decade or more before eventually receding into a largely 
archival role.  But even in such a happy world, we cannot just forget 
about efficient access to our archives, and even though it will tend to 
occur over a longer time scale, we still will need to worry about 
controlling fragmentation as we slowly churn the data in our near-line 
storage archives.  In other words, spinning media with all its issues 
will never actually die, it will just slowly fade away.

This is all by way of saying that Tux3 must work well with rotating 
media if it is to establish a foothold as a general purpose filesystem.  
And also by way of saying that for the next few weeks I expect to spend 
pretty much all my Tux3 design energy wallowing in this one obsession.  
On the bright side, the problems we need to grapple with can get quite 
involving and subtle for those whose interest runs in that direction.

My immediate concern is how to get a massive number of buffers dirtied 
by filesystem activity onto disk quickly and safely.  By "safely" I 
mean ACID:

  A)tomic

     - Either all the disk updates for a given filesystem operation
       become part of the stable disk image or none of them do.

  C)onsistent

     - There is no instant in time when an inconsistent filesystem
       image is recorded on disk.

  I)solated

     - Even though we did not write data to disk in exactly the order
       the filesystem operations occurred, it seems like we did.

  D)urable

     - After fsync of a file, any data written to the file is safe
       from power loss.

By "quickly" I mean something resembling media speed, even with lots of 
small files.  The ideal write pattern will move the disk head linearly, 
progressively depositing a mixture of blocks that have been recently 
dirtied in cache by filesystem operations.  Untarring a 2.6.26.5 kernel 
tree to a Tux3 filesystem will yield the following mix:

  - File data blocks (56%)
  - File index btree root and leaf (43%)
  - Inode table blocks (.3%)
  - Dirent blocks (.2%)
  - Commit blocks (.1%)

The commit blocks make up our "forward log" and logically state how each 
of the other blocks is to be inserted into the filesystem tree, without 
actually performing any insertions.  To prevent log chains from 
becoming so long that a significant amount of time is required to 
reconstruct the cache state on filesystem startup or resume from crash, 
we will perform periodic rollups, where our logical log "promises" are 
crystallized as actual index block updates, performed atomically of 
course.

How often we must perform such rollups is a good question.  I will 
arbitrarily assume we do it once a second in the case of a heavy untar 
load, at which time we need to insert about 50 pointers to inode table 
blocks, 40 pointers to dirent blocks and sometimes a pointer to a new 
bitmap block.  These are all nicely localized, most probably affecting 
one inode table index block, one directory file index block and one 
bitmap file index block.  Together with the one bitmap block we have 
dirtied, that is just four blocks to update.  To update these blocks 
atomically, we write them to a location of our choosing, which could 
possibly be in line with our stream of data blocks, or perhaps more 
optimally for read performance, a few tens of tracks back towards the 
beginning of the disk.  Supposing we choose the latter, we can expect 
to spend 10 ms or so seeking to our chosen location, writing the 
handful of index blocks and a commit block to keep track of them, and 
seeking back to where we will deposit the next second's worth of 
untarred files.  This 10 ms of out of line writing once per second 
(about 1%) will scarcely affect our write overhead at all.
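
Spelling out the arithmetic behind that estimate, with the figures 
assumed above (one rollup per second, roughly 10 ms for the whole 
round trip):

  10 ms per rollup / 1000 ms between rollups  ~=  1% of wall clock time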

The commit block format can be very simple.  For each data block in the 
transaction, including dirent blocks, we log the triple: {data block 
address, btree leaf block address, index offset}.  For each inode table 
block we log the triple: {inode table block address, inode table index 
leaf address, index offset}.  These log entries fit nicely in 16 bytes 
each, allowing nearly 256 per commit block, which defines the maximum 
size of a transaction at this level.  By implication, all the object 
blocks owned by the transaction are newly allocated, so this fact does 
not have to be explicitly recorded.
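
To make that concrete, a 16 byte entry could be packed something like 
the sketch below.  The field widths, names and the spare 16 bits used 
to distinguish the two kinds of triple are only illustrative, not a 
final on-disk format:

  #include <stdint.h>

  /* One log entry: "this new block belongs at this offset in this leaf". */
  struct commit_entry {
      uint64_t block;    /* 48 bit address of the new block, plus 16 bits
                            to say which kind of triple this is */
      uint64_t target;   /* 48 bit address of the btree leaf or inode table
                            leaf, plus the 16 bit index offset within it */
  };                     /* 16 bytes, so nearly 256 fit in a 4K commit block */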

The commit block also records the forward log chain link, which will 
cause no measurable overhead unless we implement a paranoid mode that 
does not rely on checksumming to know when a commit has successfully 
reached disk, in which case we will need some periodic out of line 
seeks to record log chain commits at predetermined locations.  Under 
the heavy untar load being considered here there is no reason to do 
this more than about once a second, and the 12 ms or so of overhead will not 
affect performance a lot.
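
Continuing the sketch above, one plausible commit block layout would be 
the following; the header fields are illustrative guesses suggested by 
the description here, not the final format:

  struct commit_block {
      uint32_t magic;       /* identifies a Tux3 commit block */
      uint32_t count;       /* how many entries are present */
      uint64_t sequence;    /* position in the forward log chain */
      uint64_t chain;       /* address of the previous commit block */
      uint64_t checksum;    /* lets replay detect a commit that never fully
                               reached disk, except in the paranoid mode
                               described above */
      struct commit_entry entries[];   /* the 16 byte entries sketched
                                          earlier, nearly 256 of them */
  };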

Let's take a step back now and consider the marvelous way that we really 
do not have to do much of anything in order to achieve nearly perfect 
linear write behavior for the massive untar case, complete with atomic 
commit.  It may all seem obvious in retrospect, but it took years to 
arrive at this level of simplicity.  The key insights are:

  * We can write data and index blocks anywhere we want.

  * We do not need to write our transaction log at any predetermined
    position.

  * We do not have to write the parent pointer to a block at the time
    we write the block, we just need to log a promise to write the
    parent pointer at some time in the future.
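
That last point is worth a tiny illustration.  In memory, logging a 
promise amounts to remembering three numbers instead of updating the 
parent leaf right away; a toy sketch with invented names, not the real 
Tux3 code:

  #include <stddef.h>
  #include <stdint.h>

  /* A pending promise: "point this leaf slot at this block, eventually". */
  struct promise {
      uint64_t block;     /* newly written block, already safe on disk */
      uint64_t parent;    /* leaf that will eventually point at it */
      uint16_t offset;    /* slot within that leaf */
  };

  /* Remember the parent pointer update instead of performing it.  The
   * pending array is emitted as commit block entries at transaction end
   * and only applied to the real leaf at the next rollup. */
  static size_t log_promise(struct promise *pending, size_t count,
                            uint64_t block, uint64_t parent, uint16_t offset)
  {
      pending[count] = (struct promise){ block, parent, offset };
      return count + 1;
  }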

In the analysis above, we wastefully write out a btree index root and 
leaf for each file, whereas the Tux3 design allows for direct pointers 
to data blocks from the inode table block, for small files.  This 
optimization will be postponed until after the initial kernel port, 
since it is not very interesting from the algorithm point of view.  But 
when we do get around to it, our untar load will take a dramatic turn 
for the better:

  - File data (99%)
  - Inode table blocks (.6%)
  - Dirent blocks (.4%)
  - Commit blocks (.1%)

In other words, we expect to approach media speed closely for this 
particular load, which is likely going to be the first test users will 
try when they get their hands on Tux3.  We will do even better when 
immediate file data support is added, where file data is stored 
directly in the inode table block if possible, which saves space and 
transfer time by packing small files together, thus reducing external 
fragmentation.

In this analysis I have glossed over the possibility of losses due to 
bio setup overhead, driver overhead, cpu latency in the filesystem 
itself and locking.  We do expect such losses, probably in the low 
single digit percentages.  With this in mind I think that we can 
realistically aim at 90%+ of media speed for the untar case, which if 
achieved will be considerably better than is possible for a journalling 
filesystem.

There are far more challenging write loads we also need to deal with, 
such as rewrite of an untar tree.  Ext3, whose write performance is far 
from shabby, takes about 2.3 times longer to rewrite an untarred kernel 
tree than to write it the first time.  Tux3 has the additional 
challenge that if a snapshot has 
been set in the interim it has to find an entirely new place to store 
the kernel untar that hopefully will not cause horrible fragmentation 
if we repeat the process over and over again.  Maybe if we are clever 
in our choice of target location for the rewrite, we can get about the 
same performance as the virgin untar, which would be wonderful.

Then there are the random write case, the slow growing log case, the 
slow growing directory case, completely random writes, and a whole lot 
of other 
nasty write performance optimization puzzles that will rear their heads 
as we progress.  Sigh.  But now we have a pretty good idea how the 
thundering herd of small files case will work, and that is looking very 
good.  Solutions for the other, more subtle cases will be developed 
from the basic model described here.

Regards,

Daniel
