[Tux3] Design note: Delta staging

Thu Dec 4 21:42:10 PST 2008

In a previous design note, I described Tux3's frontend/backend cache 
model:

   http://mailman.tux3.org/pipermail/tux3/2008-November/000303.html
   "Deferred Namespace Operations"

The subject line is a little inaccurate - that post is really more about 
Tux3's cache layering, out of which came the conclusion that the 
concept of deferred namespace operations is necessary in order to 
implement Tux3's layered model optimally.  (The above needs a cleanup 
and repost as a design note specifically about the cache model.)

This post is about an important piece of the cache layering model: 
the "delta staging" operation (called "delta setup" in the post above) 
which takes place at each delta transition.  Delta staging takes the 
changes that the user has made to the front end cache and formats them into 
disk blocks, ready to be transferred to disk.  Since these disk blocks 
are all buffered in cache, this is a cache-to-cache operation.

Delta staging does the following:

  - Flushes deferred name creates and deletes to directory blocks (*)

  - Flushes deferred inode creates and deletes to inode table blocks (*)

  - Allocates and assigns physical disk addresses for dirty data blocks
    and dirty directory entry blocks

  - Updates inode table blocks to point to new locations of the above
    blocks

  - Assigns new locations for any modified inode table blocks

  - Assigns new locations for split index blocks (note: but not for
    index blocks that just have a new pointer added; those are handled
    by "promises" and rollup, described earlier.)

  - Allocates and formats one or more log commit blocks to record the
    physical locations of all blocks now contained by the delta,
    including dirty file data, changed directory entry blocks and split
    btree nodes

  - Chooses one of the log commit blocks to be the delta commit block
    and adds delta commit information to it.

Steps marked (*) above are done only if we have deferred namespace 
operations available.  Otherwise, the front end will directly modify 
directory entry and inode table blocks, using a lock against delta 
staging to avoid the situation where the front end wants to modify an 
inode table block that delta staging has already modified but not yet 
written.
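
To make the address assignment and logging steps above a little more 
concrete, here is a minimal sketch in C.  It is not Tux3 code: every 
name in it (stage_delta, struct log_block, the bump allocator, the tiny 
LOG_ENTRIES limit) is a hypothetical stand-in, it skips the deferred 
namespace flush and the inode table pointer updates, and picking the 
last log block as the delta commit block is just an assumption made for 
the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical, not Tux3 identifiers. */
enum block_kind { BLK_DATA, BLK_DIRENT, BLK_ITABLE, BLK_SPLIT_INDEX };

struct dirty_block {
	enum block_kind kind;
	uint64_t physical;	/* 0 means no address assigned yet */
};

#define LOG_ENTRIES 4	/* tiny on purpose, to force a second log block */
#define MAX_LOGS 8

struct log_block {
	uint64_t physical;	/* where the log block itself will be written */
	unsigned count;
	uint64_t entries[LOG_ENTRIES];	/* physical addresses in this delta */
	int is_delta_commit;
};

struct staged_delta {
	struct log_block logs[MAX_LOGS];
	unsigned log_count;
};

/* Stand-in for the real allocator: hand out the next free block. */
static uint64_t alloc_block(uint64_t *next_free)
{
	return (*next_free)++;
}

static struct log_block *new_log_block(struct staged_delta *delta,
				       uint64_t *next_free)
{
	if (delta->log_count == MAX_LOGS)
		return NULL;
	struct log_block *log = &delta->logs[delta->log_count++];
	log->physical = alloc_block(next_free);
	log->count = 0;
	log->is_delta_commit = 0;
	return log;
}

/* Returns 0 on success, -1 if the delta needs more than MAX_LOGS log blocks. */
int stage_delta(struct dirty_block *dirty, size_t n, uint64_t *next_free,
		struct staged_delta *delta)
{
	struct log_block *log = NULL;

	delta->log_count = 0;
	/* Data and dirent blocks first, then inode table, then split index. */
	for (int kind = BLK_DATA; kind <= BLK_SPLIT_INDEX; kind++) {
		for (size_t i = 0; i < n; i++) {
			if ((int)dirty[i].kind != kind)
				continue;
			dirty[i].physical = alloc_block(next_free);
			if (!log || log->count == LOG_ENTRIES) {
				log = new_log_block(delta, next_free);
				if (!log)
					return -1;
			}
			assert(log->count < LOG_ENTRIES);
			log->entries[log->count++] = dirty[i].physical;
		}
	}
	/* Even an empty delta gets one log block to carry the commit. */
	if (!log && !(log = new_log_block(delta, next_free)))
		return -1;
	log->is_delta_commit = 1;	/* assumption: choose the last log block */
	return 0;
}
```

Everything here operates on in-memory structures only, which matches 
the cache-to-cache nature of staging: nothing is submitted to disk 
until staging has finished.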

Note that a delta does not include changed bitmaps or changed index 
blocks, other than split index blocks.  This information can be derived 
from the delta commit blocks at replay time (next mount, whether 
explicit or after unexpected interruption) so there is no need to write 
this to disk.
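
As an illustration of how allocation state could be rebuilt at replay, 
here is a rough sketch in C.  The record layout is an assumption made 
for the example (an extent plus an alloc/free flag), not the actual 
Tux3 log format, and replay_bitmap and its helpers are hypothetical 
names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical log record: an extent that a delta allocated or freed. */
enum log_op { LOG_ALLOC, LOG_FREE };

struct log_record {
	enum log_op op;
	uint64_t start;
	uint64_t count;
};

int bitmap_test(const uint8_t *bitmap, uint64_t bit)
{
	return (bitmap[bit >> 3] >> (bit & 7)) & 1;
}

static void bitmap_assign(uint8_t *bitmap, uint64_t bit, int used)
{
	uint8_t mask = (uint8_t)(1u << (bit & 7));

	if (used)
		bitmap[bit >> 3] |= mask;
	else
		bitmap[bit >> 3] &= (uint8_t)~mask;
}

/*
 * Apply the records in log order to a bitmap covering volume_blocks blocks.
 * Returns 0 on success, -1 if a record falls outside the volume.
 */
int replay_bitmap(uint8_t *bitmap, uint64_t volume_blocks,
		  const struct log_record *log, size_t n)
{
	assert(bitmap && (log || n == 0));
	for (size_t i = 0; i < n; i++) {
		const struct log_record *rec = &log[i];

		/* Overflow-safe range check before touching the bitmap. */
		if (rec->count > volume_blocks ||
		    rec->start > volume_blocks - rec->count)
			return -1;
		for (uint64_t b = rec->start; b < rec->start + rec->count; b++)
			bitmap_assign(bitmap, b, rec->op == LOG_ALLOC);
	}
	return 0;
}
```

The point of the sketch is only that the bitmap is a pure function of 
the logged records, so the bitmap blocks themselves need not be part 
of the delta.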

Delta staging will normally be a quick, cache to cache operation with 
running time measured in microseconds.  However, if it needs to read 
metadata blocks from disk then its running time will sometimes run into 
milliseconds.  With deferred namespace operations, this will cause no 
visible interruption to the front end, provided the disk is not 
backlogged.  Without deferred namespace operations, these latency 
spikes will be visible to the user, which sounds bad, but is actually 
normal behavior for the current generation of Linux filesystems.  With 
deferred namespace operations, we will push the envelope a little in 
Tux3.

When delta staging completes, we have a set of block images ready to 
transfer to disk.  As soon as the previous delta has completed (that 
is, its delta commit block write has completed) all the blocks of the 
new delta are submitted for writeout, except for the delta commit 
block.  When the other blocks have completed writeout, the delta commit 
block is written and transfer of the next delta can begin.
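
The ordering rule can be sketched in C as follows.  This is a toy model 
under stated assumptions, not the Tux3 writeout path: struct disk just 
records the order in which blocks were submitted, wait_for_writes 
pretends all outstanding I/O has completed, and all of the names are 
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_WRITES 64

/* Toy disk: remembers submission order and counts writes in flight. */
struct disk {
	uint64_t order[MAX_WRITES];
	size_t written;
	size_t inflight;
};

struct delta {
	const uint64_t *blocks;	/* every block of the delta except the commit */
	size_t count;
	uint64_t commit_block;
};

static int submit_write(struct disk *disk, uint64_t block)
{
	if (disk->written == MAX_WRITES)
		return -1;
	disk->order[disk->written++] = block;
	disk->inflight++;
	return 0;
}

/* Stand-in for blocking until all submitted writes have completed. */
static void wait_for_writes(struct disk *disk)
{
	disk->inflight = 0;
}

/* Returns 0 on success, -1 if the toy disk runs out of room. */
int write_delta(struct disk *disk, const struct delta *delta)
{
	/* The previous delta's commit block must already be on disk. */
	assert(disk->inflight == 0);

	for (size_t i = 0; i < delta->count; i++)
		if (submit_write(disk, delta->blocks[i]))
			return -1;
	wait_for_writes(disk);	/* wait 1: body before commit block */

	if (submit_write(disk, delta->commit_block))
		return -1;
	wait_for_writes(disk);	/* wait 2: commit before next delta */
	return 0;
}
```

Calling write_delta twice in a row leaves the first delta's commit 
block ahead of every block of the second delta in the submission 
order.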

This is a pretty simple strategy.  In time we will elaborate it 
with measures to eliminate the two synchronous waits at each delta: 1) 
the commit block waits for the other block writeouts to complete; 2) 
the next delta waits for the commit block writeout to complete.  These 
waits can be eliminated with some simple tricks.  Even if we are lazy 
and do nothing about these waits, performance will be respectable, 
because under load the deltas will be quite large.  Ten milliseconds 
spent waiting for a 500 millisecond delta writeout (about 2% overhead) 
will be scarcely noticeable.  But a synchronous write load, like we 
have with NFS, will 
benefit visibly from a smarter commit pipeline strategy.