[Tux3] Design note: Log, commit, rebuild

Sun Jan 4 01:16:36 PST 2009

This Christmas, instead of sugar plums dancing through my head, there 
were thoughts of logging and atomic commit.  As I would much prefer the 
plums[1] the best way to put this situation right is to commit some 
code, which has begun.  However, writing things down is also an 
effective way to put those geeky thoughts out of mind, and it helps 
make good code, so here goes.

The atomic commit design has been adjusted a little to take advantage of 
the fact that we are now running stable using ->get_block, the 
traditional Ext2-style block library interface.  As a big fan of 
incremental development, I prefer to add bits and pieces to working 
code and move incrementally in the direction of Ext3-class recovery 
characteristics, rather than embarking on a big rewrite that only 
becomes useful when every last bit of it is in place.

We can call the way Tux3 is running now "immediate mode", because blocks 
are allocated and write transfers start immediately inside sys_write.  
We get this "for free" just by hooking up our tux3_get_block method, 
which Hirofumi did last month.  After pondering a while, I concluded 
that we can make some fairly small additions to turn immediate mode 
writing into a real atomic commit, leaving the much nicer  cache 
layering design for later.

The first of these additions is a logging mechanism, like I have 
described for Tux3 from the beginning, but pared down a little.  We do 
not need forward logging, which is just an optimization, but we do need 
the concept of log rollup, which is not just an optimization, it solves 
some fundamental operational issues.

The first thing we will log is changes to the allocation bitmap.  
Bitmap blocks will not be written to disk at the time blocks are 
allocated in sys_write, sys_create, etc.  Instead, allocated and freed 
extents will be logged, while the bitmap blocks themselves are only 
updated in cache.  From time to time, the cached blocks will be written 
out, fullfilling the "promises" in the log blocks.

This mechanism helps solve two fundamental problems.  The first is 
delayed freeing: when a delta update frees some blocks those blocks 
must not become available for reallocation before the delta commit has 
finished, otherwise some blocks of the previous consistent filesystem 
state may be overwritten, and a crash at that point would leave the 
filesystem with the old state corrupted and the new state not 
completed.  So any blocks freed in a delta are entered into the 
allocation map after the delta has completed.  The problem is, if we 
crash just after the commit, nobody knows about those freed blocks any 
more, they have leaked.  This issue is solved tidily by logging the 
freed blocks (extents) as part of the same delta that frees them.

The second fundamental problem is "allocate in allocate": we flush the 
bitmap to disk, which causes a physical block to be allocated for some 
dirty bitmap buffer, which dirties a different bitmap block, which then 
needs a physical block allocated for it, and so on, potentially for 
many iterations.  This makes it hard to know the upper bound of 
metadata needed to perform a disk update, which in turn makes it hard 
to handle ENOSPC accurately.

Another issue is, an allocation map might be in flight to disk when a 
bit is changed, making the state of the bit on disk unpredictable.   
And furthermore, logging a bitmap allocate record for a bit that is 
already set will cause an error on replay, because replay will check 
that the bitmap bits it sets are not already set.  So in short, this is 
a big mess.

Now I will introduce a technique I call "buffer forking" to solve the 
problem cleanly.  Buffer forking is a form of copy-on-write, in which a 
copy is made of a buffer's underlying data and substituted for the 
original.  Then the buffer contents can be altered freely without 
affecting the original. 

Forking makes flushing the allocation bitmap to disk easy.  We just walk 
all the dirty buffers of the bitmap inode calling our tux3_get_block 
method to do physical allocation.  Any bitmap block that is altered 
during the flush will be forked, and any allocations will be logged.  
Then we walk the same dirty buffers to submit the IO.

The allocate-in-allocate issue comes up because we use dynamically 
allocated bitmaps as opposed to the Ext filesystem family, which has 
them at fixed locations.  However, it isn't too hard to deal with and 
there are big advantages: for one thing, we can implement atomic update 
for bitmap blocks by redirecting them to new locations by the same 
mechanism as any other Tux3 filesystem object.  If they had fixed 
locations we would need a different, update in place method that is 
more code and more writes.

[1]  http://www.christmas-tree.com/stories/nightbeforechristmas.html 
http://earthmother-intheraw.blogspot.com/2008/12/visions-of-sugar-plums.html

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3