[Tux3] Design note: Log, commit, rebuild
Daniel Phillips
phillips at phunq.net
Sun Jan 4 01:16:36 PST 2009
This Christmas, instead of sugar plums dancing through my head, there
were thoughts of logging and atomic commit. As I would much prefer the
plums[1] the best way to put this situation right is to commit some
code, which has begun. However, writing things down is also an
effective way to put those geeky thoughts out of mind, and it helps
make good code, so here goes.
The atomic commit design has been adjusted a little to take advantage of
the fact that we are now running stable using ->get_block, the
traditional Ext2-style block library interface. As a big fan of
incremental development, I prefer to add bits and pieces to working
code and move incrementally in the direction of Ext3-class recovery
characteristics, rather than embarking on a big rewrite that only
becomes useful when every last bit of it is in place.
We can call the way Tux3 is running now "immediate mode", because blocks
are allocated and write transfers start immediately inside sys_write.
We get this "for free" just by hooking up our tux3_get_block method,
which Hirofumi did last month. After pondering a while, I concluded
that we can make some fairly small additions to turn immediate mode
writing into a real atomic commit, leaving the much nicer cache
layering design for later.
The first of these additions is a logging mechanism, like I have
described for Tux3 from the beginning, but pared down a little. We do
not need forward logging, which is just an optimization, but we do need
the concept of log rollup, which is not just an optimization, it solves
some fundamental operational issues.
The first thing we will log is changes to the allocation bitmap.
Bitmap blocks will not be written to disk at the time blocks are
allocated in sys_write, sys_create, etc. Instead, allocated and freed
extents will be logged, while the bitmap blocks themselves are only
updated in cache. From time to time, the cached blocks will be written
out, fullfilling the "promises" in the log blocks.
This mechanism helps solve two fundamental problems. The first is
delayed freeing: when a delta update frees some blocks those blocks
must not become available for reallocation before the delta commit has
finished, otherwise some blocks of the previous consistent filesystem
state may be overwritten, and a crash at that point would leave the
filesystem with the old state corrupted and the new state not
completed. So any blocks freed in a delta are entered into the
allocation map after the delta has completed. The problem is, if we
crash just after the commit, nobody knows about those freed blocks any
more, they have leaked. This issue is solved tidily by logging the
freed blocks (extents) as part of the same delta that frees them.
The second fundamental problem is "allocate in allocate": we flush the
bitmap to disk, which causes a physical block to be allocated for some
dirty bitmap buffer, which dirties a different bitmap block, which then
needs a physical block allocated for it, and so on, potentially for
many iterations. This makes it hard to know the upper bound of
metadata needed to perform a disk update, which in turn makes it hard
to handle ENOSPC accurately.
Another issue is, an allocation map might be in flight to disk when a
bit is changed, making the state of the bit on disk unpredictable.
And furthermore, logging a bitmap allocate record for a bit that is
already set will cause an error on replay, because replay will check
that the bitmap bits it sets are not already set. So in short, this is
a big mess.
Now I will introduce a technique I call "buffer forking" to solve the
problem cleanly. Buffer forking is a form of copy-on-write, in which a
copy is made of a buffer's underlying data and substituted for the
original. Then the buffer contents can be altered freely without
affecting the original.
Forking makes flushing the allocation bitmap to disk easy. We just walk
all the dirty buffers of the bitmap inode calling our tux3_get_block
method to do physical allocation. Any bitmap block that is altered
during the flush will be forked, and any allocations will be logged.
Then we walk the same dirty buffers to submit the IO.
The allocate-in-allocate issue comes up because we use dynamically
allocated bitmaps as opposed to the Ext filesystem family, which has
them at fixed locations. However, it isn't too hard to deal with and
there are big advantages: for one thing, we can implement atomic update
for bitmap blocks by redirecting them to new locations by the same
mechanism as any other Tux3 filesystem object. If they had fixed
locations we would need a different, update in place method that is
more code and more writes.
[1] http://www.christmas-tree.com/stories/nightbeforechristmas.html
http://earthmother-intheraw.blogspot.com/2008/12/visions-of-sugar-plums.html
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list