[Tux3] Design note: Log replay

Sat Jan 17 19:34:01 PST 2009

The best way to understand the various remaining changes that must be 
made to the current Tux3 code base to achieve atomic commit is to 
consider how log replay works, its purpose, assumptions and data 
structures.

For efficiency, Tux3 has the notion of pinned metadata where it does not 
write out changed bitmap btree blocks on each delta, but emits log 
entries describing the changes, which are part of a committed delta.  
This idea is extended to btree blocks and inode table blocks, so that 
normally, all metadata changes are logged, and only data blocks are 
written out in full.  This strategy has a number of desirable effects, 
including:

  * Minimum number of blocks written for a delta is two (one log block
    and the superblock) which will later be reduced to one.  The low
    per-delta metadata overhead will be good for synchronous write
    loads such as sync mounts and NFS.

  * Minimizes seeking by allowing metadata updates to be logged near
    the associated data, and later moved to their proper position in
    the filesystem tree in a bulk transfer.

  * On average, reduces total transfer bandwidth by consolidating
    multiple small changes into a single update of a metadata block.

  * Simplifies the cache layering model to prepare for a commit
    pipeline with multiple deltas in flight.

Tux3 commits each delta by writing some set of dirty cache blocks to 
disk, and some log blocks to define how to reconstruct any remaining 
dirty metadata blocks not flushed to disk.  When all those blocks have 
landed on disk, the physical location of the log block containing the 
delta commit entry is recorded in the Tux3 superblock (later, 
indirectly via a "metablock", or implicitly by a forward log entry).  
On restart, whether planned or unplanned, Tux3 reloads the log blocks 
and replays them to reconstruct any dirty metadata blocks as at the 
time the last completed delta was staged.

Each log block contains a number of log entries, aka promises, serially 
encoded in big endian order much like inode attributes, and in fact, 
using the same encode/decode primitives.  There are two kinds of log 
entries: physical and logical.  Physical entries are for btree blocks 
cached in the volume cache and logical entries are for blocks cached 
logically in file page cache.  Each kind of metadata block that can be 
pinned has its own specialized log entries.

Physical:

  * Inode table leaf (ileaf)
     - insert child
     - delete child
     - update child

  * Data index leaf (dleaf)
     - insert child
     - delete child
     - update child

  * Btree index block (bnode)
     - insert child
     - delete child
     - update child

Logical:

  * Allocation map
     - set/clear bit range

  * Xattr Atom table
     - change atom use count

Each log entry refers to one or more on-disk blocks.  Physical entries 
refer to physical block locations such as btree nodes and logical 
entries refer to logical blocks, such as bitmap blocks.  On replay, 
physical cache must be reconstructed to obtain current btrees before it 
is possible for map_region to determine the physical locations of 
logical blocks.  Log replay therefore is done in two passes:

  Physical replay:

    - Replay all physical entries, ignoring logical entries
    - After this, map_region is able to access logical blocks

  Logical replay:

    - Replay all logical entries, ignoring physical entries
    - After this, block allocation and free is possible, Tux3 is
      ready to run.

This is efficient, because log blocks are cached in their own mapping, 
they do not have to be read from disk on each pass (sb->logmap).

Tux3 has the notion of retiring a pinned metadata block, which means 
flushing the block as part of a delta and writing a "retire" entry to 
the log.  This is done to shorten the log for fast replay, and to avoid 
filling cache with pinned metadata.  There are physical retire entries 
for the volume cache and logical retire entries specific to logically 
mapped blocks.

A retire entry implicitly fullfills all prior promises in the log.  A 
retired promise can be skipped in replay, and in fact must be skipped 
because the block it refers to may have been deleted and reallocated to 
some other purpose.  So before applying each log entry, replay checks 
that the target block has not been retired.  It is easy to do this 
inefficiently with an algorithm order n^2 in length of log, slightly 
more work to do it efficiently.

The location of each log block is encoded in the next log block, to form 
a reverse chain on disk.  The total length of the chain and the 
physical address of the delta commit block are encoded in the 
superblock.  On replay, the log blocks are read into Tux3's logmap 
logical mapping in reverse order, which can be considered a design 
deficiency because it incurs one synchronous read for each log block, 
so log load time is (N * seektime) where seektime can be several 
milliseconds.  The log chain cannot be very long before this becomes 
noticeable.  This will be corrected later with the introduction of 
metablocks, which also avoid frequent superblock updates to record the 
beginning of the log chain.  This improvement is an optimization that 
does not affect the logic of atomic commit, so will be deferred for the 
time being.

Following replay, logging simply continues in the logmap after the point 
of the last delta commit.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3