[Tux3] Design note: Log replay
Daniel Phillips
phillips at phunq.net
Sat Jan 17 19:34:01 PST 2009
The best way to understand the various remaining changes that must be
made to the current Tux3 code base to achieve atomic commit is to
consider how log replay works, its purpose, assumptions and data
structures.
For efficiency, Tux3 has the notion of pinned metadata where it does not
write out changed bitmap btree blocks on each delta, but emits log
entries describing the changes, which are part of a committed delta.
This idea is extended to btree blocks and inode table blocks, so that
normally, all metadata changes are logged, and only data blocks are
written out in full. This strategy has a number of desirable effects,
including:
* Minimum number of blocks written for a delta is two (one log block
and the superblock) which will later be reduced to one. The low
per-delta metadata overhead will be good for synchronous write
loads such as sync mounts and NFS.
* Minimizes seeking by allowing metadata updates to be logged near
the associated data, and later moved to their proper position in
the filesystem tree in a bulk transfer.
* On average, reduces total transfer bandwidth by consolidating
multiple small changes into a single update of a metadata block.
* Simplifies the cache layering model to prepare for a commit
pipeline with multiple deltas in flight.
Tux3 commits each delta by writing some set of dirty cache blocks to
disk, and some log blocks to define how to reconstruct any remaining
dirty metadata blocks not flushed to disk. When all those blocks have
landed on disk, the physical location of the log block containing the
delta commit entry is recorded in the Tux3 superblock (later,
indirectly via a "metablock", or implicitly by a forward log entry).
On restart, whether planned or unplanned, Tux3 reloads the log blocks
and replays them to reconstruct any dirty metadata blocks as at the
time the last completed delta was staged.
Each log block contains a number of log entries, aka promises, serially
encoded in big endian order much like inode attributes, and in fact,
using the same encode/decode primitives. There are two kinds of log
entries: physical and logical. Physical entries are for btree blocks
cached in the volume cache and logical entries are for blocks cached
logically in file page cache. Each kind of metadata block that can be
pinned has its own specialized log entries.
Physical:
* Inode table leaf (ileaf)
- insert child
- delete child
- update child
* Data index leaf (dleaf)
- insert child
- delete child
- update child
* Btree index block (bnode)
- insert child
- delete child
- update child
Logical:
* Allocation map
- set/clear bit range
* Xattr Atom table
- change atom use count
Each log entry refers to one or more on-disk blocks. Physical entries
refer to physical block locations such as btree nodes and logical
entries refer to logical blocks, such as bitmap blocks. On replay,
physical cache must be reconstructed to obtain current btrees before it
is possible for map_region to determine the physical locations of
logical blocks. Log replay therefore is done in two passes:
Physical replay:
- Replay all physical entries, ignoring logical entries
- After this, map_region is able to access logical blocks
Logical replay:
- Replay all logical entries, ignoring physical entries
- After this, block allocation and free is possible, Tux3 is
ready to run.
This is efficient, because log blocks are cached in their own mapping,
they do not have to be read from disk on each pass (sb->logmap).
Tux3 has the notion of retiring a pinned metadata block, which means
flushing the block as part of a delta and writing a "retire" entry to
the log. This is done to shorten the log for fast replay, and to avoid
filling cache with pinned metadata. There are physical retire entries
for the volume cache and logical retire entries specific to logically
mapped blocks.
A retire entry implicitly fullfills all prior promises in the log. A
retired promise can be skipped in replay, and in fact must be skipped
because the block it refers to may have been deleted and reallocated to
some other purpose. So before applying each log entry, replay checks
that the target block has not been retired. It is easy to do this
inefficiently with an algorithm order n^2 in length of log, slightly
more work to do it efficiently.
The location of each log block is encoded in the next log block, to form
a reverse chain on disk. The total length of the chain and the
physical address of the delta commit block are encoded in the
superblock. On replay, the log blocks are read into Tux3's logmap
logical mapping in reverse order, which can be considered a design
deficiency because it incurs one synchronous read for each log block,
so log load time is (N * seektime) where seektime can be several
milliseconds. The log chain cannot be very long before this becomes
noticeable. This will be corrected later with the introduction of
metablocks, which also avoid frequent superblock updates to record the
beginning of the log chain. This improvement is an optimization that
does not affect the logic of atomic commit, so will be deferred for the
time being.
Following replay, logging simply continues in the logmap after the point
of the last delta commit.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list