For review: draft Tux3 design notes

Tue Aug 14 12:43:20 PDT 2012

This covers Tux3's cache and atomic commit model, with some omissions, and is meant
as a clear overview. Some details of superblock and metablock handling are described
here possibly for the first time and may be regarded as proposals. Comments and
corrections please.

-----

Cache Model

Tux3 introduces a novel model for the atomic transfer of cache updates to stable storage.
When mounted, a Tux3 filesystem is rooted in cache so that some dirty cache blocks
take precedence over blocks stored on disk. The application's view of the filesystem is
always of that tree rooted in cache. From time to time, Tux3 transfers a "delta" to
disk in such a way that no blocks of the filesystem tree rooted in the filesystem
superblock are ever overwritten until the transfer is complete. The filesystem tree
rooted on disk represents the filesystem state as of the previous delta commit. After
completing each delta, some blocks of the previous delta may become freeable and are
added to the freespace map accordingly, so that they may be reused.
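
As a rough illustration of the cache model above, here is a toy sketch in Python. It is
not Tux3 code: the class and method names are hypothetical, blocks are keyed by identity
rather than physical address, and block allocation, the log and freeing of old blocks
are all omitted. It only shows that reads see dirty cache first and that the committed
state changes in a single step.

```python
class CachedVolume:
    """Toy model: blocks are keyed by identity, not by physical address."""

    def __init__(self, committed):
        self.committed = dict(committed)  # state recoverable from disk
        self.dirty = {}                   # cache blocks not yet in a delta

    def read(self, block):
        # Dirty cache takes precedence over the committed state.
        if block in self.dirty:
            return self.dirty[block]
        return self.committed[block]

    def write(self, block, data):
        # Updates only touch cache; the committed state is left alone.
        self.dirty[block] = data

    def commit_delta(self):
        # Build the new state off to the side, then switch in one step,
        # so an interruption before the switch leaves the old state intact.
        new_state = {**self.committed, **self.dirty}
        self.committed = new_state
        self.dirty = {}
```

Between a write() and the next commit_delta(), read() returns the new data while
the committed state still holds the contents of the previous delta.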

Atomic Commit Model

The main elements of the Tux3 atomic commit model are:

  - Superblock and metablock, which reference the root of the inode table and the
    beginning of the log block chain.

  - Log block chain, which records differences between the filesystem tree rooted
    in the inode table and dirty cache blocks as at the completion of the most
    recent delta commit.

  - A delta, the set of all images of dirty blocks or log block entries needed
    to reconstruct the dirty blocks of a point-in-time snapshot of cache.

  - A rollup procedure, where changes recorded in the log chain against the on-disk
    filesystem tree are written to the filesystem tree, possibly causing new log
    blocks to be added to the current delta.

At any given instant in the lifetime of a mounted filesystem, Tux3 has:

  1) A partially consistent filesystem tree recorded on disk.

  2) A fully consistent filesystem tree rooted in block cache, referencing
     the above partially consistent filesystem tree recorded on disk.

  3) A chain of log blocks recorded on disk sufficient to reconstruct all
     dirty blocks of the above block cache as at the time of the most recent
     delta commit.

A consistent state of a Tux3 filesystem tree is normally the result of applying a
replay operation to the on-disk log, which yields a consistent filesystem tree
rooted in cache. This is true even following an unmount of a modified filesystem.
In other words, Tux3 performs a log replay on every mount, thus exercising the
replay mechanism and providing confidence that it can be expected to function
correctly on restart following an unexpected interruption. This design principle
also improves system shutdown speed in normal operation.
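
The replay-on-every-mount principle can be sketched as follows. This is a toy model in
Python under loud assumptions: the log entry format is not specified in these notes, so
each entry here is simply a hypothetical (block, contents) pair, the chain is given
oldest log block first, and the names replay() and mount() are made up for illustration.

```python
def replay(log_chain):
    """Rebuild the dirty cache blocks described by a chain of log blocks."""
    dirty = {}
    for log_block in log_chain:          # oldest log block first (assumed)
        for block, contents in log_block:
            dirty[block] = contents      # later entries override earlier ones
    return dirty


def mount(disk_tree, log_chain):
    """Return a read function for the filesystem view rooted in cache."""
    dirty = replay(log_chain)            # always runs, even after clean unmount

    def read(block):
        # Replayed dirty blocks take precedence over the on-disk tree.
        if block in dirty:
            return dirty[block]
        return disk_tree[block]

    return read
```

With an empty log chain, replay() returns no dirty blocks and the view is just the
on-disk tree, so a clean mount and a crash recovery mount follow the same path.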

For the time being, all dirty block cache is committed to disk on each delta.
In future this principle may be relaxed to permit some inodes to cache dirty data
across deltas in cases where data safety is judged to be less important.

Superblock and Metablock

The superblock stores invariant global volume data:

  - Volume uuid
  - High precision volume creation time (base for relative time in inode attributes)
  - Block size
  - Total volume blocks
  - Metablock locations

A copy of the superblock is stored at the base of each block group, if the block group
contains any data, otherwise the superblock location at the base of the block group is
filled with zero. Rationale: this supports rapid initialization of large, sparse
volumes while increasing superblock redundancy only as volume fullness increases.
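
A recovery tool could enumerate the surviving superblock copies along these lines. This
is only a sketch in Python: the function name and the group_blocks parameter (blocks per
block group) are assumptions, since these notes do not define the block group geometry,
and a real tool would also validate each candidate rather than trust any nonzero block.

```python
def superblock_copies(volume, block_size, group_blocks):
    """Yield (group index, block) for each nonzero block at a group base."""
    group_bytes = block_size * group_blocks
    zero = bytes(block_size)
    for base in range(0, len(volume), group_bytes):
        block = volume[base:base + block_size]
        # A zero-filled base means the block group holds no data yet.
        if len(block) == block_size and block != zero:
            yield base // group_bytes, block
```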

The main purpose of the metablock is to store the starting location of the log block
chain, which changes on each commit. A secondary purpose is to cache statistics that
could otherwise be obtained from a full volume scan.

At each commit, the metablock is stored into one of a small set of known locations
recorded in the superblock. At mount time, all of the metablock locations are read,
and the one holding the highest sequence number, modulo the sequence number range,
is the current metablock.

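To find the highest metablock sequence number, read all the metablocks, choose any
existing sequence number as a reference, then find the sequence number with the largest
difference from this reference, where the difference is taken modulo the sequence
number range and interpreted as signed. If no sequence number is ahead of the
reference, the reference itself is the highest.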
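
The comparison can be sketched in Python as below. The 16 bit sequence number range is
an assumption made for the example (these notes do not fix a width), and the function
names are hypothetical. The sketch is only valid while all live sequence numbers lie
within less than half the range of each other.

```python
SEQ_RANGE = 1 << 16  # assumed width; not specified in these notes


def signed_distance(seq, ref, seq_range=SEQ_RANGE):
    """Wraparound distance from ref to seq, in [-seq_range/2, seq_range/2)."""
    diff = (seq - ref) % seq_range
    if diff >= seq_range // 2:
        diff -= seq_range            # seq is behind ref, not far ahead of it
    return diff


def newest_sequence(seqs, seq_range=SEQ_RANGE):
    """Pick the highest sequence number, modulo the sequence number range."""
    ref = seqs[0]                    # any existing sequence number will do
    return max(seqs, key=lambda s: signed_distance(s, ref, seq_range))
```

For example, with metablocks holding 65534, 65535, 0 and 1, the counter has wrapped
and newest_sequence() picks 1, whichever of the four is used as the reference.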

As an optimization for rotating media, the metablock is stored into a metablock
location that does not hold the immediately previous metablock and is nearest the most
recent write location, except when a stale metablock sequence number needs to be cleared,
in which case the metablock location with the oldest sequence number is chosen.

Note the lack of metablock redundancy, as opposed to superblocks. The rationale is that
superblock redundancy costs little while greatly improving the chance that global
volume information including volume block size can be recovered from a damaged volume
with certainty. In contrast, metablock redundancy would increase per-commit overhead
while providing extra protection for information that could be recovered from a global
volume scan in any case. (Metablock redundancy could be provided as a future option
simply by permitting duplicate sequence numbers for duplicate metablocks.)

Delta commit

<coming soon>

Log rollup

<coming soon>