Fsck Revisited

Daniel Phillips daniel.raymond.phillips at gmail.com
Sun Jan 27 00:21:15 PST 2013


The story so far

Our prototype fsck uses a filesystem tree walking framework that evolved
from Hirofumi's "tux3graph" graphical filesystem structure
generator. At this point, the most interesting thing our shiny new fsck does
is to verify bitmap allocation integrity:

  Verify unique block references and consistent allocation maps

   * Each block marked as used is referenced exactly once by the filesystem
     tree (possibly as part of an extent) and each block not referenced is
     marked free.

The algorithm for this is simple. Given a "shadow" bitmap file the same size
as the allocation bitmap file, initialized to zero, walk the entire
filesystem tree marking each referenced block as used in the shadow bitmap.
Then compare the allocation and shadow bitmap files: they should be equal. (A
few global blocks must be specially marked.)

This algorithm is easily extended to repair inconsistent bitmaps. Simply copy
any blocks that differ from the shadow to the allocation bitmap. This can be
done in cache, then the above check may be repeated to verify success, prior
to committing repaired bitmap blocks to disk. Note: the log may need to be
adjusted to remove pending bitmap updates. Alternatively, log entries may be
added rather than flushing bitmap blocks, creating a state consistent with a
future rollup.

This is actually a respectable part of what we need in a finished fsck. The
largest remaining issues involve directories:

  Verify directory entry referential integrity

   * Each directory entry must reference a valid inode and each valid inode
     must either be referenced by at least one directory entry or be in the
     orphan table or have an orphan entry in the log.

   * The number of directory entries referencing a given inode must be
     consistent with the inode link count attribute.

   * The type of each directory entry must be consistent with the type
     attribute of the referenced inode.

Algorithm

For now we can do this with a simple table, the "shadow inode number" table,
with one byte per inode number, initially zero. It is impractical to map the
entire 48-bit inode number range in this way, so we will also need to enforce
a limit on the range of allocated inode numbers. For the time being, that can
be the smallest binary power greater than the volume size in blocks, which I
call the volume "wrap size". (Volume wrap size will be important in other
contexts as well, volume resizing in particular.)

Each inode shadow map byte has two fields: a type field (3 bits) and a count
field (5 bits). The count field is too small to represent the maximum
possible link count, which in theory is the same as the number of inodes. For
practical purposes, we will therefore limit link count to less than 2**32.
Large link counts are stored in a "large" shadow inode map file, 32 bits per
inode. Large link counts will be rare, so this large map will be sparse and
(except for pathological cases) small in terms of total blocks, even though
each entry in it might force an entire block to be allocated.

The algorithm has two passes. Pass one walks all directories in inode table
order. For each directory entry, the inode shadow map byte corresponding to
the directory entry inode number is updated. If the map byte is zero, the
inode number has not been seen before, so the map byte is set to a count of
one and the map byte type field is set to the type of the directory entry.
Otherwise, the map byte type field is checked to ensure that it corresponds to
the directory entry type, and the count field is incremented if less than its
maximum value. If the count field is at its maximum value, then the
corresponding field in the large shadow inum count map is incremented.

Pass two walks all inodes in the inode table, checking that the inode type
attribute matches the inode shadow map type and that the inode link count
attribute matches the shadow map count. If the shadow map count is at its
maximum value then the sum of the shadow map count and the large shadow map
count is used.

Pass one of this algorithm can be combined with the existing fsck inode table
walk. Pass two must be a second inode table walk. It is possible that we will
only ever need two complete inode table walks, unless errors are detected and
we decide to attempt repair.

If this algorithm completes without detecting errors then we have proved an
important component of referential integrity at the namespace level. However
there are other checks on directories we might wish to perform. We should
check that the name of each directory entry is unique. (Versioning will
increase the complexity of this test.) If the directory is indexed, we should
verify referential integrity of the index.

Directory Connectivity

We need to check that directory parent links form a completely connected,
acyclic (inverted) tree and that all directories belong to this tree. For this,
we may wish to build the complete tree in memory in pass one. After pass two,
we walk the complete "shadow" tree starting from the root inode, and set each
corresponding shadow inode map byte to zero for each tree node. If some shadow
inode map byte is already zero, we have detected a cycle. After walking the
shadow tree, we check that the shadow inode number map is completely zero. If
not, then some directory subtrees are disconnected.

Repair

Here are a few ideas for types of repair we can do, ranging from practical and
immediately doable to rampant flights of fancy.

Perhaps the most straightforward repair is that of allocation bitmaps. There is not
much left to the imagination: after we have discovered all properly allocated
blocks and all plausible lost subtrees, each block is either referenced from
the filesystem tree or it is not. We may simply copy any blocks that differ
from the shadow bitmap to the allocation bitmap.

If the type given in some dirent does not match the type of the referenced
inode then we must guess which one is wrong (and they both may be). Examining
the contents of the file data tree may help: if it consists entirely of
properly tagged directory entry blocks then the inode most probably is a
directory.

If we have many lost inodes then we could search the entire volume for lost
blocks tagged as directory entry blocks. These may lead us back to a missing
directory file, or perhaps we can reconstruct a directory in its entirety from
lost directory entry blocks.

More repair ideas coming later. Thanks for reading.

Regards,

Daniel


