Towards basic filesystem checking
Daniel Phillips
phillips at phunq.net
Sun Dec 30 01:18:23 PST 2012
Tux3 needs a basic fsck utility, both as a development tool and to lay the
foundation for a usable offline integrity checking tool. Later we will tackle
more ambitious projects like online integrity checking and repair; right now,
however, we need something very simple. This note describes the beginnings of
an offline fsck command in sufficient detail that an interested volunteer could
implement it with a modest amount of work. This would be a great way to get to
know Tux3's internal design and operation.
Basic checks
The initial goal is to verify that a Tux3 volume is "hermetic", that is, every
allocated block belongs to the filesystem tree or log and all other blocks are
marked as free space. It is convenient, and in some cases necessary, to
perform some other basic checks at the same time:
- All block pointers are less than the volume size
- Each btree leaf passes the existing leaf integrity checks
- Each block is referenced by exactly one pointer, either as a
single metadata block or as part of an extent
- Allocation bitmap bit counts match free block statistics
- Btrees have no cycles
- Each attr of an inode is valid
- All necessary attrs are present for each inode
- Inode attrs are not duplicated
- Data structure of btree nodes and leaves is valid (leaf check methods
already exist)
- Each dirent in a dirent block is valid
- Any inode xattrs are valid and consistent with the xattr atom table
Data structures
We can use a simple data structure that is just a huge array of block status
codes to help with physical integrity checks. Possible format:
struct blockcode {
        unsigned char type:4;   /* enum blocktype, below */
        unsigned char refs:2;   /* reference count: 0, 1, 2 or "many" */
};
and:
enum blocktype {
        free, data, dleaf, dtree, ileaf, itree, dirent, bitmap, log,
        ...
};
The blockmap array must be large enough to map the entire volume. A large,
static array would be a quick way to start:
struct blockcode blockmap[max_volume_size];
This can be efficiently scaled by turning it into a block cache, where each
block has the form:
struct blockcode blockmap[blocksize];
For a later online check, this naturally takes the form of an address_space,
which in the kernel is a radix tree, one per inode. So it would be natural to
use a temporary inode for this, and possibly to do the same in user space.
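To make the cached form concrete, here is a rough sketch of a lookup helper.
Everything named here (blocksize, blockmap_read, bufdata) is a placeholder
for whatever cache primitives we end up using, not an existing interface:

/* Placeholders, not existing Tux3 calls */
extern unsigned blocksize;
extern struct buffer_head *blockmap_read(block_t index);
extern void *bufdata(struct buffer_head *buffer);

/* Find the blockcode entry for a given block number in a cache of
 * blockmap blocks, one status byte per volume block. */
struct blockcode *blockcode_get(block_t block)
{
        unsigned entries = blocksize / sizeof(struct blockcode);
        struct buffer_head *buffer = blockmap_read(block / entries);
        struct blockcode *map = bufdata(buffer);
        return &map[block % entries];
}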
With just two bits for the block refcount, the allowed values might be zero,
one, two and many. A one-bit refcount would actually be enough, because all we
really need to know is whether we have seen a particular block before in the
filesystem tree traversal. However, if multiple references are detected, it
might be nice to have some idea of the incoming count.
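With the two-bit field, the increment must saturate. A minimal sketch,
assuming the struct blockcode layout above:

/* Saturating reference count: 0, 1, 2, then stuck at "many",
 * since two bits cannot count any higher. */
enum { REFS_MANY = 3 };

static void blockcode_ref(struct blockcode *code)
{
        if (code->refs < REFS_MANY)
                code->refs++;
}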
Algorithm
The algorithm for this basic fsck is straightforward. First, replay the log to
reconstruct dirty cache. After this, we only access volume blocks through the
block cache (volmap and inodes) so that dirty metadata blocks take precedence
over possibly stale versions recorded on media, which may not be consistent
with the updated view from cache. This just works naturally with our existing
block cache mechanism.
Next we clear the entire volume blockmap to zero (free), and traverse the
itable and all inodes embedded in it. We may do this at first with a single
pass, or we may find that multiple passes make more sense.
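In outline, the top level might look something like the following, where
replay() stands in for our existing log replay entry point and the check_*
helpers are the hypothetical traversal and cross-check steps described below:

#include <string.h>

int basic_fsck(struct sb *sb)
{
        int err = replay(sb);           /* reconstruct dirty cache */
        if (err)
                return err;
        memset(blockmap, 0, sizeof blockmap);   /* all blocks presumed free */
        err = check_itable(sb);         /* itable traversal, described next */
        if (!err)
                err = check_bitmaps(sb);        /* bitmap cross-check, below */
        return err;
}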
The tree traversal may be iterative instead of recursive, in the interest of
robustness. (Support for iterative traversal exists in our btree library.) In
any case, each time we consider a new subtree node, we check it for basic
correctness before relying on it to access other blocks, including verifying
that each referenced block is marked "free" in the blockmap, that is, it has
not been seen before in this traversal. We should skip any block that has been
seen before; however, we should attempt to continue the traversal, just to see
how much damage exists (or how badly broken our fsck algorithm is!).
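The "seen before" test and the marking combine naturally into one helper.
A sketch, building on the hypothetical helpers above:

/* Mark a block with its type and bump its reference count. Returns
 * nonzero if the block was already seen in this traversal, in which
 * case the caller should skip it but keep checking. */
static int blockcode_mark(block_t block, enum blocktype type)
{
        struct blockcode *code = blockcode_get(block);
        int seen = code->refs != 0;

        if (!seen)
                code->type = type;
        blockcode_ref(code);
        return seen;
}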
At each inode table tree leaf, we should run the existing leaf check method,
then, if that does not detect problems, traverse the leaf itself. For each
inode of an itable leaf, we run basic validity checks on each attribute as
above and traverse the inode's data tree. At each data tree leaf, we mark all
blocks referenced by each extent as "data", or, if the inode is metadata, as
something more specific such as "bitmap", "dirent" or "atoms". (It is not
immediately clear what we might do later with this information, but no doubt
we will think of something.)
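Marking the blocks of an extent is just the single-block mark applied over a
range. A sketch:

/* Mark every block of an extent, counting any blocks already
 * claimed by an earlier reference. */
static unsigned mark_extent(block_t start, unsigned count, enum blocktype type)
{
        unsigned shared = 0;

        for (unsigned i = 0; i < count; i++)
                shared += blockcode_mark(start + i, type);
        return shared;
}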
After completing the itable traversal, we walk the blocks of the bitmap inode
to verify that all blocks marked free in the bitmap are also marked free in
the blockmap, and that all other blocks are marked as some kind of block
referenced by the filesystem tree.
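A sketch of that cross-check, where get_bitmap_bit() and volblocks are
placeholders for reading one allocation bit through the bitmap inode and for
the volume size in blocks:

#include <stdio.h>

/* An allocated block that nothing references is a leak; a referenced
 * block that is free in the bitmap is a more serious inconsistency. */
static int check_bitmaps(struct sb *sb)
{
        int errs = 0;

        for (block_t block = 0; block < volblocks; block++) {
                int allocated = get_bitmap_bit(sb, block);
                int referenced = blockcode_get(block)->refs != 0;

                if (allocated && !referenced) {
                        printf("block %lli leaked\n", (long long)block);
                        errs++;
                }
                if (!allocated && referenced) {
                        printf("block %lli used but free in bitmap\n",
                               (long long)block);
                        errs++;
                }
        }
        return errs;
}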
After this basic checking of physical structure, we will proceed to check
logical consistency, particularly between directories and the inode table. We
need to know that each inode is referenced by some directory exactly as many
times as indicated by its link count attribute, and we need to check that
every dirent references a valid inode. It would also be nice to check that no
directory entries are duplicated, that none contain nulls or slash characters,
and that there are no zero-length entries. We also need to check
that atom counts (xattr tags) match xattr usage in inodes. Algorithms for
these checks are not covered here (later!).
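Although those algorithms are deferred, the shape of the link count check is
simple enough to sketch: accumulate a reference count per inode while scanning
dirents, then compare against each inode's link count attribute. All names
here are hypothetical:

/* nlinks[] is indexed by inode number and zeroed before the scan */
static unsigned *nlinks;

static void found_dirent(inum_t inum)
{
        nlinks[inum]++;         /* one more directory reference */
}

static int check_nlink(inum_t inum, unsigned link_count)
{
        if (nlinks[inum] == link_count)
                return 0;
        printf("inode %lli: link count %u but %u dirents found\n",
               (long long)inum, link_count, nlinks[inum]);
        return -1;
}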
Implementation
Hirofumi's (wonderful!) tux3graph command already implements a filesystem
traversal as described here. We can use tux3graph as a template for an initial
fsck, and perhaps extract a generic filesystem traversal framework from it to
drive either tux3graph or fsck. In any case, tux3 fsck takes the form of a
subcommand of the "tux3" command, just like tux3graph.
Regards,
Daniel