[Tux3] Design note: Offline filesystem check

Fri Dec 12 13:54:54 PST 2008

Here are a few notes on some simple things we can do to get some
filesystem checking going early, while being pretty lazy.

Tux3 has a very simple structure.  The superblock points to two 
different entities:

  - Filesystem tree: a btree of btrees (inode table from which
    data btrees descend)

  - Log blocks that define changes to the filesystem tree
    (format not precisely defined yet)

Tux3 has other, higher level structures, but they are all mapped into 
data files, such as the allocation bitmap, extended attribute atom 
table and directory files.  We can check the internal structure of 
these files too, however first it would be nice to know that the file 
structure itself is consistent, which is what today's proposal is 
about.

The complete and consistent state of the Tux3 filesystem is obtained by 
applying the changes encoded in log blocks to blocks encode on disk, 
leaving the result in block cache (block images indexed by disk 
location).  We can then traverse the entire Tux3 filesystem tree in a 
straightforward way through the block cache, either with our userspace 
code or in kernel.  This idea can form the basis of a rudimentary 
filesystem consistency checker.

The general plan is to walk the filesystem tree via the block cache, 
checking the format of each index block before trusting it, and marking 
off the visited blocks in a bitmap as we go in order to detect multiply 
referenced blocks (also detecting cycles as a side effect).  At the end 
of the traversal, the visited bitmap should match the block allocation 
bitmaps.

This simple idea works because Tux3 does not allow multiple references 
to blocks, which can be seen as a robustness feature.  Tux3 does not 
require multiple references to blocks to implement snapshotting because 
of the versioned extents design, which uses only a single pointer to 
each extent that is inherited by children of the version in which the 
extent was allocated.

When we start using log blocks (pretty soon) then the tree traversal is 
preceded by a "reconstruct" step where we create the filesystem tree 
image partitally in cache by replaying log blocks.  During the 
reconstruct, we mark off log blocks in the visited bitmap.  We don't 
have log blocks as of today, so we just leave out that step.

With 4K blocksize, the size of the visited bitmap is 32 megabytes per 
terabyte:

   2**40 / 2**(12 + 3) = 2**25 bytes = 32 MB

This seems fairly comfortable.

We already have a generic "advance" function in btree.c that can walk 
either the inode table block btree or data file index btree.  (At the 
moment, we use the same index block format for both, but this will not 
always be the case: the advance function will start using a generic 
method of knowing how many children an index block has, retrieving the 
next block pointer, and reading the next block.)  This should form the 
basis of the tree walking.  It does need to be elaborated a little to 
check the format of an index block before starting to read pointers 
from it.  This can be done as a cut and paste for now, and later 
incorporate a hook in the advance function itself, giving us the 
ability to check index blocks during live operation.

Using the "advance" style of coding, it is easy to write the tree walker 
non-recursively, a big help to robustness and clarity.  There will be 
two nested advance loops, one for the inode table and one for data 
btrees.  We do not attempt to check the internal structure of metadata 
files in this pass, because first we want to know that the files are 
structurally sound.

We already have the idea of "leaf check" methods specific to different 
types of btrees.  We can use this to make our tree walk serve dual 
purpose, as either a checker or dumper, by optionally calling 
the "show" method for an index leaf.

After we have a walker for the main filesystem tree running, we start 
worrying about the internal structure of metadata files:

  - Allocation bitmap
  - Version table
  - Atom table
  - Directories

When we have versions, we check the version table at this point.  Then 
another pass through the whole tree can check for valid topology of 
inode attribute and file extent versions.

To detect directories, we walk the inode table again.  Directories are 
versioned files, so some versions might be consistent while other 
versions have problems.  When version support is added, we can be 
simple minded about this and check every version: first walk the 
directory index tree making a list of versions, then mindlessly check 
each version.  This will be pretty gross, but not too long afterwards I 
think the notion of directories as versioned files will be replaced by 
directory entries as versioned attributes, which will save a lot of 
metadata space and make checking more efficient.

After checking the low level structure of each file, we want to check 
directory topology and consistency of inode link counts.  Let's worry 
about that later when we have some sort of low level check running.

To start with, if it finds anything wrong, the walk just stops with a 
report.  Later we will try to keep going, and later still, we will 
attempt some repair.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3