[Tux3] Design note: Offline filesystem check
phillips at phunq.net
Fri Dec 12 13:54:54 PST 2008
Here are a few notes on some simple things we can do to get some
filesystem checking going early, while being pretty lazy.
Tux3 has a very simple structure. The superblock points to two
- Filesystem tree: a btree of btrees (inode table from which
data btrees descend)
- Log blocks that define changes to the filesystem tree
(format not precisely defined yet)
Tux3 has other, higher level structures, but they are all mapped into
data files, such as the allocation bitmap, extended attribute atom
table and directory files. We can check the internal structure of
these files too, however first it would be nice to know that the file
structure itself is consistent, which is what today's proposal is
The complete and consistent state of the Tux3 filesystem is obtained by
applying the changes encoded in log blocks to blocks encode on disk,
leaving the result in block cache (block images indexed by disk
location). We can then traverse the entire Tux3 filesystem tree in a
straightforward way through the block cache, either with our userspace
code or in kernel. This idea can form the basis of a rudimentary
filesystem consistency checker.
The general plan is to walk the filesystem tree via the block cache,
checking the format of each index block before trusting it, and marking
off the visited blocks in a bitmap as we go in order to detect multiply
referenced blocks (also detecting cycles as a side effect). At the end
of the traversal, the visited bitmap should match the block allocation
This simple idea works because Tux3 does not allow multiple references
to blocks, which can be seen as a robustness feature. Tux3 does not
require multiple references to blocks to implement snapshotting because
of the versioned extents design, which uses only a single pointer to
each extent that is inherited by children of the version in which the
extent was allocated.
When we start using log blocks (pretty soon) then the tree traversal is
preceded by a "reconstruct" step where we create the filesystem tree
image partitally in cache by replaying log blocks. During the
reconstruct, we mark off log blocks in the visited bitmap. We don't
have log blocks as of today, so we just leave out that step.
With 4K blocksize, the size of the visited bitmap is 32 megabytes per
2**40 / 2**(12 + 3) = 2**25 bytes = 32 MB
This seems fairly comfortable.
We already have a generic "advance" function in btree.c that can walk
either the inode table block btree or data file index btree. (At the
moment, we use the same index block format for both, but this will not
always be the case: the advance function will start using a generic
method of knowing how many children an index block has, retrieving the
next block pointer, and reading the next block.) This should form the
basis of the tree walking. It does need to be elaborated a little to
check the format of an index block before starting to read pointers
from it. This can be done as a cut and paste for now, and later
incorporate a hook in the advance function itself, giving us the
ability to check index blocks during live operation.
Using the "advance" style of coding, it is easy to write the tree walker
non-recursively, a big help to robustness and clarity. There will be
two nested advance loops, one for the inode table and one for data
btrees. We do not attempt to check the internal structure of metadata
files in this pass, because first we want to know that the files are
We already have the idea of "leaf check" methods specific to different
types of btrees. We can use this to make our tree walk serve dual
purpose, as either a checker or dumper, by optionally calling
the "show" method for an index leaf.
After we have a walker for the main filesystem tree running, we start
worrying about the internal structure of metadata files:
- Allocation bitmap
- Version table
- Atom table
When we have versions, we check the version table at this point. Then
another pass through the whole tree can check for valid topology of
inode attribute and file extent versions.
To detect directories, we walk the inode table again. Directories are
versioned files, so some versions might be consistent while other
versions have problems. When version support is added, we can be
simple minded about this and check every version: first walk the
directory index tree making a list of versions, then mindlessly check
each version. This will be pretty gross, but not too long afterwards I
think the notion of directories as versioned files will be replaced by
directory entries as versioned attributes, which will save a lot of
metadata space and make checking more efficient.
After checking the low level structure of each file, we want to check
directory topology and consistency of inode link counts. Let's worry
about that later when we have some sort of low level check running.
To start with, if it finds anything wrong, the walk just stops with a
report. Later we will try to keep going, and later still, we will
attempt some repair.
Tux3 mailing list
Tux3 at tux3.org
More information about the Tux3