Towards basic filesystem checking
Daniel Phillips
phillips at phunq.net
Sun Dec 30 01:18:23 PST 2012
Tux3 needs a basic fsck utility, both as a development tool and to lay the
foundation for a usable offline integrity checking tool. Later we will tackle
more ambitious projects like online integrity checking and repair; right now,
however, we need something very simple. This note describes the beginnings of
an offline fsck command in sufficient detail that an interested volunteer could
implement it with a modest amount of work. This would be a great way to get to
know Tux3's internal design and operation.
Basic checks
The initial goal is to verify that a Tux3 volume is "hermetic", that is, every
allocated block belongs to the filesystem tree or log and all other blocks are
marked as free space. It is convenient, and in some cases necessary, to
perform some other basic checks at the same time:
- All block pointers are less than the volume size
- Each btree leaf passes the existing leaf integrity checks
- Each block is referenced by exactly one pointer, either as a
single metadata block or as part of an extent
- Allocation bitmap bit counts match free block statistics
- Btrees have no cycles
- Each attr of an inode is valid
- All necessary attrs are present for each inode
- Inode attrs are not duplicated
- Data structure of btree nodes and leaves is valid (leaf check methods
already exist)
- Each dirent in a dirent block is valid
- Any inode xattrs are valid and consistent with the xattr atom table
Data structures
We can use a simple data structure that is just a huge array of block status
codes to help with physical integrity checks. Possible format:
struct blockcode {
        unsigned char type:4;   /* enum blocktype, below */
        unsigned char refs:2;   /* reference count: 0, 1, 2 or "many" */
};
and:
enum blocktype {
        free, data, dleaf, dtree, ileaf, itree, dirent, bitmap, log,
        ...
};
The blockmap array must be large enough to map the entire volume. A large,
static array would be a quick way to start:
struct blockcode blockmap[max_volume_size];
This can be efficiently scaled by turning it into a block cache, where each
block has the form:
struct blockcode blockmap[blocksize];
For a later online check, this naturally takes the form of an address_space,
which in the kernel is a radix tree, one per inode. So it would be natural to
use a temporary inode for this, and possibly to do the same in user space.
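To make the cached form concrete, here is a rough sketch of a lookup helper.
Everything named here (blocksize, blockmap_read, bufdata) is a placeholder
for whatever cache primitives we end up using, not an existing interface:

/* Placeholders, not existing Tux3 calls */
extern unsigned blocksize;
extern struct buffer_head *blockmap_read(block_t index);
extern void *bufdata(struct buffer_head *buffer);

/* Find the blockcode entry for a given block number in a cache of
 * blockmap blocks, one status byte per volume block. */
struct blockcode *blockcode_get(block_t block)
{
        unsigned entries = blocksize / sizeof(struct blockcode);
        struct buffer_head *buffer = blockmap_read(block / entries);
        struct blockcode *map = bufdata(buffer);
        return &map[block % entries];
}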
With just two bits for the block refcount, the allowed values might be zero,
one, two and many. A one-bit refcount would actually be enough, because all we
really need to know is whether we have seen a particular block before in the
filesystem tree traversal. However, if multiple references are detected, it
might be nice to have some idea of the incoming count.
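With the two-bit field, the increment must saturate. A minimal sketch,
assuming the struct blockcode layout above:

/* Saturating reference count: 0, 1, 2, then stuck at "many",
 * since two bits cannot count any higher. */
enum { REFS_MANY = 3 };

static void blockcode_ref(struct blockcode *code)
{
        if (code->refs < REFS_MANY)
                code->refs++;
}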
Algorithm
The algorithm for this basic fsck is straightforward. First, replay the log to
reconstruct dirty cache. After this, we only access volume blocks through the
block cache (volmap and inodes) so that dirty metadata blocks take precedence
over possibly stale versions recorded on media, which may not be consistent
with the updated view from cache. This just works naturally with our existing
block cache mechanism.
Next we clear the entire volume blockmap to zero (free), and traverse the
itable and all inodes embedded in it. We may do this at first with a single
pass, or we may find that multiple passes make more sense.
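In outline, the top level might look something like the following, where
replay() stands in for our existing log replay entry point and the check_*
helpers are the hypothetical traversal and cross-check steps described below:

#include <string.h>

int basic_fsck(struct sb *sb)
{
        int err = replay(sb);           /* reconstruct dirty cache */
        if (err)
                return err;
        memset(blockmap, 0, sizeof blockmap);   /* all blocks presumed free */
        err = check_itable(sb);         /* itable traversal, described next */
        if (!err)
                err = check_bitmaps(sb);        /* bitmap cross-check, below */
        return err;
}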
The tree traversal may be iterative instead of recursive, in the interest of
robustness. (Support for iterative traversal exists in our btree library.) In
any case, each time we consider a new subtree node, we check it for basic
correctness before relying on it to access other blocks, including verifying
that each referenced block is marked "free" in the blockmap, that is, it has
not been seen before in this traversal. We should skip any block that has been
seen before; however, we should attempt to continue the traversal, just to see
how much damage exists (or how badly broken our fsck algorithm is!).
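The "seen before" test and the marking combine naturally into one helper.
A sketch, building on the hypothetical helpers above:

/* Mark a block with its type and bump its reference count. Returns
 * nonzero if the block was already seen in this traversal, in which
 * case the caller should skip it but keep checking. */
static int blockcode_mark(block_t block, enum blocktype type)
{
        struct blockcode *code = blockcode_get(block);
        int seen = code->refs != 0;

        if (!seen)
                code->type = type;
        blockcode_ref(code);
        return seen;
}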
At each inode table tree leaf, we should run the existing leaf check method,
then, if that does not detect problems, traverse the leaf itself. For each
inode of an itable leaf, we run basic validity checks on each attribute as
above and traverse the inode's data tree. At each data tree leaf, we mark all
blocks referenced by each extent as "data", or, if the inode is metadata, as
something more specific such as "bitmap", "dirent" or "atoms". (It is not
immediately clear what we might do later with this information, but no doubt
we will think of something.)
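Marking the blocks of an extent is just the single-block mark applied over a
range. A sketch:

/* Mark every block of an extent, counting any blocks already
 * claimed by an earlier reference. */
static unsigned mark_extent(block_t start, unsigned count, enum blocktype type)
{
        unsigned shared = 0;

        for (unsigned i = 0; i < count; i++)
                shared += blockcode_mark(start + i, type);
        return shared;
}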
After completing the itable traversal, we walk the blocks of the bitmap inode
to verify that all blocks marked free in the bitmap are also marked free in
the blockmap, and that all other blocks are marked as some kind of block
referenced by the filesystem tree.
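A sketch of that cross-check, where get_bitmap_bit() and volblocks are
placeholders for reading one allocation bit through the bitmap inode and for
the volume size in blocks:

#include <stdio.h>

/* An allocated block that nothing references is a leak; a referenced
 * block that is free in the bitmap is a more serious inconsistency. */
static int check_bitmaps(struct sb *sb)
{
        int errs = 0;

        for (block_t block = 0; block < volblocks; block++) {
                int allocated = get_bitmap_bit(sb, block);
                int referenced = blockcode_get(block)->refs != 0;

                if (allocated && !referenced) {
                        printf("block %lli leaked\n", (long long)block);
                        errs++;
                }
                if (!allocated && referenced) {
                        printf("block %lli used but free in bitmap\n",
                               (long long)block);
                        errs++;
                }
        }
        return errs;
}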
After this basic checking of physical structure, we will proceed to check
logical consistency, particularly between directories and the inode table. We
need to know that each inode is referenced by some directory exactly as many
times as indicated by its link count attribute, and we need to check that
every dirent references a valid inode. It would also be nice to check that no
directory entries are duplicated, that none contain nulls or slash characters,
and that there are no zero-length entries. We also need to check
that atom counts (xattr tags) match xattr usage in inodes. Algorithms for
these checks are not covered here (later!).
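Although those algorithms are deferred, the shape of the link count check is
simple enough to sketch: accumulate a reference count per inode while scanning
dirents, then compare against each inode's link count attribute. All names
here are hypothetical:

/* nlinks[] is indexed by inode number and zeroed before the scan */
static unsigned *nlinks;

static void found_dirent(inum_t inum)
{
        nlinks[inum]++;         /* one more directory reference */
}

static int check_nlink(inum_t inum, unsigned link_count)
{
        if (nlinks[inum] == link_count)
                return 0;
        printf("inode %lli: link count %u but %u dirents found\n",
               (long long)inum, link_count, nlinks[inum]);
        return -1;
}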
Implementation
Hirofumi's (wonderful!) tux3graph command already implements a filesystem
traversal as described here. We can use tux3graph as a template for an initial
fsck, and perhaps extract a generic filesystem traversal framework from it to
drive either tux3graph or fsck. In any case, tux3 fsck takes the form of a
subcommand of the "tux3" command, just like tux3graph.
Regards,
Daniel