[Tux3] Count: a few thoughts about REALLY outstanding features
phillips at phunq.net
Sun Dec 21 15:15:08 PST 2008

On Saturday 20 December 2008 12:50, Michael Pattrick wrote:
> >> Just another filesystem I am afraid, with one big advance
> >> (versioning) and a number of incremental ones.
> And that's the way to go about it, it may seem easy to tack side
> projects on but scopecreep (ScopeCreep: the devourer of souls and
> government projects) could destroy Tux3. Adding this type of feature
> would increase the complexity of a filesystem with the stated goal of
> having a 'tight' code base. Having a well defined list of reasonable
> features increases the likelihood that a project will be successful,
> adding a feature like this right now - just as Tux3 is preparing for
> its mainline merge - could delay the merge, increase the time needed to
> document, increase code complexity, and possibly introduce new types
> of bugs.
> But that's just my take on it.

That's it all right. To be more specific, by sticking with the rule
that each allocated extent has exactly one pointer to it, we bypass a
whole class of complexity and associated bugs. When versioning is
added using the versioned pointers method (versioned extents, versioned
attributes) we still keep the single pointer per extent model. At that
point, we have snapshotting in a nice flexible form including writeable
snapshots of snapshots, without elaborating the Tux3 structural model
at all. Only the btree leaf block scanning and editing code changes.
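
To make the single pointer per extent idea concrete, here is a minimal sketch in Python. It is not the Tux3 leaf format or the code in version.c; the class name, the dictionary layout and the rule that a version is frozen once it has children are all assumptions made for illustration. Each logical block carries a small set of pointers labelled by version, a lookup walks from the requested version toward the root of the version tree, and no physical block is ever referenced by more than one pointer.

```python
class VersionedLeaf:
    """Toy leaf: each logical block maps to {version: physical block}."""

    def __init__(self):
        self.parent = {0: None}   # version tree; version 0 is the root
        self.entries = {}         # logical block -> {version: physical block}
        self.next_physical = 0    # trivial allocator, hypothetical

    def snapshot(self, base):
        """Create a new writable child version of `base`."""
        version = len(self.parent)
        self.parent[version] = base
        return version

    def write(self, version, logical):
        """Give `version` its own fresh physical block for `logical`."""
        if version in self.parent.values():
            # Assumption of this sketch: versions with children are frozen.
            raise ValueError("version has children and is read-only")
        physical = self.next_physical
        self.next_physical += 1
        self.entries.setdefault(logical, {})[version] = physical
        return physical

    def lookup(self, version, logical):
        """Walk toward the root; the nearest labelled pointer wins."""
        pointers = self.entries.get(logical, {})
        while version is not None:
            if version in pointers:
                return pointers[version]
            version = self.parent[version]
        return None
```

A child version that has not written a block simply inherits the pointer of its nearest ancestor, so a snapshot costs nothing until data diverges, and snapshots of snapshots fall out of the same walk.
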
We will use our user space unit testing strategy to handle the
additional leaf handling complexity, to give us the large number of
development and testing iterations that are necessary to make code of
that nature work really reliably. A large number of unit testing
iterations also helps code settle down to a relatively simple form.
Look in version.c and check out the unit testing there to see what I
mean: it implements a random fuzz tester to beat heavily on corner
cases, trying out millions of combinations in a few seconds and
checking for correctness at every step. Of course, this is no
substitute for thinking deeply about what is going on, but it is a
powerful tool for catching issues that slip through the net of pure
thought.

When we add the additional versioning complexity to the ileaf and dleaf
processing code, we will have another layer of unit testing at the leaf
level.

What this means is that to implement versioning, we combine two well
tested components: our classic single-referenced filesystem design and
versioning logic that stays strictly within the dleaf processing.
We therefore hope that the vast majority of bugs will be caught by Tux3
developers in unit testing and not by users in full-system testing.
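
The random fuzz testing approach mentioned above can be sketched roughly as follows. This is not the tester in version.c; the `Bitmap` class and the `fuzz` function are hypothetical stand-ins, chosen only to show the shape of the technique: drive the code under test and a trivially correct reference model with the same random operations, and check that they agree after every step.

```python
import random


class Bitmap:
    """Tiny block bitmap packed into one integer (the code under test)."""

    def __init__(self, size):
        self.size = size
        self.bits = 0

    def set(self, block):
        self.bits |= 1 << block

    def clear(self, block):
        self.bits &= ~(1 << block)

    def test(self, block):
        return bool((self.bits >> block) & 1)

    def count(self):
        return bin(self.bits).count("1")


def fuzz(seed, steps=10000, size=64):
    """Apply random operations to Bitmap and to a plain set used as the
    reference model, asserting agreement after every single step."""
    rng = random.Random(seed)   # seeded, so any failure can be replayed
    bitmap, model = Bitmap(size), set()
    for step in range(steps):
        block = rng.randrange(size)
        if rng.random() < 0.5:
            bitmap.set(block)
            model.add(block)
        else:
            bitmap.clear(block)
            model.discard(block)
        probe = rng.randrange(size)
        assert bitmap.test(probe) == (probe in model), (seed, step)
        assert bitmap.count() == len(model), (seed, step)
    return len(model)
```

Reporting the seed and step number on failure is the important design choice: a run of thousands of random operations is only useful if the failing sequence can be reproduced exactly.
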
Now, single referencing does not immediately support data de-duplication
and pointer techniques to avoid file copies. But it does support
snapshotting, and should make it easier to do online expand, shrink and
checking reliably. These are the must-have features that are currently
deficient in Linux, and are real impediments for Linux storage. I
respect and admire those developers who are willing to jump in and
tackle those other cool features, but to get where we need to be in the
Linux storage space, our little group needs to stay focussed on
those must-have features.

That said, we will eventually elaborate the Tux3 allocation model to add
an allocation btree as a complement to the bitmap table. I have
written a little bit about this previously. The executive summary is:
for highly fragmented filesystems, bitmap allocation is more efficient
than extents (up to 50 times more space efficient) while for large files
on unfragmented filesystems, extent allocation is much more efficient.
The efficiency equation is compelling enough to justify some extra
complexity in order to switch between them, depending on observed
allocation statistics. The point of this is, when extent allocation
arrives, we can have reference counts on the extents and use that to
implement such things as de-duplication. Future fanciness.
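
The bitmap versus extent tradeoff described above can be illustrated with some simple arithmetic. The sketch below assumes a 64-bit extent record and counts runs of allocated blocks; both are assumptions for illustration, not the Tux3 on-disk format, and the exact ratios depend entirely on the record size chosen.

```python
def runs(flags):
    """Collapse a list of 0/1 flags (1 = allocated) into (start, length) runs."""
    out, start = [], None
    for i, used in enumerate(flags):
        if used and start is None:
            start = i
        elif not used and start is not None:
            out.append((start, i - start))
            start = None
    if start is not None:
        out.append((start, len(flags) - start))
    return out


def cheaper_encoding(flags, record_bits=64):
    """Compare the cost in bits of a bitmap (one bit per block) with an
    extent list (one record per run; record_bits is an assumed size)."""
    bitmap_cost = len(flags)
    extent_cost = len(runs(flags)) * record_bits
    choice = "bitmap" if bitmap_cost <= extent_cost else "extents"
    return choice, bitmap_cost, extent_cost


fragmented = [1, 0] * 512   # every other block allocated
contiguous = [1] * 1024     # one large allocated run
print(cheaper_encoding(fragmented))
print(cheaper_encoding(contiguous))
```

Under these assumptions the fully fragmented case costs 512 extent records against 1024 bitmap bits, while the contiguous case needs a single record, which is the kind of observed statistic that could drive a switch between the two representations.
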
If somebody wanted to work on de-duplication right now, I would
recommend using a per-block reference count table mapped into a file,
like the xattr atom refcounting we already have. This is not the most
efficient reference counting mechanism in the world, but it will work
fine for testing algorithms and proving the worth of the feature.
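
As a starting point for such an experiment, a per-block reference count table might look something like the sketch below. It uses an in-memory `bytearray` as a stand-in for the file the table would be mapped into, and a 16-bit little-endian counter per block; the class name, counter width and method names are assumptions, and this does not reproduce the layout of the existing xattr atom refcounting.

```python
import struct


class RefcountTable:
    """Flat table of per-block reference counts, indexed by block number."""

    ENTRY = struct.Struct("<H")   # assumed 16-bit little-endian counter

    def __init__(self, blocks):
        # Stand-in for a table mapped into a file: one entry per block.
        self.data = bytearray(blocks * self.ENTRY.size)

    def get(self, block):
        return self.ENTRY.unpack_from(self.data, block * self.ENTRY.size)[0]

    def _put(self, block, count):
        self.ENTRY.pack_into(self.data, block * self.ENTRY.size, count)

    def acquire(self, block):
        """Add one reference, e.g. when a duplicate block is detected."""
        count = self.get(block) + 1
        if count > 0xFFFF:
            raise OverflowError("reference count would overflow")
        self._put(block, count)
        return count

    def release(self, block):
        """Drop one reference; returns True when the block becomes free."""
        count = self.get(block)
        if count == 0:
            raise ValueError("block is not referenced")
        self._put(block, count - 1)
        return count == 1
```

Two bytes per block is wasteful compared with counts attached to extents, but a flat table like this is easy to reason about while the de-duplication algorithm itself is still being evaluated.
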