[Tux3] Four kinds of inode data attributes

Sat Aug 9 00:05:13 PDT 2008

Here is a mini design note on the Tux3 inode attribute design,
specifically data attributes.  A tux3 inode is a variable sized
object that is a list of versioned attributes.  Data attributes
carry file and extended attribute data.  There are four kinds of
data attributes, representing increasingly larger data objects:

1) Immediate data (up to ~4 KB)
  * Data stored literally in inode
    - short file
    - symlink
    - version link

2) Direct data (up to ~128 GB)
  * Data attribute contains extents pointing at data blocks
     - 16 bit extent count
     - Data for a single version
     - No sparse files

3) Btree leaf data (up to ~48GB)
  * Data attribute points at one or more btree leaves
     - 6 bit extent count
     - Data for multiple versions
     - sparse files

4) Indexed btree data (up to 1 EB)
  * Data attribute contains root block and levels of a btree
     - 6 bit extent count
     - Data for multiple versions
     - sparse files

This scheme is remotely analogous to the direct/indirect/double/triple
data indexing scheme of Ext3 and traditional Unix, however the latter
is really a radix tree and as such is unable to handle variable sized
data at the leaves.  (A radix tree contains no keys, and depends on
being able to follow index pointers by directly calculating the
position of an index pointer at each level of the tree based on the
numeric value of the target key.)  A btree also handles sparse data
better than a radix tree, which at worst may need one index node for
each sparse data block.

At the moment only the fourth, fully general case is implemented, which
stores each data attribute as a full btree with at minimum one index
node and one leaf containing extents pointing at data blocks.  So a one
byte file would need three blocks: node, leaf, data.  A very wasteful
way to store small files, but at this point we are more interested in
correctness than space efficiency.  Over time the other three data
attribute types will be implemented with appropriate transitions
between them, and in the end I think a one byte file will only need
about 64 bytes including all standard Unix file attributes, which will
be about as compact as it gets for a full featured filesystem.

The space optimization strategy is:

  * If a file is smaller than an inode table leaf block less the block
    overhead and the space used by other attributes of the same inode,
    it is packed into the block as an immediate data attribute.

  * Otherwise, if the file has fewer than some number of extents,
    the extents are packed into the block as a direct data attribute.

  * Otherwise, if the file extents need fewer than some number of
    btree leaf blocks, the leaf blocks are referenced by a btree leaf
    attribute.

  * Otherwise, the file data is represented by a full data btree with
    one or more index levels.

Direct data is a nice intermediate step between immediate data and the
(degenerate) btree leaf form because it saves an index block, which is
significant for smaller files.  A direct data attribute cannot be
written randomly by a child version: there is no way to represent the
partially modified result.  Instead, it will be converted to btree leaf
pointer form, which can represent mixes, multiple versions.  However, a
direct data attribute can be truncated and completely rewritten in a
child version without needing to be converted: the child gets its own,
completely independent data attribute.

A file data btree attribute need not itself carry a version because the
extents at its leaves carry all necessary verion information.  However,
suppose that a very large file is truncated to zero in some child
version and then rewritten.  None of the data blocks in the btree
belonging to the parent or earlier versions will be inherited by the
child or later versions, so there is little or no advantage in storing
the child data pointers in the same btree.  This is a case where it is
convenient for the data attribute to carry a version, so the child gets
its own data attrbute, which may be any of the four kinds listed here
depending on the size of the rewritten file.

Size attribute

For the two kinds of btree file data, the file size attribute is
represented separately from the data attribute itself, which is
necessary because a single data btree may reference data blocks for
many different versions.  It is possible that there may be one size
attribute for each version, which can cause an inode to overflow into
multiple blocks.  It is for consideration whether some kind of indexing
structure is desirable for representing large numbers of size
attributes.  For now, it will just be a linear list possibly spanning
multiple blocks in the worst case, but typically being no more than a
few elements long.  (Each size attribute is 32 bytes, which also
includes mtime, a version label and an attribute type, see the upcoming
post on inode attributes.)

The size of direct data is computed by adding up the extents, all of
which are for the same version.  (No sparse files allowed for direct
file data.)

The size of an immediate data attribute is encoded in the attribute, as
is natural.

Complexity

This scheme will be a bug farm, I can sense that now.  Complexity comes
from converting back and forth between the different data attribute
forms, keeping the versions straight, locking for parallel access, and
getting immediate data into and out of the page cache.  But I doubt it
will be all that much code.  Here, Hammer's big flat btree looks
tempting in the way it cuts away all these microoptimizations.  But at
the cost of fatter, slower access.  My approach is just to grit my
teeth and get ready to implement this complex code correctly.  As with
the other complex bits that we already have working, for example, the
compressed index leaf format, versioned pointer processing and btree
mass delete+merge, the complexity does not spill over into other parts
of the system.  It can be torture tested in isolation until solid.  And
three of the four forms are just optimizations of the general form: we
introduce that, get it solid, then enable the next more compact form,
repeat as necessary.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3