[Tux3] Inode attributes

Daniel Phillips phillips at phunq.net
Sun Aug 24 05:27:19 PDT 2008


Inode attributes are central to Tux3.  Not only do they carry the usual
data needed to implement Posix semantics, they are a key part of the
Tux3 versioning scheme.

Tux3 departs from the usual idea of fixed size inodes on disk, and
instead represents an inode as a linear list of attributes packed
together end to end, as many as are needed to represent the inode.
This representation is forced by Tux3's versioning scheme: all
versions of all attributes of the inode are represented contiguously
in one region of the inode table tree.  This is necessary in order to
for version inheritance to work efficiently (notwithstanding plans to 
break this version blob up into independent subtrees later in Tux3's
life to scale up to very large numbers of versions).

It is possible for an inode to grow quite large if some attribute, say
the file size, changes in every version and a large number of versions
are retained.  The intial Tux3 code will support 512 versions (10 bits
allocated for version label = 1024 versions, but need to allow for half
of these as ghosts, see the versioned pointer post).  If a file grows
continuously its size will be different in every version, so we might
well have to store 512 different size attributes for some files.  So
size attributes better be small.  (Hmm, why don't we delta encode them
then, the versioning algorithm inherently identifies the historical
chain anyway.  Good idea, let's make a note to do that later.)

Small in this case translates into 16 bytes per attribute.  We need
60 bits for the file size (1 exabyte with byte granularity) and also
need to store the mtime every time it changes, so we put it into the
same attribute.  For every kind of inode attribute, we need to store
the version, 10 bits, and a 4 bit tag to identify the kind of attribute
when we scan through the list of attributes for the inode.  48 bits
worth of mtime should be enough, which all adds up to a couple bits
short of 16 bytes.

A digression: up till now I have been playing with ways of packing all
the different kinds of attributes Tux3 has into 8 byte granular units.
But there is not really a strong argument for aligning these objects:
once we have found our way to an inode table and iterated through the
attribute list to find the ones we want, unpacking unaligned fields is
really a completely trivial cost.  So Tux3 inode attributes are going
to be byte aligned.  This will make them a little more compact, maybe
10% or so, maybe a little more.  Given that the CPU cost is small, that
10% is worth it, and it will relieve me from going through contortions
or making compromises in, say, time resolution, just to pack reasonably
into fixed size units.

Tux3 is supposed to be a big endian filesystem, but so far no endian
work at all has been done.  It starts now, with the iattr.c file.  We
will go back and fix all the other files in due course.  But attributes
need that functionality now, because the attribute decoder needs a
predictable place to find certain generic fields of attributes.  In
other words, attributes are not entirely opaque to the attribute
decoder.

To reiterate: attributes are variable sized with byte granularity.  The
first byte of each attribute is labeled with the kind of attribute, a
four bit field in the high bits of the byte.  (For those who have never
experienced the pleasures of big endian organization, when you hexdump
a big endian structure you can see fields that cross bytes easily, as
opposed to the scrambled eggs effect of little endian brain damage.)

The attribute scanning loop always needs to know about two fields in
each attribute - the kind and the version - and sometimes will need to
know where to find a size field for a variable sized attribute like
immediate data.  So the first two bytes of each attribute naturally
contain the attribute kind and the version label, total 14 bits.  (We
will probably think of something useful to do with the remaining two
bits!)

The version lookup algorithm as currently implemented requires a single
pass through all versions of a given attribute kind.  One day that may
be improved so that only Log(attributes) need to be exammined, but for
now it fits in nicely with the simple idea of always traversing all the
attributes of an inode when we initially load it into memory.

Returning to the discussion of the inode size attribute, we might have
512 of those 16 byte attributes.  That is 8K of attribute data,
a little more than two inode table blocks.  Other attributes are not
nearly as likely to change frequently, so we can guess that very few
Tux3 inodes will need much more than 8K in the inode table.  And 8K
is not really that big a problem.  If the file size changed so much,
the file is probably pretty big and that is the cost that will
dominate.

I have described the notion of attribute groups elsewhere and will not
reiterate that now.  Better to just write the code at this point and
explain any subtleties that arise afterwards.

Just in case it sounds like Tux3 is going to have to do a lot of work
to load an inode, it is not as bad as it sounds.  For one thing, if you
are not holding snapshots at all, the variable sized attribute scheme
means that Tux3 inodes will be on average significantly smaller than
any other filesystem I know of.  If you are holding lots of versions,
surely not every file will grow continuously on every snapshot, only a
small fraction.  And even if it is every file, walking across a few K
of data is just not a big expense these days, if it ever was.  If it
does turn out to matter then Tux3 will acquire an indexing scheme for
the inode attributes themselves earlier instead of later.  I will
predict now that this cost does not matter for the forseeable future.
We are not very far way from putting this question to the test.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3



More information about the Tux3 mailing list