[Tux3] The long and short of extended attributes

Daniel Phillips phillips at phunq.net
Sun Sep 7 16:13:17 PDT 2008


Extended attributes are traditionally patched onto a filesystem as an
afterthought.  I thought it had better not be that way with Tux3, and
the best way to be sure of that is to write an early prototype.

So far, only the on-disk representation of extended attributes has been
described.  These principles apply:

 * An extended attribute is represented on disk the same as an inode
   data attribute, except it has an "atom" tag to specify the kind of
   extended attribute.

 * As a consequence, small extended attributes are stored directly in
   the inode table block, and large ones are stored in a btree.

 * Tux3 has an "atom table" that maps extended attribute names to small
   integers.  This table is expected to be small.
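
To make the atom table idea concrete, here is a minimal in-memory
sketch in C.  Everything in it is an assumption made for illustration:
the names atom_table, find_atom and make_atom, the fixed MAX_ATOMS
limit and the linear search are hypothetical, and nothing here
describes the on-disk format.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical limit; the post only says the table should be small. */
enum { MAX_ATOMS = 64 };

/* Hypothetical in-memory atom table: xattr name <-> small integer. */
struct atom_table {
	char *names[MAX_ATOMS];	/* names[atom] is the xattr name */
	unsigned count;		/* number of atoms handed out so far */
};

/* Return the atom for name, or -1 if the name has no atom yet. */
int find_atom(const struct atom_table *table, const char *name)
{
	assert(table && name);
	for (unsigned atom = 0; atom < table->count; atom++)
		if (!strcmp(table->names[atom], name))
			return (int)atom;
	return -1;
}

/*
 * Return the existing atom for name, or assign the next unused number.
 * Returns -1 if the table is full or memory runs out.
 */
int make_atom(struct atom_table *table, const char *name)
{
	int atom = find_atom(table, name);
	if (atom >= 0)
		return atom;
	if (table->count == MAX_ATOMS)
		return -1;
	size_t len = strlen(name) + 1;
	char *copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, name, len);
	table->names[table->count] = copy;
	return (int)table->count++;
}
```

An attribute stored in the inode table block would then carry only the
small integer returned by make_atom, not the name itself.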

It is not decided yet whether the atom table is versioned.  Versioning
the atom table would mean that the interpretation of the atom number is
different for different versions, which might not be a problem: the
attribute scanning algorithm already has to ignore any attributes not
directly inherited by the target version, and if the attribute is
inherited then so is the atom table entry.  This is really subtle, and
against that subtlety (read: bug attractor) we had better have a strong
argument in favor of atom table versioning.

If the atom table is not versioned, then Tux3 will use the same atom
number across all versions, and had better keep track of how many times
each atom is used so that an unused atom table entry can eventually be
recovered and recycled.  If the atom table is versioned then it is
tempting to just let atom numbers hang around forever regardless of
whether they are used in a particular version or not.  But on the other
hand, one can imagine a load that creates a stupidly large number of
different xattr types, turning the atom table into a gigantic thing
that never goes away even if all those xattrs are immediately deleted.
I would call this a strong argument in favor of tracking atom "link
count" even if the atom table is versioned.

Updating atom link count on disk may or may not be expensive.  I think
forward logging serves as our "bat thermal underwear" here, and will
protect Tux3 nicely against performance problems due to persistent atom
use count updating.  Anyway, ideas on this topic are welcome.  I have
still not decided whether atoms really need to be use-counted.  They
will not be in the first prototype, and the atom table will not be
versioned.
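
For reference, if atoms do end up being use-counted, the bookkeeping
could look roughly like the sketch below.  It is purely in-memory and
says nothing about forward logging or the on-disk update cost; the
names counted_atom_table, get_atom and put_atom and the MAX_ATOMS limit
are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { MAX_ATOMS = 64 };	/* hypothetical limit */

struct counted_atom {
	char *name;	/* NULL marks a free, recyclable slot */
	unsigned uses;	/* number of stored xattrs that use this atom */
};

struct counted_atom_table {
	struct counted_atom atoms[MAX_ATOMS];
};

/*
 * Take one use of the atom for name, creating the atom in the first
 * free slot if it does not exist.  Returns -1 if the table is full or
 * memory runs out.
 */
int get_atom(struct counted_atom_table *table, const char *name)
{
	int free_slot = -1;

	assert(table && name);
	for (int i = 0; i < MAX_ATOMS; i++) {
		struct counted_atom *atom = &table->atoms[i];
		if (!atom->name) {
			if (free_slot < 0)
				free_slot = i;
		} else if (!strcmp(atom->name, name)) {
			atom->uses++;
			return i;
		}
	}
	if (free_slot < 0)
		return -1;
	size_t len = strlen(name) + 1;
	char *copy = malloc(len);
	if (!copy)
		return -1;
	memcpy(copy, name, len);
	table->atoms[free_slot].name = copy;
	table->atoms[free_slot].uses = 1;
	return free_slot;
}

/* Drop one use; when the last use goes away the slot can be recycled. */
void put_atom(struct counted_atom_table *table, int atom)
{
	assert(atom >= 0 && atom < MAX_ATOMS);
	assert(table->atoms[atom].uses > 0);
	if (--table->atoms[atom].uses == 0) {
		free(table->atoms[atom].name);
		table->atoms[atom].name = NULL;
	}
}
```

Setting an xattr would call get_atom and deleting one would call
put_atom, so a burst of short-lived xattr names leaves no permanent
entries behind.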

Why does Tux3 want to use atom numbers at all for encoding xattrs,
when all other filesystems I know of just store the ASCII name of each
xattr along with the xattr data?  Tux3 is going to have to resolve the
xattr name on every xattr operation anyway, so performance is not an
argument, at least not obviously.  Advantages are:

 * Compactness.  This indirectly affects performance by reducing the
   amount of inode data that has to be loaded, scanned, updated and
   saved.

 * Regularity of attribute structure.  The attribute versioning
   algorithms are quite complex, so the more similar different kinds
   of attributes are to other objects, the fewer corner cases to worry
   about way down at the lowest access level where the action happens.

 * Future algorithms that operate internally to Tux3, to filter
   inodes by extended attribute for one reason or another, possibly
   security.  If such internal use of attribute numbers turns out to
   be advantageous, the concept of attribute name resolution could be
   lifted up into the VFS, creating new optimization possibilities,
   and possibilities to share attribute resolution code across
   filesystems.  (There has been some attempt at this in Ext2/3/4
   xattr processing, which introduced the concept of a block cache to
   assist in detecting and sharing attribute blocks that are exactly
   the same.  I wonder about the efficacy of that optimization.)

This is yet another case where Hammer's fat btree keys permit a nice,
regular approach to inode attributes, whereas in Tux3 we have to work
a little harder.  On the other hand, Hammer has to work a little harder
to collect together all the xattrs when loading an inode, whereas Tux3
always has them immediately available once it finds its way to the
correct inode table block.  On the third hand, if we are ignoring
xattrs and a high percentage of inodes have them (why are we ignoring
them again?) then loading an inode gets more expensive.  We might wish
to set xattrs off into separate inode table blocks if this turns out
to be an issue.  For now, we are just going to mix them together with
other inode attributes and see how it goes.

To achieve any kind of performance, xattrs have to be cached above the
inode table block level.  It simply will not do to have Tux3 go probing
through the inode table btree index for each and every getxattr call.
My solution is to link each xattr onto a list owned by struct inode at
the time the inode is loaded into cache.  Xattrs should then exhibit
similar cache performance to inodes themselves, without requiring any
new cache mechanism.  But the cache footprint of an inode will be a lot
bigger... because the xattrs are cached.  That seems like a fair trade
to me.  If it turns out to be an issue then we can add an "xattrs
present" flag or even a bitmap to the inode so that xattrs are cached
lazily, which is to say, not when the inode itself is requested, but
the other way round: a getxattr will load the inode and cache at least
the requested xattr.
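
A minimal sketch of the per-inode xattr list, assuming a plain singly
linked list keyed by atom number.  The struct and function names are
hypothetical, and the real struct inode would of course carry much
more than this.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* One cached xattr, linked onto its inode (hypothetical layout). */
struct xattr {
	struct xattr *next;
	unsigned atom;		/* atom table number of the xattr name */
	size_t size;		/* length of body in bytes */
	unsigned char body[];	/* immediate attribute data */
};

/* Stand-in for the in-memory inode; only the xattr list is shown. */
struct cached_inode {
	struct xattr *xattrs;
};

/* Find a cached xattr by atom, or NULL if the inode does not have it. */
struct xattr *inode_find_xattr(const struct cached_inode *inode,
	unsigned atom)
{
	assert(inode);
	for (struct xattr *x = inode->xattrs; x; x = x->next)
		if (x->atom == atom)
			return x;
	return NULL;
}

/*
 * Copy one xattr onto the inode's list, as would happen for each xattr
 * found while loading the inode.  Returns 0, or -1 if out of memory.
 */
int inode_cache_xattr(struct cached_inode *inode, unsigned atom,
	const void *data, size_t size)
{
	struct xattr *x = malloc(sizeof *x + size);
	if (!x)
		return -1;
	x->atom = atom;
	x->size = size;
	memcpy(x->body, data, size);
	x->next = inode->xattrs;
	inode->xattrs = x;
	return 0;
}
```

With this in place, a getxattr on a cached inode is a name to atom
lookup followed by a short list walk, with no trip through the inode
table btree.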

A large xattr, which is to say, big compared to the size of an inode
table block, is stored in a btree or directly indexed block just like
an inode data attribute.  In the kernel, that means creating a "mapping"
(aka address_space, aka page cache) for the xattr.  In Tux3 userspace
we do essentially the same thing, only somewhat more cleanly.  Then the
per-inode xattr list includes the in-memory data attribute forms in
place of the immediate attribute data, and of course access takes a
longer path.  I think we are winning at this point, as we are caching
all forms of xattr data by the same machinery as file data, and in
particular, big xattrs are cached, paged, reloaded and so on by the
same generic and heavily optimized mechanism as file data.

Detecting identical xattr values is an intriguing idea that is sort of
implemented and working in Ext3, though I do not know how well.  I do
not intend to attempt such a thing in Tux3 just now, mainly because I
do not see why such a content-addressed compression scheme should be
limited just to xattrs.  Why not all file data?  Let's just leave that
out for now and see if there is a benchmarking argument for trying to
do something a little special there, later.  I like Shapor's idea of
having files inherit attributes from the directory they are initially
created in.  So inherited attributes don't actually have to be stored
per file until the exact match is broken.  This inheritance idea is
somewhat analogous to version inheritance, so I sense something that
looks like design synergy.

Like a data btree, xattrs have their own variant of get_block so that
large xattrs can be accessed stream-wise via read/write.  This variant
just resolves the xattr by looking up the atom and searching the inode
xattr list to find the xattr btree, before proceeding exactly as for a
data attribute.  This mechanism is forward-looking in that Linux has no
interface for streaming xattr reads at present.  But handling this in
advance in Tux3 will be a strong argument for adding such an interface.
Essentially, Tux3 merges the concept of xattr and file fork internally,
and hopefully provides a strong argument by example of why Linux should
acquire an external API to accommodate this.
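
The xattr flavor of get_block might look something like the sketch
below, which stands in a flat array of blocks for the xattr btree and
uses a toy block size.  The names xattr_get_block and xattr_read, the
structs and BLOCK_SIZE are all hypothetical; this is not the kernel's
get_block_t interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { BLOCK_SIZE = 16 };	/* toy block size, for illustration only */

/* Stand-in for an xattr btree: logical block number -> block data. */
struct xattr_map {
	unsigned char (*blocks)[BLOCK_SIZE];
	unsigned count;
};

/* A large xattr on the per-inode xattr list (hypothetical layout). */
struct stream_xattr {
	struct stream_xattr *next;
	unsigned atom;		/* atom table number of the xattr name */
	size_t size;		/* xattr length in bytes */
	struct xattr_map map;
};

struct stream_xattr *find_stream(struct stream_xattr *list, unsigned atom)
{
	for (; list; list = list->next)
		if (list->atom == atom)
			return list;
	return NULL;
}

/* Resolve the xattr by atom, then map a logical block as for file data. */
unsigned char *xattr_get_block(struct stream_xattr *list, unsigned atom,
	unsigned block)
{
	struct stream_xattr *x = find_stream(list, atom);
	if (!x || block >= x->map.count)
		return NULL;
	return x->map.blocks[block];
}

/* Stream-wise read built on xattr_get_block; returns bytes copied. */
size_t xattr_read(struct stream_xattr *list, unsigned atom, size_t offset,
	void *buf, size_t len)
{
	struct stream_xattr *x = find_stream(list, atom);
	unsigned char *out = buf;
	size_t done = 0;

	assert(buf);
	if (!x || offset >= x->size)
		return 0;
	if (len > x->size - offset)
		len = x->size - offset;	/* clip the read at end of xattr */
	while (done < len) {
		size_t pos = offset + done, within = pos % BLOCK_SIZE;
		size_t chunk = BLOCK_SIZE - within;
		unsigned char *block =
			xattr_get_block(list, atom, pos / BLOCK_SIZE);
		if (!block)
			break;
		if (chunk > len - done)
			chunk = len - done;
		memcpy(out + done, block + within, chunk);
		done += chunk;
	}
	return done;
}
```

A read that crosses a block boundary simply calls xattr_get_block
twice, which is the same path a read of ordinary file data would take.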

For that matter, Linux needs "getxattrat" or similar to retrieve an
xattr from an inode that is already open.  After all, the most common
xattrs are ACLs, and these were already retrieved when the inode was
opened to check the rights of the calling task.  It is surprising that
we do not already have such an interface, or it is an indication that
xattrs are not in extensive use yet.  Tux3 can do a lot to help there,
but in order to promote such a userspace API improvement, Tux3 had
better perform well as a basic filesystem.  Which has been the plan all
along.

Now let's see if we can translate these xattr thoughts into working
code, and hopefully while I am doing that one of our two FUSE
implementations will progress to the point where we can easily add
xattr support and get some proper testing underway.

Regards,

Daniel
