[Tux3] The long and short of extended attributes

Sun Sep 7 17:43:39 PDT 2008

On Sun, Sep 7, 2008 at 4:13 PM, Daniel Phillips <phillips at phunq.net> wrote:
> Extended attributes are traditionally patched onto a filesystem as an
> afterthought.  I thought, it better not be that way with Tux3.  And the
> best way to be sure it is not that way is to write an early prototype.
>
> So far, only the on-disk representation of extended attributes has been
> described.  These principles apply:
>
>  * An extended attribute is represented on disk the same as an inode
>   data attribute, except it has an "atom" tag to specify the kind of
>   extended attribute.
>
>  * As a consequence, small extended attributes are stored directly in
>   the inode table block, and large ones are stored in a btree.
>
>  * Tux3 has an "atom table" that maps extended attribute names to small
>   integers.  This table is expected to be small.
>
> It is not decided yet whether the atom table is versioned.  Versioning
> the atom table would mean that the interpretation of the atom number is
> different for different versions, which might not be a problem: the
> attribute scanning algorithm already has to ignore any attributes not
> directly inherited by the target version, and if the attribute is
> inherited then so is the atom table entry.  This is really subtle, and
> against that subtlety (read: bug attractor) we had better have a strong
> argument in favor of atom table versioning.
>
> If the atom table is not versioned, then Tux3 will use the same atom
> number across all versions, and had better keep track of how many times
> each atom is used so that an unused atom table entry can eventually be
> recovered and recycled.  If the atom table is versioned then it is
> tempting to just let atom numbers hang around forever regardless of
> whether they are used in a particular version or not.  But on the other
> hand, one can imagine a load that creates a stupidly large number of
> different xattr types, turning the atom table into a gigantic thing
> that never goes away even if all those xattrs are immediately deleted.
> I would call this a strong argument in favor of tracking atom "link
> count" even if the atom table is versioned.
>
> Updating atom link count on disk may or may not be expensive.  I think
> forward logging serves as our "bat thermal underwear" here, and will
> protect Tux3 nicely against performance problems due to persistent atom
> use count updating.  Anyway, ideas on this topic are welcome.  I have
> still not decided whether atoms really need to be use-counted.  They
> will not be in the first prototype, and the atom table will not be
> versioned.
>
> Why does Tux3 want to use atom numbers at all for encoding xattrs,
> when all other filesystems I know of just store the ASCII name of each
> xattr along with the xattr data?  Tux3 is going to have to resolve the
> xattr name on every xattr operation anyway, so performance is not an
> argument, at least not obviously.  Advantages are:
>
>  * Compactness.  This indirectly affects performance by reducing the
>   amount of inode data that has to be loaded, scanned, updated and
>   saved.
>
>  * Regularity of attribute structure.  The attribute versioning
>   algorithms are quite complex, so the more similar different kinds
>   of attributes are to other objects, the fewer corner cases to worry
>   about way down at the lowest access level where the action happens.
>
>  * Future algorithms that operate internally to Tux3, to filter
>   inodes by extended attribute for one reason or another, possibly
>   security.  If such internal use of attribute numbers turns out to
>   be advantageous, it could happen that the concept of attribute name
>   resolution is lifted up into the VFS, creating new optimization
>   possibilities, and possibilities to share attribute resolution code
>   across filesystems (there has been some attempt at this in Ext2/3/4
>   xattr processing, which introduced the concept of a block cache to
>   assist in detecting and sharing attribute blocks that are exactly
>   the same.  I wonder about the efficacy of that optimization.)
>
> This is yet another case where Hammer's fat btree keys permit a nice,
> regular approach to inode attributes, whereas in Tux3 we have to work
> a little harder.  On the other hand, Hammer has to work a little harder
> to collect together all the xattrs when loading an inode, whereas Tux3
> always has them immediately available once it finds its way to the
> correct inode table block.  On the third hand, if we are ignoring
> xattrs and a high percentage of inodes have them (why are we ignoring
> them again?) then loading an inode gets more expensive.  We might wish
> to set xattrs off into separate inode table blocks if this turns out
> to be an issue.  For now, we are just going to mix them together with
> other inode attributes and see how it goes.
>
> To achieve any kind of performance, xattrs have to be cached above the
> inode table block level.  It simply will not do to have Tux3 go probing
> through the inode table btree index for each and every getxattr call.
> My solution is to link each xattr onto a list owned by struct inode at
> the time the inode is loaded into cache.  Xattrs should then exhibit
> similar cache performance to inodes themselves, without requiring any
> new cache mechanism.  But the cache footprint of an inode will be a lot
> bigger... because the xattrs are cached.  That seems like a fair trade
> to me.  If it turns out to be an issue then we can add an "xattrs
> present" flag or even a bitmap to the inode so that xattrs are cached
> lazily, which is to say, not when the inode itself is requested, but
> the other way round: a getxattr will load the inode and cache at least
> the requested xattr.
>
> A large xattr, which is to say, big compared to the size of an inode
> table block, is stored in a btree or directly indexed block just like
> an inode data attribute.  In kernel, that means creating a "mapping"
> (aka address_space, aka page cache) for the xattr.  In Tux3 userspace
> we do essentially the same, only somewhat more cleanly.  Then
> the per-inode xattr list includes the in-memory data attribute forms
> in place of the immediate attribute data, and of course access takes a
> longer path.  I think we are winning at this point, as we are caching
> all forms of xattr data by the same machinery as file data, and in
> particular, big xattrs are cached, paged, reloaded, and so on by the same
> generic and heavily optimized mechanism as file data.
>
> Detecting identical xattr values is an intriguing idea that is sort of
> implemented and working in Ext3, though I do not know how well.  I do
> not intend to attempt such a thing in Tux3 just now, mainly because I
> do not see why such a content-addressed compression scheme should be
> limited just to xattrs.  Why not all file data?  Let's just leave that
> out for now and see if there is a benchmarking argument for trying to
> do something a little special there, later.  I like Shapor's idea of
> having files inherit attributes from the directory they are initially
> created in.  So inherited attributes don't actually have to be stored
> per file until the exact match is broken.  This inheritance idea is
> somewhat analogous to version inheritance, so I sense something that
> looks like design synergy.
>
> Like a data btree, xattrs have their own variant of get_block so that
> large xattrs can be accessed stream-wise via read/write.  This variant
> just resolves the xattr by looking up the atom and searching the inode
> xattr list to find the xattr btree, before proceeding exactly as for a
> data attribute.  This mechanism is forward-looking in that Linux has no
> interface for streaming xattr reads at present.  But handling this in
> advance in Tux3 will be a strong argument for adding such an interface.
> Essentially, Tux3 merges the concept of xattr and file fork internally,
> and hopefully provides a strong argument by example of why Linux should
> acquire an external API to accommodate this.
>
> For that matter, Linux needs "getxattrat" or similar to retrieve an
> xattr from an inode that is already open.  After all, the most common
> xattrs are ACLs, and these were already retrieved when the inode was
> opened to check the rights of the calling task.  It is surprising that
> we do not already have such an interface, or it is an indication that
> xattrs are not in extensive use yet.  Tux3 can do a lot to help there,
> but in order to promote such a userspace API improvement, Tux3 had better
> perform well as a basic filesystem.  Which has been the plan all along.
>
> Now let's see if we can translate these xattr thoughts into working
> code, and hopefully while I am doing that one of our two fuse
> implementations will progress to the point where we can easily add
> xattr support and get some proper testing underway.
>
> Regards,
>
> Daniel

Daniel,

This reminds me of something I've been thinking about for a while...

I've noticed most filesystems have relatively little diversity in file
attributes (especially within a directory), so we have lots of
duplicated bits of attribute metadata.  For example, an email system
with "virtual" accounts (not tied to real Unix users) may have
millions of files with the exact same user/group/mode (Maildirs).
With Tux3, if the inodes didn't explicitly track the extra 6 or so
bytes of user/group/mode data per entry, we could see a potential 25%
reduction in size of our already compact inodes.

After first reading this post, I thought the right approach may be to
combine xattrs and user/group/mode into a single attribute atom table
which could grow dynamically in addressability (with 2 or 3 levels).
However, I think an inheritance model would work better.  With atoms,
it is possible for any user (malicious or not) to grow the atom table
significantly.  Updating reference counts also sounds complex, with a
lot of corner cases.
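
To make the trade-off concrete, here is a minimal sketch of a use-counted
atom table in C.  It is not Tux3 code: the fixed-size array, the limits
MAX_ATOMS and MAX_NAME, and the function names atom_find, atom_get and
atom_put are all assumptions made for illustration.  An entry whose count
drops to zero becomes free and its atom number can be recycled.

```c
#include <assert.h>
#include <string.h>

#define MAX_ATOMS 64	/* hypothetical table size, chosen for the sketch */
#define MAX_NAME 32	/* hypothetical name limit, including the NUL */

struct atom_entry {
	char name[MAX_NAME];	/* xattr name, e.g. "user.mime_type" */
	unsigned count;		/* xattrs using this atom; 0 means free */
};

struct atom_table {
	struct atom_entry entry[MAX_ATOMS];
};

/* Return the atom number for name, or -1 if the name has no atom. */
int atom_find(const struct atom_table *table, const char *name)
{
	for (int atom = 0; atom < MAX_ATOMS; atom++)
		if (table->entry[atom].count &&
		    !strcmp(table->entry[atom].name, name))
			return atom;
	return -1;
}

/*
 * Take one reference on the atom for name, claiming the lowest free
 * slot if the name is new.  Returns the atom number, or -1 if the
 * name is too long or the table is full.
 */
int atom_get(struct atom_table *table, const char *name)
{
	int atom = atom_find(table, name);

	if (atom >= 0) {
		table->entry[atom].count++;
		return atom;
	}
	if (strlen(name) >= MAX_NAME)
		return -1;
	for (atom = 0; atom < MAX_ATOMS; atom++) {
		if (!table->entry[atom].count) {
			strcpy(table->entry[atom].name, name);
			table->entry[atom].count = 1;
			return atom;
		}
	}
	return -1;
}

/* Drop one reference; at zero the slot is free and may be recycled. */
void atom_put(struct atom_table *table, int atom)
{
	assert(atom >= 0 && atom < MAX_ATOMS);
	assert(table->entry[atom].count > 0);
	table->entry[atom].count--;
}
```

Each setxattr would call atom_get and each xattr removal atom_put.
Without the count, a user who creates MAX_ATOMS distinct names and then
deletes them all leaves the table full for good, which is the growth
problem described above.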

Initially, I thought we could track user/group/mode defaults on a
per-directory basis, but discarded this due to the inability to
(easily) map an inode to a parent directory (not to mention hard
links, duh).  It would be possible, however, to have attribute
defaults for inode table blocks (or higher level branches of the tree,
even).  If we did that, it could lessen the need for a more complex
atom-based approach.
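
A rough sketch of how per-block defaults might resolve, again in C and
again hypothetical: the 16-bit field widths are picked only to match the
"6 or so bytes" estimate, and the HAS_OWNER flag, the struct names and
both functions are invented for illustration, not taken from Tux3.

```c
#include <assert.h>
#include <stdint.h>

enum { HAS_OWNER = 1 };	/* hypothetical flag: inode stores its own owner */

struct owner {
	uint16_t uid, gid, mode;	/* 6 bytes, an assumed encoding */
};

struct block_defaults {
	struct owner owner;	/* shared by every inode in the block */
};

struct small_inode {
	unsigned flags;
	struct owner owner;	/* valid only if HAS_OWNER is set; on disk
				   this field would simply be omitted */
};

/* Resolve the effective owner: the explicit one, else the default. */
struct owner inode_owner(const struct block_defaults *defaults,
			 const struct small_inode *inode)
{
	assert(defaults && inode);
	return (inode->flags & HAS_OWNER) ? inode->owner : defaults->owner;
}

/* Store an owner explicitly only when it differs from the default. */
void inode_set_owner(const struct block_defaults *defaults,
		     struct small_inode *inode, struct owner owner)
{
	if (owner.uid == defaults->owner.uid &&
	    owner.gid == defaults->owner.gid &&
	    owner.mode == defaults->owner.mode) {
		inode->flags &= ~HAS_OWNER;
	} else {
		inode->flags |= HAS_OWNER;
		inode->owner = owner;
	}
}
```

In the Maildir case nearly every inode would leave HAS_OWNER clear and
pay nothing for user/group/mode.  The sketch ignores the hard part:
changing a block's defaults, or moving an inode to another block, means
rewriting every inode that was relying on the old defaults.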

I suppose the inheritance and atom approaches could be combined or
chosen based on how the filesystem is being used, but that sounds
exponentially complex. :)

Regards,
Shapor
