[Tux3] More xattr design details - efficient atom refcounting

Fri Sep 12 01:49:34 PDT 2008

OK, I have decided to take a run at implementing reference counting for
atoms: the correct, no compromise approach, and maybe also efficient
enough that we do not have to make it optional.

The plan is this:

  * We have an atom reference count table which is broken into two
    parts, a table of the low 16 bits of the atom count, and a
    table of the high 48 bits that receives carries and borrows from
    the low 16 bit table.

  * Both tables are mapped into the atom table at a high logical
    offset.  Allowing 32 bits worth of atom numbers, and with at most
    256 atom entries per 4K dirent block, we need at most
    (32 << 8) = 1 TB dirent bytes for the atom dictionary, so the
    count tables start at block number 2^40 >> 12 = 2^28.

  * The low end count table needs 2^33 bytes at most, or 2^21 blocks,
    so the high count table starts just above it at 2^28 + 2^21
    blocks.

  * This all works for blocksize down to 2^9, and breaks at 2^8, but
    that is ok I think.

These blocks sit in the block cache (->map in Tux3 fake page cache,
->mapping in kernel) in a sparse file of course, so we do not actually
use huge amounts of space, just take advantage of the logical address
mapping to avoid having several files that which would have to be
created and synced separately.

So to update an atom count we need to bread the right count block,
add to the count, carry over into the high order count map if necessary
(very rare) then we are done.  If we care about the cost of hashing
into the map to find the block then we hold a use count on and a
pointer to the last count block accessed.  To start with we don't care
about that.

This leaves us with a dirty buffer or two in the atable, and we keep
doing that for a while until it comes time to sync to disk.  We will
mostly likely just have one block dirty, the count block, or two if
somebody created new xattrs.  We will not actually write out the count
block(s) that changed, but instead a list of [atom, +-count] pairs
tucked into the commit block that we have to write out anyway to do
atomic update on the inodes that we changed by adding, removing or
changing xattrs.  So far our refcounting has cost us nothing at all,
except a little space in a commit block that was probably going to be
wasted.

The extra cost comes when we have to come along later and roll up the
logs, including writing out the changed blocks of the atable in their
own atomic transaction.  The cost of refcounting is therefore exactly
the amount by which the atable blocks increase the cost of doing a
rollup.  Close to none at all I think, because in real cases we will
just be writing out one additional block per rollup, and we can defer
rollups as long as we want, because log rollup is not required for
fsync or even full filesystem sync.  We only need to roll up to limit
the time required to rebuild the cache state on restart, which I
expect can be at coarse intervals that will righteously amortize away
our incremental refcount block update cost.

Sound good?  It does to me.  Sound too good to be true?  Feel free to
poke away.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3