[Tux3] Atom refcounting redux

Sat Sep 13 13:26:04 PDT 2008

Continuing the theme of obsessing over atom refcounting, and thus over
the viability of the xattr atom concept in general, we now have a
servicable implementation of persistent refcounting.  Nearly the entire
cost and complexity comes down to this one function:

int use_atom(struct inode *inode, atom_t atom, int use)
{
	unsigned shift = inode->sb->blockbits - 1;
	unsigned block = inode->sb->atomref_base + 2 * (atom >> shift);
	unsigned offset = atom & ~(-1 << shift);
	struct buffer *buffer;

	if (!(buffer = bread(inode->map, block)))
		return -EIO;
	int low = from_be_u16(((u16 *)buffer->data)[offset]) + use;
	((be_u16 *)buffer->data)[offset] = to_be_u16(low);
	if ((low & (-1 << 16))) {
		brelse_dirty(buffer);
		if (!(buffer = bread(inode->map, block + 1)))
			return -EIO;
		int high = from_be_u16(((be_u16 *)buffer->data)[offset]) + (low >> 16);
		((be_u16 *)buffer->data)[offset] = to_be_u16(high);
	}
	brelse_dirty(buffer);
	return 0;
}

Not very complicated.  Most of the noise in that function is due to
endian conversion.  Without the endian conversions it reads a little
more easily:

int use_atom(struct inode *inode, atom_t atom, int use)
{
	unsigned shift = inode->sb->blockbits - 1;
	unsigned block = inode->sb->atomref_base + 2 * (atom >> shift);
	unsigned offset = atom & ~(-1 << shift);
	struct buffer *buffer;

	if (!(buffer = bread(inode->map, block)))
		return -EIO;
	int low = ((u16 *)buffer->data)[offset] + use;
	((u16 *)buffer->data)[offset] = low;
	if ((low & (-1 << 16))) {
		brelse_dirty(buffer);
		if (!(buffer = bread(inode->map, block + 1)))
			return -EIO;
		((u16 *)buffer->data)[offset] += low >> 16;
	}
	brelse_dirty(buffer);
	return 0;
}

But of course the endian conversions are essential.  The thing is, any
way you look at it, it is not much code.  Most of the complexity is
just about dividing the refcount into two 16 bit pieces, so we can roll
up twice as many refcounts on average from the forward log into the
refcount map blocks.  The seven lines required by this secondary
optimization seem like a very good deal for that.

So our cost is essentially one bread and brelse_dirty per refcount
increase or decrease, which happens any time we add a new attribute to
an inode or remove one (by setting it to zero length) but not when we
just change its value.  And we will log the fact that we changed the
usecount if somebody calls fsync or if we need to flush cache, so there
is a bit of log pressure: the space used in the commit block and the
cost of rolling up into the recount blocks, hopefully many refcount
updates combined per block write at that point.

All these costs are pretty small actually, and I am nearly ready to
declare the atom refcounting idea a success.  It makes xattr storage
more compact and it costs only a modest amount of code, the bulk of
which is written now (still need to write some dumping code and the
reverse map for attribute listing, probably about 100 lines).  The
disk and cpu overhead looks ok, and may even be less than the savings
we will see from less cache pressure and attribute transfer overhead.

There is still a major optimization left to do here, which I will leave
for later because it is mainly just busy work and the only reason to do
it is winning benchmarks, which the main code base is not ready for
yet.  This optimization is: place a lazily loaded hash in front of the
existing use_atom mechanism so that atom use counts are just updated in
the hash.  So we don't have to bread, brelse_dirty or fiddle with
endian conversions at all: we only have to write out the log entries as
above.  This is because of the overarching design of Tux3's atomic
commit model.  All we require in memory is an up to date logical view
of the filesystem, it does not have to have exactly the block images
that a fully rolled up filesystem will have.  So it is good enough just
to update these recounts in the cache (with cpu cost in the tens of
nanoseconds) and just keep making our promises in the form of log
records that we will eventually update the respective table blocks, in
bulk and at our convenience.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3