[Tux3] Feature interaction between multiple volumes and atomic update

Daniel Phillips phillips at phunq.net
Sat Aug 30 02:56:32 PDT 2008


On Friday 29 August 2008 20:31, Matthew Dillon wrote:
> :It turns out that multiple independent volumes sharing the same
> :allocation space is a feature that does not quite come for free as I
> :had earlier claimed.  The issue is this:
> :...
> : * Therefore it seems logical that Tux3 should have a separate forward
> :   log for each subvolume to allow independent syncing of subvolumes.
> :   But global allocation state must always be consistent regardless of
> :   the order in which subvolumes are synced.
> 
>     I had a lot of trouble trying to implement multiple logs in HAMMER
>     (the idea being to improve I/O throughput).  I eventually gave up
>     and went with a single log (well, UNDO fifo in HAMMER's case).  So
>     e.g. even though HAMMER does implement pseudo-filesystem spaces
>     for mirroring slaves and such, everything still uses a single log
>     space.

The concept of a log in Tux3 is a little different, consisting of mini
transactions placed wherever the data goal happens to be, with a commit
block placed opportunistically somewhere near each transaction body.  I
think this lends itself to parallelizing, though I agree that starting
off that way would be an unwise implementation choice.
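
Very roughly, I picture each of these landing on disk as a body of
dirty blocks written near the data goal, sealed by a small commit block
something like the following.  The names and fields here are only
placeholders of mine, nothing about the format is final:

#include <stdint.h>

struct commit_block {
        uint64_t magic;        /* marks a Tux3 commit block */
        uint64_t sequence;     /* position in the linear replay order */
        uint64_t body_start;   /* block address of the transaction body */
        uint32_t body_blocks;  /* length of the body in blocks */
        uint32_t checksum;     /* detects a torn or partial commit write */
};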

At first there will be exactly one replay order, though transactions
will not necessarily be written in that order.  What I see is throwing
several hundred mini transactions (I need to come up with a name for
such things) at the block device in parallel, but waiting on
completions in a linear order, in turn unblocking waiters in a linear
order.  This is to avoid such a faux pas as allowing an fsync to
complete when there would be gaps in the linear replay sequence on
which the fsync depends.
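
Here is a minimal userspace sketch of that discipline, my own strawman
using pthreads rather than whatever the kernel side will really use.
Completions may arrive in any order, but the committed frontier only
advances over a gapless prefix of the sequence, and an fsync-style
waiter sleeps until its own sequence number falls inside that prefix:

#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>

#define MAX_INFLIGHT 512        /* assumes at most this many transactions in flight */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t advanced = PTHREAD_COND_INITIALIZER;
static bool done[MAX_INFLIGHT]; /* completion flags, indexed by sequence % MAX_INFLIGHT */
static uint64_t committed;      /* every sequence <= committed has completed */

/* I/O completion for transaction 'seq' (sequences start at 1, may finish out of order) */
void complete_transaction(uint64_t seq)
{
        pthread_mutex_lock(&lock);
        done[seq % MAX_INFLIGHT] = true;
        /* advance the gapless frontier as far as completions allow */
        while (done[(committed + 1) % MAX_INFLIGHT]) {
                done[(committed + 1) % MAX_INFLIGHT] = false;
                committed++;
        }
        pthread_cond_broadcast(&advanced);
        pthread_mutex_unlock(&lock);
}

/* fsync-style wait: do not return until every transaction up to 'seq' is on disk */
void wait_for_transaction(uint64_t seq)
{
        pthread_mutex_lock(&lock);
        while (committed < seq)
                pthread_cond_wait(&advanced, &lock);
        pthread_mutex_unlock(&lock);
}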

> : 3) When the first subvolume is remounted after a crash, implicitly
> :    remount and replay all subvolumes that were also mounted at the time
> :    of the crash, roll up the logs, and unmount them.
> 
>     If you synchronize the transaction id spaces between the subvolumes
>     then the crash recovery code could use a single number to determine
>     how far to replay each subvolume.  That sounds like it ought to work.

Right, that is not too hard.  What I don't like about (3) is that a
bunch of work gets done on behalf of subvolumes nobody has asked for
yet, which means the one actually being mounted has to wait for
possibly irrelevant work to finish before it comes back online.

Another thing I don't like about this is that it violates the design
decision that there is no classic "replay" on mount with Tux3, only
recreation of the relevant cache state.  The cache state of a subvolume
that is not yet being mounted does not qualify as relevant, and yet
it would be recreated anyway, then discarded.
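
For what it is worth, with the synchronized transaction ids the
filtering itself is trivial.  A strawman of mine, not actual Tux3 code,
with names invented purely for illustration:

#include <stdbool.h>
#include <stdint.h>

/* every log record carries an id drawn from one shared, monotonically
 * increasing sequence, plus a tag saying which subvolume owns it */
struct log_record {
        uint64_t txid;          /* shared across all subvolumes */
        uint32_t subvol;        /* owning subvolume */
        /* ... payload ... */
};

/* apply a record only if it belongs to the subvolume being mounted,
 * postdates that subvolume's last rollup, and lies within the globally
 * committed range that the single synchronized number defines */
bool should_replay(const struct log_record *rec, uint32_t mounting,
                   uint64_t last_rollup, uint64_t committed)
{
        return rec->subvol == mounting &&
               rec->txid > last_rollup &&
               rec->txid <= committed;
}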

> : 4) Partition the allocation space so that each subvolume allocates
> :    from a completely independent allocation space, which is separately
> :    logged and synced.  Either implement this by providing an
> :    additional level of indirection so that each subvolume has its own
> :    map of the complete volume which may be expanded from time to time
> :    by large increments, or record in each subvolume allocation map
> :    only those regions that are free and available to the subvolume.
> 
>     I tried this in an earlier HAMMER implementation and it was a
>     nightmare.  I gave up on it.  Also, in an earlier iteration, I
>     had a blockmap translation layer to support the above.  That 
>     worked fairly well as long as the blocks were very large (at least
>     8MB).  When I went to the single global B-Tree model I didn't
>     need the layer any more and devolved it back down to a simple
>     2-layer freemap.
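
Just so we are picturing the same thing, the kind of translation layer
in question would look roughly like this.  Purely illustrative, my own
strawman rather than code from HAMMER or Tux3:

#include <stdint.h>

/* each subvolume owns a private map of large extents carved out of the
 * shared device, grown in big increments as the subvolume needs space */
struct subvol_extent {
        uint64_t physical_start;   /* block offset on the shared volume */
        uint64_t blocks;           /* extent length, e.g. 8 MB worth of blocks */
};

struct subvol_map {
        struct subvol_extent *extents;
        unsigned count;
};

/* translate a subvolume-logical block number to a physical block number,
 * or return -1 if it falls outside the space granted so far */
int64_t subvol_to_physical(const struct subvol_map *map, uint64_t logical)
{
        for (unsigned i = 0; i < map->count; i++) {
                if (logical < map->extents[i].blocks)
                        return (int64_t)(map->extents[i].physical_start + logical);
                logical -= map->extents[i].blocks;
        }
        return -1;
}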

Number 4 is a service that any volume manager should be able to
perform.  Unfortunately on Linux, the volume manager performs the job
poorly, not in a way that a filesystem can control on the fly.  So a
filesystem cannot expand itself; there is simply no suitable internal
interface to direct the volume manager to do its part of the work.  I
presume that a similar sad situation must exist on Solaris, which set
the stage for ZFS taking on the role of volume manager itself, a
factoring that I find... a little disturbing.

I am now leaning towards the idea of dropping subvolumes.  There is
exactly one advantage I have been able to think of that falls outside
what a volume manager could do: the idea of "budding" and "melding"
directories to and from separate volumes.  And I hardly think that
having multiple subvolumes available is the only way to do that.

Subvolumes can always be added later if there turn out to be real use
cases that cannot possibly be performed by an improved volume manager.

> :I CC'd this one to Matt Dillon, perhaps mainly for sympathy.  Hammer
> :does not have this issue as it does not support subvolumes, perhaps
> :wisely.
> 
>     Yah.  We do support pseudo-filesystems within a HAMMER filesystem,
>     but they are implemented using a field in the B-Tree element key.
>     They aren't actually separate filesystems, they just use totally
>     independent key spaces within the global B-Tree.

I envy the lovely symmetry of your fat key space.  But I am convinced
that the extra complexity of a hierarchical structure encoded with
narrow keys (48 bits to handle exabyte filesystems) will pay off in
cache performance.  And the complexity is not actually all that much.
Most of the structural details are in place now and the code base is
about 5,000 lines; roughly a third of that is unit tests and the buffer
emulation is about 10%.  I expect the finished size of the kernel code
to come in at around 10,000 lines, but I am guessing wildly about how
complex the fsync will be; the details of that are just starting to
take shape.
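
For anyone checking the arithmetic on those 48 bits, assuming 4 KiB
blocks (my working assumption here, not a settled Tux3 parameter):

#include <assert.h>
#include <stdint.h>

int main(void)
{
        uint64_t block_numbers = 1ULL << 48;   /* 48-bit block addresses */
        uint64_t block_size = 1ULL << 12;      /* 4 KiB blocks, assumed */

        /* 2^48 blocks x 2^12 bytes = 2^60 bytes = one binary exabyte */
        assert(block_numbers * block_size == 1ULL << 60);
        return 0;
}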

>     We use the PFSs as replication sources and targets.  This also allows
>     the inode numbers to be replicated (each PFS gets its own inode
>     numbering space).

Your PFS sounds like what I would call a snapshot.

Indeed, it is essential to replicate inode numbers faithfully if you
want to export via NFS on the downstream side.  We got that for free in
ddsnap, which replicates an entire volume, but I had overlooked that
detail so far in thinking about filesystem-based replication.  Ooh.
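
The reason it matters: a typical NFS file handle embeds the inode
number and a generation, roughly like this (an illustrative layout, not
any particular server's actual format), so a replica that renumbers
inodes breaks every handle a client holds across a failover:

#include <stdint.h>

/* illustrative only: the downstream side must present the same inode
 * numbers, because clients hold handles like this across a failover */
struct fh_sketch {
        uint32_t ino;   /* inode number -- must match on the replica */
        uint32_t gen;   /* generation, guards against inode number reuse */
};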

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3


