[Tux3] Feature interaction between multiple volumes and atomic update
Daniel Phillips
phillips at phunq.net
Fri Aug 29 17:41:43 PDT 2008
It turns out that multiple independent volumes sharing the same
allocation space is a feature that, contrary to my earlier claim, does
not quite come for free. The issue is this:
* Tux3 guarantees that when fsync (or other filesystem sync) returns
then the entire volume including all subvolumes is in a consistent
state. In particular, any block in use by the subvolume being
synced is persistently recorded as in use, and no block that is not
in use by (the persistent image of) any subvolume is recorded as in
use.
* It is desirable that an fsync apply only to the subvolume being
synced, even if other subvolumes are mounted and in use at the same
time. Otherwise, syncing a given subvolume would require time
proportional to the number of subvolumes simultaneously mounted,
which would be a regression compared to having the volumes actually
separate. Since the multiple subvolume feature has a marginal use
case anyway, such a drawback would verge on being fatal for this
feature.
* Therefore it seems logical that Tux3 should have a separate forward
log for each subvolume to allow independent syncing of subvolumes (see
the sketch after this list). But global allocation state must always
be consistent regardless of the order in which subvolumes are synced.
* We do not want to have a separate log dedicated to block allocation
because that would require updating two logs in many cases where
only one log update would otherwise be required.
* An unexpected interruption may occur when any combination of
subvolumes is mounted and active. But on restart, nothing requires
that the same set of subvolumes be remounted.
* If a subvolume is not mounted, then it is not desirable for Tux3 to
recreate the cache state of that subvolume. Recreating cache state by
replaying the log into cache is fundamental to the Tux3 integrity
recovery design, so we do not want to be obliged to replay the log of
every subvolume that happened to be mounted at the time of a crash.
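To make these constraints concrete, here is a minimal sketch in C of
what a per-subvolume forward log block might look like. This is not
Tux3's actual on-disk format and every name in it is hypothetical; the
point is only that each subvolume's log carries its own sequence
numbers, while allocation records act on the single shared allocation
space.

    #include <stdint.h>

    enum rectype { LOG_ALLOC, LOG_FREE, LOG_DATA };

    struct logrec {
            uint8_t  type;        /* LOG_ALLOC, LOG_FREE, ... */
            uint64_t block;       /* first block in the shared space */
            uint32_t count;       /* number of blocks affected */
    };

    struct logblock {
            uint32_t subvol;      /* subvolume that owns this log */
            uint64_t seq;         /* position in that subvolume's log */
            uint32_t entries;     /* records that follow */
            struct logrec rec[8]; /* fixed size for this sketch */
    };

Because LOG_ALLOC and LOG_FREE records from every subvolume act on the
same shared allocation map, recovery can only produce a consistent
allocation state if it accounts for every log, not just the logs of
the subvolumes that happen to be remounted.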
So what do we do? Some ideas:
1) Drop the multiple subvolume feature.
2) When the first subvolume is remounted after a crash, scan all other
subvolumes for allocation changes, roll those up into a dedicated
allocation log, and mark in the dedicated allocation log the highest
log sequence numbers of the subvolume logs that were rolled up into it
(see the sketch after this list).
3) When the first subvolume is remounted after a crash, implicitly
remount and replay all subvolumes that were also mounted at the time
of the crash, roll up the logs, and unmount them.
4) Partition the allocation space so that each subvolume allocates
from a completely independent allocation space, which is separately
logged and synced. Either implement this by providing an additional
level of indirection so that each subvolume has its own map of the
complete volume, which may be expanded from time to time by large
increments, or record in each subvolume allocation map only those
regions that are free and available to the subvolume.
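For idea 2, the roll-up might look something like the following C
sketch, reusing the hypothetical logblock layout from the earlier
sketch; the helper names here are likewise invented for illustration.
The essential bookkeeping is the per-subvolume high-water sequence
number, committed inside the dedicated allocation log itself so that
no subvolume log is ever rolled up twice.

    #define MAX_SUBVOLS 16

    struct rollup {
            uint64_t highseq[MAX_SUBVOLS]; /* highest seq rolled up */
    };

    /* Hypothetical: append one record to the dedicated allocation log */
    void append_alloc_log(struct logrec *rec);

    /* Scan every subvolume's log (modeled here as in-memory arrays)
     * and copy just the allocation records into the dedicated
     * allocation log, skipping anything at or below the recorded
     * high-water mark. */
    void rollup_allocation(struct logblock *logs[], int nblocks[],
                           struct rollup *roll)
    {
            for (int vol = 0; vol < MAX_SUBVOLS; vol++) {
                    for (int b = 0; b < nblocks[vol]; b++) {
                            struct logblock *log = &logs[vol][b];
                            if (log->seq <= roll->highseq[vol])
                                    continue; /* already rolled up */
                            for (uint32_t i = 0; i < log->entries; i++) {
                                    struct logrec *rec = &log->rec[i];
                                    if (rec->type == LOG_ALLOC ||
                                        rec->type == LOG_FREE)
                                            append_alloc_log(rec);
                            }
                            roll->highseq[vol] = log->seq;
                    }
            }
            /* roll->highseq is then written into the allocation log's
             * commit block, atomically with the rolled-up records. */
    }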
I am tending towards solution 2 or 4 at the moment, though there are
no doubt other approaches I have not considered. The main goal is to
avoid adding so much complexity that it devalues the attractiveness of
the subvolume feature, which as I said earlier is not a feature
anybody has actually asked for.
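For concreteness, the second variant of solution 4 might record per
subvolume only the regions of the shared volume that the subvolume is
entitled to allocate from, along the lines of this hypothetical C
layout:

    struct region {
            uint64_t start;        /* first block of the region */
            uint64_t count;        /* blocks in the region */
    };

    struct subvol_space {
            uint32_t subvol;       /* owner of these regions */
            uint32_t nr_regions;   /* regions currently granted */
            struct region map[16]; /* fixed size for this sketch */
    };

A subvolume would then allocate, log and sync entirely within its own
granted regions, asking a global grant map for another large region
only when it runs dry, so no two subvolume logs would ever describe
the same blocks.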
Solution 4 seems to encroach on the territory of the volume manager,
something Tux3 wishes to avoid. We would be better advised to improve
the volume manager so that it is capable enough to provide such
incremental allocation itself in a way that maps well to the needs of
filesystems such as Tux3.
I CC'd this one to Matt Dillon, perhaps mainly for sympathy. HAMMER
does not have this issue, as it does not support subvolumes, perhaps
wisely.
Regards,
Daniel