[Tux3] Design note: Atomic filesystem changes
Daniel Phillips
phillips at phunq.net
Tue Dec 30 15:55:23 PST 2008
Tux3 implements atomic filesystem changes by ensuring that related
changes to disk blocks are never split across deltas. For example, an
unlink changes a directory entry block and an inode table block (and
bitmaps as well, but that is another topic). The changed versions of
these blocks should be written in the same delta. Similarly, when a
data block is allocated and written, then the dleaf that points to it
should be written in the same delta.
It would be a slight exaggeration to use the term transaction here, so I
will say change instead. The idea is that, when Tux3 decides a delta
transition is needed for one of several possible reasons, all parallel
changes in progress must finish, then the delta counter is incremented
by one of the tasks. Whichever task increments the delta counter also
takes care of staging and committing the delta. In this way, Tux3 will
run without a dedicated daemon doing delta transitions, a nicety that
avoids extra task in the task list and also avoids some typical
deadlocks that are possible with daemons, which need tricky code to
prevent. On the other hand, it means that some random task will
sometimes have higher latency for one of its buffered filesystem
operations, because it took care of the delta transition. This is not
really anything unusual for Linux, which does something similar with
memory scanning, but it is worth keeping in mind that we may want to
add an optional daemon at some point. Because delta transitions will
be relatively under heavy load, we could even fork a task just to do a
single transition and probably nobody would notice.
The delta counter is used to set the dirty state of a block buffer. The
block buffer dirty state actually has the low couple of bits of the
delta counter, because there may be up to three deltas in flight at any
given time (possibly four in the future). Each time a block buffer is
changed, the delta counter is compared to the dirty state. If the
buffer is dirty in a previous delta then it will be "forked", that is,
its original data removed from the buffer hash and replaced by a copy
that can be freely altered without affecting the original. The
original data remains attached to the earlier delta, on its way to
disk.
Here is skeleton code for atomic filesystem changes:
struct sb {
...
atomic_t delta;
struct rw_semaphore delta_lock;
...
};
void begin_change(struct sb *sb, unsigned reserve)
{
wait_for_reserve(sb, reserve);
down_read(&sb->delta_lock);
}
void end_change(struct sb *sb, unsigned delta)
{
down_read(&sb->delta_lock);
if (need_delta(sb)) {
unsigned delta = atomic_read(&sb->delta);
up_read(&sb->delta_lock);
down_write(&sb->delta_lock);
if (sb->delta == atomic_read(&delta)) {
atomic_inc(&sb->delta);
up_write(&sb->delta_lock);
stage_delta(sb, delta);
commit_delta(sb, delta);
} else
up_write(&sb->delta_lock);
} else
up_read(&sb->delta_lock);
}
Error handling omitted. At the down_write the read lock is released and
multiple parallel transactions may be trying to get the write lock.
They will all eventually succeed, but only one of them will detect that
the delta counter is unchanged and increment it.
In end_change, the delta_lock is held exclusive for a very short time,
which is good because it will be a very busy lock. There are more
atomic operations than I would prefer, and I do not like the way tasks
can pile up on the write_lock. I also wish there were a way to keep
operations running in parallel during the delta transition, but that
seems hard. Like the SMP locking, this will improve over time.
However, it should be usable and efficient enough for our immediate
needs.
All the same operations that were recently enclosed in spinlocks will
now also be enclosed in begin_change/end_change, with the addition of
directory operations and the possible exception of allocation bitmap
operations, more on that later.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list