[Tux3] Design note: Atomic filesystem changes

Tue Dec 30 15:55:23 PST 2008

Tux3 implements atomic filesystem changes by ensuring that related 
changes to disk blocks are never split across deltas.  For example, an 
unlink changes a directory entry block and an inode table block (and 
bitmaps as well, but that is another topic).  The changed versions of 
these blocks should be written in the same delta.  Similarly, when a 
data block is allocated and written, then the dleaf that points to it 
should be written in the same delta.

It would be a slight exaggeration to use the term transaction here, so I 
will say change instead.  The idea is that, when Tux3 decides a delta 
transition is needed for one of several possible reasons, all parallel 
changes in progress must finish, then the delta counter is incremented 
by one of the tasks.  Whichever task increments the delta counter also 
takes care of staging and committing the delta.  In this way, Tux3 will 
run without a dedicated daemon doing delta transitions, a nicety that 
avoids an extra task in the task list and also avoids some typical 
deadlocks that are possible with daemons, which need tricky code to 
prevent.  On the other hand, it means that some random task will 
sometimes have higher latency for one of its buffered filesystem 
operations, because it took care of the delta transition.  This is not 
really anything unusual for Linux, which does something similar with 
memory scanning, but it is worth keeping in mind that we may want to 
add an optional daemon at some point.  Because delta transitions will 
be relatively infrequent even under heavy load, we could even fork a 
task just to do a single transition and probably nobody would notice.

The delta counter is used to set the dirty state of a block buffer.  The 
block buffer dirty state actually holds the low two bits of the 
delta counter, because there may be up to three deltas in flight at any 
given time (possibly four in the future).  Each time a block buffer is 
changed, the delta counter is compared to the dirty state.  If the 
buffer is dirty in a previous delta then it will be "forked", that is, 
its original data is removed from the buffer hash and replaced by a copy 
that can be freely altered without affecting the original.  The 
original data remains attached to the earlier delta, on its way to 
disk.
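
To make the forking rule concrete, here is a minimal userspace sketch.  
It is not Tux3 code: struct buffer, prepare_change, CLEAN, DELTA_MASK 
and BLOCK_SIZE are all hypothetical names, and the "buffer hash" is 
reduced to a single data pointer.  It assumes two low bits of the 
delta counter are enough, which only holds while no more than four 
deltas are in flight.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Assumptions for illustration: two delta bits, tiny blocks. */
enum { DELTA_MASK = 3, BLOCK_SIZE = 16, CLEAN = -1 };

struct buffer {
	int dirty;           /* CLEAN, or low bits of the delta that dirtied it */
	unsigned char *data; /* block contents currently visible to changes */
};

/*
 * Call before modifying a buffer in the given delta.  If the buffer is
 * dirty in an earlier delta, fork it: the buffer gets a private copy and
 * the original data is returned so the earlier delta can still write it.
 * Returns NULL when no fork was needed.
 */
unsigned char *prepare_change(struct buffer *buf, unsigned delta)
{
	int state = delta & DELTA_MASK;
	unsigned char *old = NULL;

	if (buf->dirty != CLEAN && buf->dirty != state) {
		unsigned char *copy = malloc(BLOCK_SIZE);

		if (!copy)
			abort();
		memcpy(copy, buf->data, BLOCK_SIZE);
		old = buf->data;   /* stays with the earlier delta */
		buf->data = copy;  /* later changes only touch the copy */
	}
	buf->dirty = state;
	return old;
}
```

A change in the same delta reuses the buffer as is; only the first 
change in a newer delta pays for the copy.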

Here is skeleton code for atomic filesystem changes:

struct sb {
	...
	atomic_t delta;
	struct rw_semaphore delta_lock;
	...
};

void begin_change(struct sb *sb, unsigned reserve)
{
	wait_for_reserve(sb, reserve);
	down_read(&sb->delta_lock);
}

void end_change(struct sb *sb)
{
	/* The read lock taken in begin_change is still held here. */
	if (need_delta(sb)) {
		unsigned delta = atomic_read(&sb->delta);
		up_read(&sb->delta_lock);
		down_write(&sb->delta_lock);
		/* Only the first task to get here sees the counter unchanged. */
		if (atomic_read(&sb->delta) == delta) {
			atomic_inc(&sb->delta);
			up_write(&sb->delta_lock);
			stage_delta(sb, delta);
			commit_delta(sb, delta);
		} else
			up_write(&sb->delta_lock);
	} else
		up_read(&sb->delta_lock);
}

Error handling omitted.  The read lock is released just before the 
down_write, so multiple parallel changes may be trying to get the write 
lock at the same time.  They will all eventually succeed, but only one 
of them will find the delta counter unchanged and increment it.

In end_change, the delta_lock is held exclusive for a very short time, 
which is good because it will be a very busy lock.  There are more 
atomic operations than I would prefer, and I do not like the way tasks 
can pile up on the write lock.  I also wish there were a way to keep 
operations running in parallel during the delta transition, but that 
seems hard.  Like the SMP locking, this will improve over time.  
However, it should be usable and efficient enough for our immediate 
needs.

All the same operations that were recently enclosed in spinlocks will 
now also be enclosed in begin_change/end_change, with the addition of 
directory operations and the possible exception of allocation bitmap 
operations; more on that later.

Regards,

Daniel
