[Tux3] Standalone unit test for deferred free list

Daniel Phillips phillips at phunq.net
Wed Jan 21 19:32:49 PST 2009


Hi,

To review, Tux3 uses a "redirect on write" strategy for atomic commit 
where no part of a consistent disk image may be overwritten until the 
next delta has fully arrived on disk.  Instead, a new extent is 
allocated for each write and the old extent is freed.  But the old 
extent cannot be freed immediately, otherwise it might be reallocated 
and overwritten before the new consistent image has arrived on disk.  
So the extent is put on a "deferred free" list to be freed after the 
delta transfer completes.

Rather than kmalloc/kfree or set up a slab cache, I did this a little 
differently, by allocating pages directly for the defree log and 
linking them into a circular list through the page->private pointer.  
Otherwise, the handling is a lot like the logging primitives: there is 
a pointer to the next address to emit an entry and a limit to know when 
to add a new page.

Synchronization is needed because deferred frees can happen in parallel.  
Like logging, this will be a semaphore for now, to be improved to a 
spinlock later.  Taking and releasing the semaphore per deferred free 
will likely be the largest overhead of this subsystem, but it is still 
not worth worrying about right now.  We will do that kind of 
optimization after entering the review cycle.
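
Roughly, the emit path looks like the sketch below.  This is 
illustrative userspace C only, not the attached defree.c: the names 
(defer_free, struct defree_log and so on) and the entry format are 
invented for the example, struct page and its private pointer are 
approximated by a little upage struct, a pthread mutex stands in for 
the kernel semaphore, and error handling is omitted.

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

typedef uint64_t defree_entry;	/* packed extent address/count, format invented */

/* Userspace stand-in for struct page: the circular page list runs
 * through ->private, so the page data holds nothing but entries. */
struct upage {
	struct upage *private;
	char data[PAGE_SIZE];
};

struct defree_log {
	pthread_mutex_t lock;	/* stands in for the kernel semaphore */
	struct upage *tail;	/* tail of the circular page list */
	defree_entry *next;	/* where the next entry is emitted */
	defree_entry *limit;	/* end of the current page */
};

/* Hang a fresh page after the tail and advance the tail pointer */
static void add_page(struct defree_log *log)
{
	struct upage *page = malloc(sizeof *page);
	if (log->tail) {
		page->private = log->tail->private;
		log->tail->private = page;
	} else
		page->private = page;	/* a single page points at itself */
	log->tail = page;
	log->next = (defree_entry *)page->data;
	log->limit = (defree_entry *)(page->data + PAGE_SIZE);
}

/* Record an extent to be freed once the current delta is on disk */
void defer_free(struct defree_log *log, defree_entry extent)
{
	pthread_mutex_lock(&log->lock);
	if (log->next == log->limit)
		add_page(log);
	*log->next++ = extent;
	pthread_mutex_unlock(&log->lock);
}

In the common case an emit is just a compare and a store under the 
lock; a fresh page only enters the picture when the current one fills 
up.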

The cute trick here is the single linked circular list that maintains a 
queue of pages.  This technique is probably overkill in terms of 
efficiency for this application, however it is simple, general and 
likely to be useful in the future, so I will describe how it works in 
some detail.

The entire reason for using single linked lists instead of double linked 
is to save a link field and associated updates.  The main drawback is, 
if you have a pointer to a given list element then deleting that 
element requires searching on average half the list.  This is O(N) in 
the length of the list, whereas deletion from a double linked list is 
O(1).  But many applications do not require random deletion, and this 
is one of them.  We just want our list to act like a queue.

A single linked list is better at acting like a stack than a queue: 
adding an element to the head of the list or removing the head element 
are O(1), while appending an element to the tail is O(N) because it 
requires traversing the list.  However, if we make the list circular by 
connecting the last element to the first, and keep a pointer to the 
tail element instead of the head, appending to the tail is O(1), and so 
is pushing onto the head.  In fact, these are the same operation.  The 
only difference is, for an append we advance the tail pointer to point 
at the new tail element.

While deletion of a random single linked list element is O(N), deletion 
of the element following a given element is O(1).  This is convenient 
for a queue, because we have a pointer to the queue tail that can be 
used to delete the queue head efficiently.

Here are the generic single linked list operations:

struct link { struct link *next; };

/* Insert node directly after the given list position */
void link_add(struct link *node, struct link *list)
{
	node->next = list->next;
	list->next = node;
}

/* Unlink the element that follows the given list position */
void link_del_next(struct link *list)
{
	list->next = list->next->next;
}

That's all!  These are modelled on the kernel list.h API, and we use 
them in the same way, that is, container_of is used to find the 
containing structure given a list link.  Each link points directly at 
another link struct and is therefore generic: there is no tie to any 
specific containing struct type.  We give up some type safety, in that 
it is easy to make a mistake writing a container_of(link...) and the 
compiler will not catch it - very strange things are likely to happen.  
On the other hand, it is possible to write generic list processing 
algorithms using these generic links, which is a welcome luxury in a 
very non-abstract language like C.
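
As a small illustration, here is how a generic circular list can be 
walked from its tail while recovering the containing structures.  The 
widget struct and show_widgets are invented for the example, and 
container_of is spelled out the way it would be in userspace:

#include <stddef.h>
#include <stdio.h>

/* Userspace spelling of the kernel's container_of */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* A hypothetical container: any struct can embed a struct link */
struct widget {
	int payload;
	struct link link;
};

/* Walk a circular list from its tail, visiting every widget once */
void show_widgets(struct link *tail)
{
	struct link *node = tail;
	do {
		node = node->next;	/* the head comes right after the tail */
		struct widget *w = container_of(node, struct widget, link);
		printf("widget %i\n", w->payload);
	} while (node != tail);
}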

A single linked link_add is identical to the double linked list_add, 
except for not setting prev pointers.  We do not provide a link_del, 
because that would require a loop; link_del_next, however, is easy.

Given a pointer to the tail element of a circular list "circ" and a new 
element "node", appending "node" to the tail of "circ" is:

	link_add(node, circ);
	circ = circ->next;

Deleting the head element is:

	link_del_next(circ);

Though it is probably a bad idea to write code that depends on it, 
repeatedly deleting from a single element circular list is possible and 
leaves the list unchanged.

Note that add and del do not care whether a list is circular or not.  
Once a list is made circular by placing a single element on it that 
points at itself, the list stays circular under any combination of add 
and del operations.
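
Putting these pieces together, a queue built this way needs nothing 
more than the two primitives above plus a tail pointer.  The wrapper 
below is only a sketch with invented names, not code from defree.c:

/* The tail pointer is the entire queue state */
struct queue { struct link *tail; };

/* A circular list is born with one element pointing at itself */
void queue_init(struct queue *q, struct link *first)
{
	first->next = first;
	q->tail = first;
}

/* Append: push after the tail, then advance the tail pointer */
void queue_append(struct queue *q, struct link *node)
{
	link_add(node, q->tail);
	q->tail = node;
}

/* The head is always the element right after the tail */
struct link *queue_head(struct queue *q)
{
	return q->tail->next;
}

/* Delete the head in O(1), courtesy of the tail pointer */
void queue_del_head(struct queue *q)
{
	link_del_next(q->tail);
}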

The simplicity of this scheme is partly due to always having at least 
one element on the queue, which is fine for the current application: 
there is no point in completely emptying the queue at the end of a 
delta, because most deltas will do at least a few deferred frees.  If 
some application needed a completely empty queue, a test for delete 
from a single element queue and a test for append to an empty queue 
could be added.  Or perhaps more elegantly, add a dedicated queue head, 
which 
is only a single memory cell: to delete the first element, do 
link_del_next(circ->next).  Other operations remain as above, and the 
queue is empty when the circ pointer points at the head element.
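
For completeness, the dedicated head variant could look something like 
the sketch below, again with invented names, reusing link_add and 
link_del_next from above.  The one extra detail is that when the last 
real element is deleted, the tail pointer has to be pulled back to the 
head cell so that the empty test keeps working:

/* The head cell lives on the circular list permanently; the queue is
 * empty when the tail pointer points at the head cell. */
struct headed_queue {
	struct link head;	/* permanent single-cell head */
	struct link *tail;	/* last element, or &head when empty */
};

void hq_init(struct headed_queue *q)
{
	q->head.next = &q->head;	/* circular from the start */
	q->tail = &q->head;
}

int hq_empty(struct headed_queue *q)
{
	return q->tail == &q->head;
}

/* Append is unchanged from the headless version */
void hq_append(struct headed_queue *q, struct link *node)
{
	link_add(node, q->tail);
	q->tail = node;
}

/* Delete the first element: link_del_next(circ->next) as above, since
 * tail->next is always the head cell */
void hq_del_first(struct headed_queue *q)
{
	struct link *first = q->head.next;
	link_del_next(&q->head);
	if (q->tail == first)	/* that was the only element */
		q->tail = &q->head;
}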

From time to time a delta will do a massive number of deferred frees, 
and it is nice to know that this simple mechanism has no upper boundary 
other than memory availability, and remains perfectly efficient.  In 
terms of cache efficiency it is very good.  It completely fills each 
cache line before moving on to the next one.  In general, a list of 
vectors performs better than a linked list of single elements, which 
dirties a lot more cache to do the same thing.  There is only one 
allocation per 512 deferred frees.  Because the allocation is a full 
page, there is no chance of fragmenting a slab allocator, and extra 
pages to handle an unusually large number of frees are always given 
back to the system immediately on delta completion.  In short, our 
deferred free managing subsystem is a good kernel citizen.
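
To show how the extra pages go back, the retire side at delta 
completion could look roughly like the sketch below.  It reuses the 
invented types from the emit sketch above, free_extent stands in for 
whatever the allocator's free entry point is, and it assumes the next 
delta has not started emitting yet, so no locking is shown:

/* Free every recorded extent, give back all pages but one, and leave
 * that one on the (still circular) list ready for the next delta. */
void retire_defrees(struct defree_log *log, void (*free_extent)(defree_entry))
{
	struct upage *keep = log->tail;	/* partly filled tail page */
	struct upage *page;
	defree_entry *entry;

	if (!keep)
		return;		/* nothing was ever deferred */

	/* Every page except the tail is completely full */
	for (page = keep->private; page != keep;) {
		struct upage *next = page->private;
		defree_entry *limit = (defree_entry *)(page->data + PAGE_SIZE);
		entry = (defree_entry *)page->data;
		while (entry < limit)
			free_extent(*entry++);
		free(page);
		page = next;
	}

	/* The tail page is full only up to the emit pointer */
	entry = (defree_entry *)keep->data;
	while (entry < log->next)
		free_extent(*entry++);

	/* Keep one empty page around for the next delta */
	keep->private = keep;
	log->next = (defree_entry *)keep->data;
	log->limit = (defree_entry *)(keep->data + PAGE_SIZE);
}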

The entire implementation of deferred free management is about 40 lines, 
plus the few lines of single linked list support, which we may well use 
for something else.  I have attached a standalone userspace unit test 
program.  This code should run unchanged in kernel, which I will now 
verify.

Regards,

Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: defree.c
Type: text/x-csrc
Size: 2904 bytes
Desc: not available
URL: <http://phunq.net/pipermail/tux3/attachments/20090121/2b1471d7/attachment.c>