[Tux3] Standalone unit test for deferred free list
Daniel Phillips
phillips at phunq.net
Wed Jan 21 19:32:49 PST 2009
Hi,
To review, Tux3 uses a "redirect on write" strategy for atomic commit
where no part of a consistent disk image may be overwritten until the
next delta has fully arrived on disk. Instead, a new extent is
allocated for each write and the old extent is freed. But the old
extent cannot be freed immediately, otherwise it might be reallocated
and overwritten before the new consistent image has arrived on disk.
So the extent is put on a "deferred free" list to be freed after the
delta transfer completes.
Rather than use kmalloc/kfree or set up a slab cache, I did this a
little differently: pages are allocated directly for the defree log and
linked into a circular list through the page->private pointer.
Otherwise, the handling is a lot like the logging primitives: there is
a pointer to the next address at which to emit an entry, and a limit
that says when to add a new page.
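Just to illustrate the emit pointer and limit, here is a rough
userspace sketch (the names are made up, this is not the attached
defree.c, and the circular chaining of the pages is left out because it
is described below):

    #include <stdlib.h>

    #define PAGE_SIZE 4096

    /* Hypothetical defree log cursor: where the next entry goes and
     * where the current page ends. Initialize with top == limit (for
     * example, both NULL) so the first call allocates a page. */
    struct stash {
        unsigned long long *top;    /* next entry slot in current page */
        unsigned long long *limit;  /* one past the end of current page */
    };

    /* Record one extent to be freed after the delta commits. When the
     * current page fills, start a new one. (The real code also links
     * the new page into the circular page list; this sketch does not.) */
    int defree(struct stash *stash, unsigned long long extent)
    {
        if (stash->top == stash->limit) {
            unsigned long long *page = malloc(PAGE_SIZE);
            if (!page)
                return -1;
            stash->top = page;
            stash->limit = page + PAGE_SIZE / sizeof(unsigned long long);
        }
        *stash->top++ = extent;
        return 0;
    }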
Synchronization is needed because deferred frees can happen in parallel.
Like logging, this will be a semaphore for now, eventually to be
improved to a spinlock. Taking and releasing the semaphore per
deferred free will likely be the largest overhead of this subsystem,
but that is still not worth worrying about right now. We will do that
kind of optimization after entering the review cycle.
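In the userspace unit test the serialization could be nothing fancier
than a mutex around the emit; a hypothetical wrapper over the defree()
sketched above (again, not the attached code) might be:

    #include <pthread.h>

    static pthread_mutex_t defree_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for the semaphore (eventually spinlock): serialize
     * emits from parallel deferred frees. */
    int defree_locked(struct stash *stash, unsigned long long extent)
    {
        pthread_mutex_lock(&defree_lock);
        int err = defree(stash, extent);
        pthread_mutex_unlock(&defree_lock);
        return err;
    }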
The cute trick here is the single linked circular list that maintains a
queue of pages. This technique is probably overkill in terms of
efficiency for this application, however it is simple, general and
likely to be useful in the future, so I will describe how it works in
some detail.
The entire reason for using single linked lists instead of double linked
is to save a link field and its associated updates. The main drawback
is that, given a pointer to a particular list element, deleting that
element requires searching on average half the list. This is O(N) in
the length of the list, whereas deletion from a double linked list is
O(1). But many applications do not require random deletion, and this
is one of them. We just want our list to act like a queue.
A single linked list is better at acting like a stack than a queue:
adding an element to the head of the list or removing the head element
are O(1), while appending an element to the tail is O(N) because it
requires traversing the list. However, if we make the list circular by
connecting the last element to the first, and keep a pointer to the
tail element instead of the head, appending to the tail is O(1), and so
is pushing onto the head. In fact, these are the same operation; the
only difference is that for an append we advance the tail pointer to
point at the new tail element.
While deletion of a random single linked list element is O(N), deletion
of the element following a given element is O(1). This is convenient
for a queue, because we have a pointer to the queue tail that can be
used to delete the queue head efficiently.
Here are the generic single linked list operations:
    struct link { struct link *next; };

    /* Insert node directly after the given list element */
    void link_add(struct link *node, struct link *list)
    {
        node->next = list->next;
        list->next = node;
    }

    /* Unlink whatever element follows the given list element */
    void link_del_next(struct link *list)
    {
        list->next = list->next->next;
    }
That's all! These are modelled on the kernel list.h API, and we use
them in the same way, that is, container_of is used to find the
containing structure given a list link. Each link points directly at
another link struct and is therefore generic: there is no tie to any
specific containing struct type. We give up some type safety, in that
it is easy to make a mistake writing a container_of(link...) and the
compiler will not catch it - very strange things are likely to happen.
On the other hand, it is possible to write generic list processing
algorithms using these generic links, which is a welcome luxury in a
very non-abstract language like C.
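To make the container_of pattern concrete, here is a small hypothetical
example (struct widget and its fields are invented for illustration,
not taken from the attached code), using a userspace stand-in for the
kernel macro:

    #include <stddef.h>
    #include <stdio.h>

    /* Userspace stand-in for the kernel's container_of */
    #define container_of(ptr, type, member) \
        ((type *)((char *)(ptr) - offsetof(type, member)))

    struct link { struct link *next; };  /* as defined above */

    struct widget {                 /* hypothetical containing struct */
        int payload;
        struct link link;           /* generic link embedded inside */
    };

    int main(void)
    {
        struct widget w = { .payload = 42 };
        w.link.next = &w.link;      /* one-element circular list */

        struct link *cursor = &w.link;
        struct widget *back = container_of(cursor, struct widget, link);
        printf("payload = %d\n", back->payload);  /* prints 42 */
        return 0;
    }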
A single linked link_add is identical to the double linked list_add,
except for not setting prev pointers. We do not provide a link_del,
because that would require a loop, but link_del_next is easy.
Given a pointer to the tail element of a circular list "circ" and a new
element "node", appending "node" to the tail of "circ" is:
    link_add(node, circ);
    circ = circ->next;
Deleting the head element is:
    link_del_next(circ);
Though it is probably a bad idea to write code that depends on it,
repeatedly deleting from a single element circular list is possible and
leaves the list unchanged: the sole element points at itself, so the
delete is a no-op.
Note that add and del do not care whether a list is circular or not.
Once a list is made circular by placing a single element on it that
points at itself, the list stays circular under any combination of add
and del operations.
The simplicity of this scheme is partly due to always having at least
one element on the queue, which is fine for the current application:
there is no point in completely emptying the queue at the end of a
delta, because most deltas will do at least a few deferred frees. If
some application needed a completely empty queue, a test for delete from
a single element queue and a test for append to an empty queue could be
added. Or, perhaps more elegantly, add a dedicated queue head, which
is only a single memory cell: to delete the first element, do
link_del_next(circ->next). Other operations remain as above, and the
queue is empty when the circ pointer points at the head element.
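A minimal sketch of that dedicated-head variant, using the link_add and
link_del_next above (the queue wrapper and its names are mine, not from
the attached code); note that removing the last real element also has
to pull the tail pointer back to the head:

    struct queue {
        struct link head;   /* dedicated head element, never removed */
        struct link *circ;  /* tail pointer; points at head when empty */
    };

    void queue_init(struct queue *q)
    {
        q->head.next = &q->head;    /* single element circular list */
        q->circ = &q->head;
    }

    int queue_empty(struct queue *q)
    {
        return q->circ == &q->head;
    }

    void queue_append(struct queue *q, struct link *node)
    {
        link_add(node, q->circ);
        q->circ = node;             /* advance tail to the new element */
    }

    /* Caller should check queue_empty() first */
    void queue_del_head(struct queue *q)
    {
        struct link *first = q->circ->next->next;  /* circ->next is head */
        link_del_next(q->circ->next);
        if (first == q->circ)       /* just removed the last element */
            q->circ = &q->head;
    }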
From time to time a delta will do a massive number of deferred frees,
and it is nice to know that this simple mechanism has no upper bound
other than memory availability, and remains perfectly efficient. In
terms of cache efficiency it is very good: it completely fills each
cache line before moving on to the next one. In general, a list of
vectors performs better than a linked list of single elements, which
dirties a lot more cache to do the same thing. There is only one
allocation per 512 deferred frees. Because the allocation is a full
page, there is no chance of fragmenting a slab allocator, and extra
pages to handle an unusually large number of frees are always given
back to the system immediately on delta completion. In short, our
deferred free managing subsystem is a good kernel citizen.
The entire implementation of deferred free management is about 40 lines,
plus the few lines of single linked list support, which we may well use
for something else. I have attached a standalone userspace unit test
program. This code should run unchanged in kernel, which I will now
verify.
Regards,
Daniel
-------------- next part --------------
A non-text attachment was scrubbed...
Name: defree.c
Type: text/x-csrc
Size: 2904 bytes
Desc: not available
URL: <http://phunq.net/pipermail/tux3/attachments/20090121/2b1471d7/attachment.c>