[Tux3] Allocation in allocation map flush revisited

Daniel Phillips phillips at phunq.net
Mon Jan 12 20:27:41 PST 2009


As requested, here are some more details on the recursive allocation 
problem, and how block forking solves it.  Or nearly solves it.

I indulged in a little hubris when I claimed that the two shiny new 
techniques of allocation logging and block forking would instantly 
solve the allocation-during-bitmap-flush problem.  While these 
techniques do in themselves give us the stable bitmap table snapshot 
that we require, the details of getting the right data on disk at the 
right time turned out to be a little devilish.

I implemented a new unit test in user/commit.c to try out bitmap 
flushing, and the initial runs showed just the right results.  Nearly 
ready to declare victory, I suddenly had new reservations, wrote a new 
unit test, and found a flaw: in fact the bitmap data arriving on disk 
was the same as if there were no fancy forking and logging at all.  So 
back to the drawing board.

The lazy solution would be just to ignore the problem.  The difference 
between an absolutely correct on-disk bitmap block snapshot and a 
slightly later version into which additional allocations have leaked is 
just a few bits.  These are represented in the log by allocation 
promises, and may or may not be set in the on-disk bitmap blocks 
depending on the order in which the blocks were written.  On replay, 
the promised allocations are entered into the bitmaps to reconstruct 
the current bitmap cache.  If we quietly leave out the check that the 
affected bits were not already set, everything just works and the 
world is a happy place, right?
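
To make that concrete, here is a toy sketch in C of what replay does 
with one allocation promise, including the safety check in question.  
All the names below (replay_alloc, the flat bitmap array) are made up 
for illustration and are not the actual Tux3 replay code.

    #include <stdint.h>
    #include <stdio.h>

    enum { MAPBYTES = 1 << 20 };       /* toy bitmap cache size */

    static uint8_t bitmap[MAPBYTES];   /* in-memory bitmap cache */

    static int test_bit(uint64_t block)
    {
            return (bitmap[block >> 3] >> (block & 7)) & 1;
    }

    static void set_bit(uint64_t block)
    {
            bitmap[block >> 3] |= 1 << (block & 7);
    }

    /* Replay one logged allocation promise into the bitmap cache. */
    static int replay_alloc(uint64_t block)
    {
            /*
             * Safety check: with an exact on-disk bitmap snapshot, a
             * promised block can never already be marked allocated.
             * The lazy solution above would have to drop this check,
             * because later allocations may have leaked into the
             * bitmap blocks before they reached disk.
             */
            if (test_bit(block)) {
                    fprintf(stderr, "replay: block %llu already set\n",
                            (unsigned long long)block);
                    return -1;
            }
            set_bit(block);
            return 0;
    }

    int main(void)
    {
            replay_alloc(42);
            replay_alloc(43);
            return replay_alloc(42) ? 1 : 0;  /* trips the check */
    }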

Except that I found it a little disturbing to leave out a safety check 
just because we were not smart enough to place exactly the right data 
on disk.  Even though Tux3 is not a nuclear reactor, we still hate the 
idea of going Chernobyl on our disk files, so one thing we do not want 
to do is leave out a perfectly good safety check.

In the end, a precise solution only cost 15 new lines of code, although 
determining exactly which 15 lines were the right ones took a couple of 
days.  A flock of Tux3 techniques came into play: block forking, the 
log, allocation promises, index update promises, polymorphic block IO 
transfers, per-delta dirty lists, per-inode dirty lists, and delta 
transitions, all of which are also needed for other reasons.  The 
winning check-in is here:

   http://hg.tux3.org/tux3/rev/42fe4237d34b
   "A reworked fix to the allocate-during-bitmap-flush problem"

This solution is more or less as simple as I originally claimed.  
Forking leaves the clone buffer carrying the original data on the dirty 
list for the delta.  The clone buffer points at the same allocation map 
as the original and has the same index, but is not entered in the 
hash map.
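
Roughly speaking, a fork leaves things in the state sketched below.  
The structure layout and fork_buffer are simplified stand-ins rather 
than the real Tux3 buffer layer, and dirty list handling is omitted.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>

    struct map;  /* stand-in for the allocation map object */

    /* simplified stand-in for the real Tux3 buffer */
    struct buffer {
            struct map *map;   /* owning allocation map */
            uint64_t index;    /* logical block index in that map */
            void *data;        /* block data */
            unsigned delta;    /* delta whose dirty list holds us */
            int hashed;        /* reachable through the buffer hash? */
    };

    /*
     * Fork a dirty buffer at a delta transition.  The clone keeps the
     * stable data and stays with the old delta, sharing the original's
     * map and index, but is never entered in the hash.  The original
     * gets a private copy of the data and moves to the current delta,
     * so new modifications cannot leak into the snapshot being
     * flushed.  (Dirty list relinking is left out of this sketch.)
     */
    struct buffer *fork_buffer(struct buffer *buffer, unsigned blocksize,
                               unsigned current_delta)
    {
            struct buffer *clone = malloc(sizeof *clone);
            void *copy = malloc(blocksize);

            if (!clone || !copy) {
                    free(clone);
                    free(copy);
                    return NULL;
            }
            *clone = *buffer;        /* same map, index and stable data */
            clone->hashed = 0;       /* but not in the hash map */
            memcpy(copy, buffer->data, blocksize);
            buffer->data = copy;     /* original carries fresh data */
            buffer->delta = current_delta;
            return clone;
    }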

The problem I was having was due to mapping a buffer to disk and 
submitting IO for it at the same time.  The IO submitted would be for 
the forked copy of the buffer data, sometimes with unwanted bits in it.  
Also, buffers appearing later in the bitmap table dirty list might 
already be forked, and again we would write out the current version of 
the bitmap instead of the desired read-only clone.  This was solved by 
dividing the bitmap block flush into two steps: 1) map the buffer to 
disk; 2) write out the buffer unless its dirty state is now the current 
delta, indicating that it was forked during the mapping.  In short, we 
just skip any forked blocks during the bitmap flush, then flush the 
delta dirty list, and we are done.
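
In outline, the two-step flush looks something like this, with 
map_buffer_to_disk, write_buffer and buffer_delta standing in for the 
real helpers in the check-in above:

    struct buffer;  /* as above; only the delta state matters here */

    /* hypothetical stand-ins for the real helpers */
    int map_buffer_to_disk(struct buffer *buffer);  /* pick disk address */
    int write_buffer(struct buffer *buffer);        /* single-buffer IO */
    unsigned buffer_delta(struct buffer *buffer);

    /*
     * Two-step flush of the bitmap's dirty buffers.  Mapping a buffer
     * may allocate blocks, which dirties (and therefore forks) bitmap
     * buffers again; any buffer whose dirty state has moved to the
     * current delta was forked during mapping and is skipped here,
     * because its stable clone will be written when the delta dirty
     * list is flushed.
     */
    int flush_bitmap(struct buffer **dirty, int count,
                     unsigned current_delta)
    {
            for (int i = 0; i < count; i++) {
                    int err = map_buffer_to_disk(dirty[i]);   /* step 1 */
                    if (err)
                            return err;
                    if (buffer_delta(dirty[i]) == current_delta)
                            continue;            /* forked: skip */
                    err = write_buffer(dirty[i]);             /* step 2 */
                    if (err)
                            return err;
            }
            return 0;
    }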

Flushing the delta dirty list is a minor subproblem in itself.  This 
list may contain blocks from several kinds of structures, which are 
mapped to disk in different ways.  The main variants are logically 
indexed buffers such as dirent blocks, and physically indexed buffers 
such as btree nodes.  Now we have a third variant: unhashed logically 
indexed buffers, that is, clones resulting from buffer forks.  The 
normal method for writing logically indexed buffers (filemap_extent_io) 
is not suitable because it looks up the buffer that will be written in 
the mapping hash as part of an optimization to generate larger 
transfers.  This will write out the wrong data.  So instead, the 
allocation map gets its own IO mapping method, which uses the existing 
file mapping method for read, but a simpler, single-block method for 
write.  This method includes a check that the data being written does 
not belong to the current delta, which has not yet been committed.
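
As a sketch, the per-map method table and the bitmap write path might 
look like the following.  Only filemap_extent_io is a real name from 
the code; map_ops, bitmap_blockio and the other helpers are invented 
for illustration.

    struct map;
    struct buffer;

    /* per-map method table: each map supplies its own block IO method */
    struct map_ops {
            int (*blockio)(struct buffer *buffer, int write);
    };

    /* the normal logically indexed IO path; batches adjacent buffers
     * found through the mapping hash into larger transfers */
    int filemap_extent_io(struct buffer *buffer, int write);

    /* hypothetical helpers for this sketch */
    int write_single_block(struct buffer *buffer);  /* no hash lookup */
    unsigned buffer_delta(struct buffer *buffer);
    unsigned current_delta(struct map *map);
    struct map *buffer_map(struct buffer *buffer);

    static int bitmap_blockio(struct buffer *buffer, int write)
    {
            if (!write)
                    return filemap_extent_io(buffer, 0);  /* read as usual */
            /* refuse to write data still being modified under the
             * current, uncommitted delta */
            if (buffer_delta(buffer) == current_delta(buffer_map(buffer)))
                    return -1;
            return write_single_block(buffer);
    }

    static const struct map_ops bitmap_ops = { .blockio = bitmap_blockio };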

So here we find a useful application of the polymorphic mapping method 
feature I have implemented in user space, and which I was never sure 
had a solid reason to exist.  Now it has proved itself useful enough 
to deserve a kernel implementation too.

Speaking of kernel, this solution needs to be implemented in kernel as 
well.  I am really glad I was able to bang away at this issue in user 
space instead of kernel, with its longer development cycles and near 
total lack of unit testing.  All the while, I have been mindful that 
the techniques employed can be implemented in kernel with its 
stringent SMP locking requirements and highly optimized internal 
interfaces, which in some cases do not resemble what we do in user 
space very much at all.  In particular, block forking is a new and 
untried technique for kernel, and far more difficult than in user space 
because of multiple blocks in different, asynchronously changing states 
sharing the same underlying data page.  We have to rip the data page 
out from underneath a bunch of buffers and slip in a new one without 
any of them noticing.  Kind of like the trick where you pull the table 
cloth out from underneath the dinner plates, so fast that nothing 
crashes to the floor.  Except that we also have to copy the table cloth 
and slip the copy back underneath the dinnerware before it settles back 
onto the table.  Fun.  We think this is possible, and pretty soon we 
will find out for sure.

Regards,

Daniel
