[Tux3] Allocation in allocation map flush revisited
Daniel Phillips
phillips at phunq.net
Mon Jan 12 20:27:41 PST 2009
As requested, here are some more details on the recursive allocation
problem, and how block forking solves it. Or nearly solves it.
I indulged in a little hubris when I claimed that the two shiny new
techniques of allocation logging and block forking would instantly
solve the allocation-during-bitmap-flush problem. While these
techniques do in themselves give us the stable bitmap table snapshot
that we require, the details of getting the right data on disk at the
right time turned out to be a little devilish.
I implemented a new unit test in user/commit.c to try out bitmap
flushing, and the initial runs showed just the right results. Nearly
ready to declare victory, I suddenly had new reservations, wrote a new
unit test, and found a flaw: in fact the bitmap data arriving on disk
was the same as if there were no fancy forking and logging at all. So
back to the drawing board.
The lazy solution would be just to ignore the problem. The difference
between an absolutely correct on-disk bitmap block snapshot and a
slightly later version into which additional allocations have leaked is
just a few bits. These are represented in the log by allocation
promises, and may or may not be set in the on-disk bitmap blocks
depending on the order in which the blocks were written. On replay,
the promised allocations are entered into the bitmaps to reconstruct
the current bitmap cache. If we quietly leave out the check that the
affected bits where not already set, everything just works and the
world is a happy place, right?
Except that I found it a little disturbing to leave out a safety check
just because of not being smart enough to place exactly the data on
disk that should be there. Even though Tux3 is not a nuclear reactor,
we still hate the idea going Chernobyl on our disk files, so one thing
we do not want to do is leave out a perfectly good safety check.
In the end, a precise solution only cost 15 new lines of code, although
determining exactly which 15 lines were the right ones took a couple of
days. A flock of Tux3 techniques came into play: block forking, the
log, allocation promises, index update promises, polymorphic block IO
transfers, per-delta dirty lists, per-inode dirty lists, and delta
transitions, all of which are also needed for other reasons. The
winning check-in is here:
http://hg.tux3.org/tux3/rev/42fe4237d34b
"A reworked fix to the allocate-during-bitmap-flush problem"
This solution is more or less as simple as I originally claimed.
Forking leaves the clone buffer carrying the original data on the dirty
list for the delta. The clone buffer points at the same allocation map
as the original and has the same index, however is not entered in the
hash map.
The problem I was having was due to mapping a buffer to disk and
submitting IO for it at the same time. The IO submitted would be for
the forked copy of the buffer data, sometimes with unwanted bits in it.
Also, buffers appearing later in the bitmap table dirty list might
already be forked, and again we would write out the current version of
the bitmap instead of the desired read-only clone. This was solved by
dividing the bitmap block flush into two steps: 1) map the buffer to
disk 2) write out the buffer unless its dirty state is now the current
delta, indicating that it was forked during the mapping. In short, we
just skip any forked blocks during the bitmap flush, then flush the
delta dirty list, and we are done.
Flushing the delta dirty list is a minor subproblem in itself. This
list may contain blocks from several kinds of structures, which are
mapped to disk in different ways. The main variants are logically
indexed buffers such as dirent blocks, and physically indexed buffers
such as btree nodes. Now we have a third variant: unhashed logically
indexed buffers, that is, clones resulting from buffer forks. The
normal method for writing logically indexed buffers (filemap_extent_io)
is not suitable because it looks up the buffer that will be written in
the mapping hash as part of an optimization to generate larger
transfers. This will write out the wrong data. So instead, the
allocation map gets its own IO mapping method, which uses the existing
file mapping method for read, but a simpler, single block method for
write. This method includes a check that the data being written does
not belong to the current delta, which has not yet been committed.
So here we find a useful application of the polymorphic mapping method
feature I have implemented in user space, and which I was never sure
had a solid reason to exist. Now it has made itself sufficiently
useful and deserves to be implemented in kernel as well.
Speaking of kernel, this solution needs to be implemented in kernel as
well. I am really glad I was able to bang away at this issue in user
space instead of kernel, with its longer development cycles and near
total lack of unit testing. All the while, I have been mindful that
the techniques employed are possible to implement in kernel with its
stringent SMP locking requirements and highly optimized internal
interfaces, which in some cases do not resemble what we do in user
space very much at all. In particular, block forking is a new and
untried technique for kernel, and far more difficult than user space
because of multiple blocks in different, asynchronously changing states
sharing the same underlying data page. We have to rip the data page
out from underneath a bunch of buffers and slip in a new one without
any of them noticing. Kind of like the trick where you pull the table
cloth out from underneath the dinner plates, so fast that nothing
crashes to the floor. Except that we also have to copy the table cloth
and slip the copy back underneath the dinnerware before it settles back
onto the table. Fun. We think this is possible, and pretty soon we
will find out for sure.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://mailman.tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list