[Tux3] Extent support has landed
Daniel Phillips
phillips at phunq.net
Fri Oct 3 09:03:55 PDT 2008
Tux3 now has extent support, for now and evermore. I flipped over the
#defines to use extents for more than just unit testing. The tux3 user
command now runs with extents and so do both of the Fuse versions. I
did not test any of these! Somebody, kindly check and see what breaks.
I did test extents very lightly using the inode.c unit test, which
successfully writes and reads back "hello world", and somewhat more
exhaustively using the filemap.c unit tests, but this is still very
green code. I expect a number of bugs to turn up.
Developing the extent support was by no means a cut and paste job. On
the contrary it was nearly three weeks of grinding slow work. This is
new territory, quite unlike any filesystem code I have written before,
and there was precious little guidance out there on the net. The
combinatorics are fairly horrifying as I touched on in an earlier post.
There will be a longish post coming soon on the extent machinery and
the new api I created for processing and editing extents, which worked
out pretty well. For now I will briefly describe how the filemap
operations are structured.
Both buffer flush and buffer read are handled by the same function,
filemap_extent_io, because much of the code is exactly the same. No
sense in letting the common parts drift apart and end up making us fix
bugs twice.
The extent io code is driven by the block-at-a-time bread and bwrite
interfaces, which is perverse but it is much the way things work in
kernel at the moment. Naturally, we would really like big reads and
writes to drive the extent interface directly, but if we choose to go
that route we will basically get stuck with the task of re-implementing
the whole kernel generic_read/write family of functions. A big job,
and then we will be stuck with maintaining it. Maybe we will
eventually do that in the interest of yet more performance, but for now
we put up with being called one buffer at a time, and while handling
the one buffer we check to see if some neighboring buffers can also be
included to form extents. This opportunistic extent formation is
handled by the guess_extent function.
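
To illustrate the idea of opportunistic extent formation, here is a minimal sketch. It is not the actual guess_extent code: the function name guess_extent_sketch, the dirty[] array standing in for buffer state, and the 64 block cap are all assumptions made for the example. For read, the same scan would look at buffers that are not yet up to date instead of dirty ones.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_EXTENT = 64 };	/* assumed cap on extent length, in blocks */

struct extent_guess { size_t start; size_t count; };

/*
 * dirty[i] is nonzero when cached block i is waiting to be written.
 * Starting from the one block we were called for, grow the range over
 * adjacent dirty blocks, first upward and then downward, stopping at
 * MAX_EXTENT blocks in total.
 */
static struct extent_guess guess_extent_sketch(const unsigned char *dirty,
	size_t nblocks, size_t index)
{
	assert(index < nblocks && dirty[index]);
	size_t start = index, end = index + 1;	/* half-open range [start, end) */

	while (end < nblocks && dirty[end] && end - start < MAX_EXTENT)
		end++;
	while (start > 0 && dirty[start - 1] && end - start < MAX_EXTENT)
		start--;
	return (struct extent_guess){ start, end - start };
}
```

Called for block 2 of a file whose blocks 1 to 3 are dirty, this returns the extent starting at block 1 with length 3, so one call on behalf of one buffer can carry its neighbors along.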
Once we have the extent, we probe into the btree, 64 blocks below the
beginning of the extent, which ensures that no existing extents that
may overlap the io extent are missed. (I just realized there is a bug
here: we might have to advance to the next block and there is no code
to do that yet.)
We walk forward through the leading extents until we get to one that
overlaps or begins exactly at the io extent. Then a slightly tricky
bit of code walks across the pre-existing extents taking note of any
gaps between them. For write, blocks are allocated to fill the gaps
and for read, the unmapped extents are noted. In either case, all
prior extents found and the new extents created to fill the gaps are
saved in a "segs" vector.
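
The walk across existing extents can be sketched roughly as follows. This is a simplified model, not the Tux3 code: struct seg, find_segs and the flat sorted array used in place of a dleaf are hypothetical, physical block addresses are left out, and existing extents are clipped to the io range for brevity. Gaps are only marked here; for write, each gap would be filled by allocating blocks.

```c
#include <assert.h>
#include <stddef.h>

struct seg { unsigned long start; unsigned count; int hole; };

/*
 * map[] holds the existing extents, sorted by start and non-overlapping.
 * Cover the io range [start, start + count) with segments: pieces of
 * existing extents (hole = 0) and the gaps between them (hole = 1).
 * Returns the number of segments stored in segs[].
 */
static size_t find_segs(const struct seg *map, size_t nmap,
	unsigned long start, unsigned count, struct seg *segs, size_t max)
{
	unsigned long pos = start, end = start + count;
	size_t n = 0;

	for (size_t i = 0; i < nmap && pos < end; i++) {
		unsigned long lo = map[i].start, hi = lo + map[i].count;
		if (hi <= pos)
			continue;	/* leading extent, entirely below the io range */
		if (lo >= end)
			break;		/* entirely above the io range */
		if (lo > pos) {		/* gap in front of this extent */
			assert(n < max);
			segs[n++] = (struct seg){ pos, (unsigned)(lo - pos), 1 };
			pos = lo;
		}
		unsigned long stop = hi < end ? hi : end;
		assert(n < max);
		segs[n++] = (struct seg){ pos, (unsigned)(stop - pos), 0 };
		pos = stop;
	}
	if (pos < end) {		/* trailing gap after the last extent */
		assert(n < max);
		segs[n++] = (struct seg){ pos, (unsigned)(end - pos), 1 };
	}
	return n;
}
```

With existing extents at blocks 0-3, 10-13 and 20-23, an io range covering blocks 8-17 yields three segments: a gap of 2 blocks at 8, the mapped extent of 4 blocks at 10, and a gap of 4 blocks at 14.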
Up till here, read and write are nearly identical. The next few steps
are specific to write. We add the remaining extents all the way to the
end of the dleaf block to the segs vector. (This will be inefficient
in the general case, but does not really matter for now because the
tail of the dleaf will usually be empty. This lazy approach will
eventually be fixed, but at the moment it would be a lot of distracting
work when there are more pressing issues to address.)
Continuing in the write-specific code, we use the dwalk_mock function to
figure out how much additional space will be needed for the new
extents. Dwalk_mock works just like dwalk_pack, but only calculates
the space that will be required without modifying the leaf. Then, if
insufficient space is available, we split the leaf and retry the whole
write up to this point. (And I just noticed that this means repeating
the probe, which is unnecessary.)
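
The mock-then-pack pattern might look something like this in miniature. The leaf layout here is invented for the example (fixed 8 byte entries in a 64 byte leaf) and the names leaf_pack, leaf_mock and leaf_try_pack are hypothetical; the real dleaf encoding is more involved. The point is only that the mock pass does the same bookkeeping as the real pass on a private counter, so the leaf is untouched when the answer is "does not fit".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LEAF_BYTES = 64, ENTRY_BYTES = 8 };	/* hypothetical sizes */

struct leaf { unsigned used; unsigned char data[LEAF_BYTES]; };

/* Real pack: append one encoded entry; the caller guarantees room. */
static void leaf_pack(struct leaf *leaf, uint64_t entry)
{
	assert(leaf->used + ENTRY_BYTES <= LEAF_BYTES);
	memcpy(leaf->data + leaf->used, &entry, ENTRY_BYTES);
	leaf->used += ENTRY_BYTES;
}

/* Mock pack: the same space accounting, applied to a counter only. */
static void leaf_mock(unsigned *used)
{
	*used += ENTRY_BYTES;
}

/*
 * Returns 1 after packing all entries if they fit. Returns 0 with the
 * leaf unmodified if they do not, in which case the caller would split
 * the leaf and retry.
 */
static int leaf_try_pack(struct leaf *leaf, const uint64_t *entries, size_t n)
{
	unsigned used = leaf->used;

	for (size_t i = 0; i < n; i++)
		leaf_mock(&used);
	if (used > LEAF_BYTES)
		return 0;
	for (size_t i = 0; i < n; i++)
		leaf_pack(leaf, entries[i]);
	return 1;
}
```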
Continuing on in the write-specific code, we truncate the dleaf at the
point just before the io extent and append the new set of extents from
the segs list, which for write includes all the extents above the io
extent to the end of the leaf. (Once we get really good at this we
will just memmove the tail of the leaf instead of adding it to the segs
vector and then packing the segs back in one at a time, though that
code will be a little tricky.)
Then we do the actual IO, a loop across the segs list that differs only
slightly between read and write, mainly in that buffers mapping to
empty segments for read are zero filled here, and of course the io
direction differs. In kernel we would be setting up a bio here, which
can do the whole job in one transfer. In userspace we would use
preadv/pwritev if Linux had them, but Linux does not, so we use our
trusty diskread/diskwrite routines instead. (We are not really aiming
for performance in this userspace code, just correctness, and code that
will be fast in kernel. Still, if preadv/pwritev had been available I
would have used them.)
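
A rough sketch of such a loop, with read and write sharing everything except the transfer direction and the zero fill. The names are made up for the example: struct ioseg, segs_io and disk_io are hypothetical, and disk_io copies to and from a byte array as a stand-in for the diskread/diskwrite routines, whose real signatures are not shown in this post.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK = 16 };	/* tiny block size, just to keep the example small */

struct ioseg { unsigned block; unsigned count; int hole; };

/* Stand-in for diskread/diskwrite: the "disk" is a byte array. */
static void disk_io(unsigned char *disk, unsigned block,
	unsigned char *data, unsigned count, int write)
{
	unsigned char *where = disk + (size_t)block * BLOCK;
	size_t bytes = (size_t)count * BLOCK;

	if (write)
		memcpy(where, data, bytes);
	else
		memcpy(data, where, bytes);
}

/* One transfer per segment; data advances through the io buffer. */
static void segs_io(unsigned char *disk, const struct ioseg *segs,
	size_t nsegs, unsigned char *data, int write)
{
	for (size_t i = 0; i < nsegs; i++) {
		size_t bytes = (size_t)segs[i].count * BLOCK;
		if (segs[i].hole) {
			assert(!write);	/* write filled every gap before this loop */
			memset(data, 0, bytes);	/* unmapped on read: zero fill */
		} else
			disk_io(disk, segs[i].block, data, segs[i].count, write);
		data += bytes;
	}
}
```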
With this new extent code enabled, I broke truncate, which still thinks
it is dealing with block pointers. It will do the wrong thing if the
truncate lands in the middle of an extent. Probably nobody will notice
even with Fuse, because truncate to zero should still work just fine.
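
For illustration, this is roughly what an extent-aware truncate has to do that a block pointer truncate does not: shorten the extent that straddles the new end of file instead of only dropping whole entries. It is a sketch on a flat sorted array with hypothetical names (struct extent, truncate_extents), not the fix itself, and it leaves out returning the freed blocks to the allocator.

```c
#include <assert.h>
#include <stddef.h>

struct extent { unsigned long start; unsigned count; };

/*
 * Truncate a sorted, non-overlapping extent list so that nothing maps
 * at or above new_blocks. Returns the new number of extents.
 */
static size_t truncate_extents(struct extent *map, size_t n,
	unsigned long new_blocks)
{
	size_t keep = 0;

	for (size_t i = 0; i < n; i++) {
		/* input must be sorted and non-overlapping */
		assert(i == 0 || map[i - 1].start + map[i - 1].count <= map[i].start);
		if (map[i].start >= new_blocks)
			break;	/* this extent and all later ones are dropped */
		if (map[i].start + map[i].count > new_blocks)
			/* truncate lands inside this extent: shorten it */
			map[i].count = (unsigned)(new_blocks - map[i].start);
		keep++;
	}
	return keep;
}
```

Truncating to zero takes the early exit on the first extent, which matches the observation that truncate to zero is the easy case.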
So sigh, this is done. Nearly.
Extents are nothing more than a performance hack, but Tux3 needs to
benchmark well in order to thrive in the filesystem jungle, and it will
be helpful if it benchmarks well right from the day it lands in kernel.
Plus, I would rather not do the versioning work twice, once for
pointers and again for extents. With the arrival of extents we also
gained a nice api for dleaf editing, which will be a theme to build on
when the even trickier versioning code starts to land.
Speaking of performance, extents actually reduce it for files that are
only one block long. We will have to get busy and do some optimization
for the one block file case. I think it is quite optimizable and so we
should eventually get it to the point where the overhead vs single
block IO is not noticeable. It will not be a huge difference anyway,
but as it stands I think it could be measured and might put us a little
behind a non-extent filesystem like Ext3 for some loads, until we fix
it. For two block files and larger, the extent code is a clear winner
by any measure: cpu, cache or disk space.
Regards,
Daniel
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3