[Tux3] Extent support has landed

Fri Oct 3 09:03:55 PDT 2008

Tux3 now has extent support, for now and evermore.  I flipped over the 
#defines to use extents for more than just unit testing.  The tux3 user 
command now runs with extents and so do both of the Fuse versions.  I 
did not test any of these!  Somebody, kindly check and see what breaks.

I did test extents very lightly using the inode.c unit test, which 
successfully writes and reads back "hello world", and somewhat more 
exhaustively using the filemap.c unit tests, but this is still very 
green code.  I expect a number of bugs to turn up.

Developing the extent support was by no means a cut and paste job.  On 
the contrary it was nearly three weeks of grinding slow work.  This is 
new territory, quite unlike any filesystem code I have written before, 
and there was precious little guidance out there on the net.  The 
combinatorics are fairly horrifying as I touched on in an earlier post.  
There will be a longish post coming soon on the extent machinery and 
the new api I created for processing and editing extents, which worked 
out pretty well.  For now I will briefly describe how the filemap 
operations are structured.

Both buffer flush and buffer read are handled by the same function, 
filemap_extent_io, because much of the code is exactly the same.  No 
sense in letting the common parts drift apart and end up making us fix 
bugs twice.

The extent io code is driven by the block-at-a-time bread and bwrite 
interfaces, which is perverse but it is much the way things work in 
kernel at the moment.  Naturally, we would really like big reads and 
writes to drive the extent interface directly, but if we choose to go 
that route we will basically get stuck with the task of re-implementing 
the whole kernel generic_read/write family of functions.  A big job, 
and then we will be stuck with maintaining it.  Maybe we will 
eventually do that in the interest of yet more performance, but for now 
we put up with being called one buffer at a time, and while handling 
the one buffer we check to see if some neighboring buffers can also be 
included to form extents.  This opportunistic extent formation is 
handled by the guess_extent function.
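
To make the opportunistic extent formation concrete, here is a minimal 
sketch in C.  It is not the actual guess_extent code: the dirty array, 
the extent_guess struct, the function name and the 64 block cap are all 
assumptions made for illustration.  Starting from the one buffer being 
handled, it grows the range over contiguous dirty neighbors.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_EXTENT 64 /* assumed cap on blocks per extent */

struct extent_guess { size_t start, count; };

/*
 * Hypothetical stand-in for guess_extent.  dirty[i] says whether buffer i
 * is waiting to be written.  Starting from the buffer we were called for,
 * grow the range over contiguous dirty neighbors, up to MAX_EXTENT blocks.
 */
static struct extent_guess guess_extent_sketch(const bool *dirty, size_t nbufs, size_t index)
{
	assert(index < nbufs && dirty[index]);
	size_t start = index, end = index + 1;

	/* Extend forward over dirty neighbors first */
	while (end < nbufs && dirty[end] && end - start < MAX_EXTENT)
		end++;
	/* Then extend backward with whatever room is left */
	while (start > 0 && dirty[start - 1] && end - start < MAX_EXTENT)
		start--;
	return (struct extent_guess){ start, end - start };
}
```

Called for a buffer in the middle of a dirty run, the sketch returns the 
whole run; called for an isolated dirty buffer, it returns an extent of 
one block.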

Once we have the extent, we probe into the btree, 64 blocks below the 
beginning of the extent, which ensures that no existing extents that 
may overlap the io extent are missed.  (I just realized there is a bug 
here: we might have to advance to the next block and there is no code 
to do that yet.)

We walk forward through the leading extents until we get to one that 
overlaps or begins exactly at the io extent.  Then a slightly tricky 
bit of code walks across the pre-existing extents taking note of any 
gaps between them.  For write, blocks are allocated to fill the gaps 
and for read, the unmapped extents are noted.  In either case, all 
prior extents found and the new extents created to fill the gaps are 
saved in a "segs" vector.
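
The probe and gap walk described above might look roughly like the 
following sketch.  The struct layouts, the seg states and the names 
probe_start and map_segs are hypothetical, not taken from the Tux3 
source, and unlike the real code the sketch clips an extent that starts 
below the io extent to the io range instead of keeping it whole.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t block_t;

enum seg_state { SEG_OLD, SEG_NEW, SEG_HOLE };

struct extent { block_t logical; unsigned count; };
struct seg { block_t logical; unsigned count; enum seg_state state; };

/* Start the probe max_extent blocks below the io extent, clamped at zero */
static block_t probe_start(block_t start, unsigned max_extent)
{
	return start > max_extent ? start - max_extent : 0;
}

/*
 * Walk sorted, non-overlapping extents across the io range
 * [start, start + count), emitting one seg per existing extent and one
 * per gap.  Gaps become SEG_NEW for write (to be allocated) and SEG_HOLE
 * for read.  Returns the number of segs, or -1 if the vector is full.
 */
static int map_segs(const struct extent *ext, size_t n, block_t start,
	unsigned count, int write, struct seg *segs, size_t max)
{
	block_t pos = start, limit = start + count;
	enum seg_state gap = write ? SEG_NEW : SEG_HOLE;
	size_t used = 0;

	for (size_t i = 0; i < n && pos < limit; i++) {
		block_t lo = ext[i].logical, hi = lo + ext[i].count;
		if (hi <= pos)
			continue; /* leading extent, entirely below the io range */
		if (lo >= limit)
			break; /* past the end of the io range */
		if (lo > pos) { /* gap before this extent */
			if (used == max)
				return -1;
			segs[used++] = (struct seg){ pos, (unsigned)(lo - pos), gap };
			pos = lo;
		}
		block_t end = hi < limit ? hi : limit;
		if (used == max)
			return -1;
		segs[used++] = (struct seg){ pos, (unsigned)(end - pos), SEG_OLD };
		pos = end;
	}
	if (pos < limit) { /* trailing gap after the last extent */
		if (used == max)
			return -1;
		segs[used++] = (struct seg){ pos, (unsigned)(limit - pos), gap };
	}
	return (int)used;
}
```

For example, with existing extents at blocks 10..13 and 20..23, an io 
extent covering blocks 12..23 yields three segs: an old one for 12..13, 
a gap for 14..19 and an old one for 20..23.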

Up till here, read and write are nearly identical.  The next few steps 
are specific to write.  We add the remaining extents all the way to the 
end of the dleaf block to the segs vector.  (This will be inefficient 
in the general case, but does not really matter for now because the 
tail of the dleaf will usually be empty.  This lazy approach will 
eventually be fixed, but at the moment it would be a lot of distracting 
work when there are more pressing issues to address.)

Continuing in the write-specific code, we use the dwalk_mock function to 
figure out how much additional space will be needed for the new 
extents.  Dwalk_mock works just like dwalk_pack, but only calculates 
the space that will be required without modifying the leaf.  Then, if 
insufficient space is available, we split the leaf and retry the whole 
write up to this point.  (And I just noticed that the retry repeats the 
probe, which is unnecessary.)
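
The dry run idea can be sketched as follows.  This is only an 
illustration of the pattern, not dwalk_mock itself: the toy leaf format 
(a group header whenever the high bits of the logical address change, 
plus one fixed size entry per extent), the byte costs and the names 
toy_mock and toy_fits are all made up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Made-up costs for a toy leaf format, not the real dleaf layout */
enum { GROUP_BYTES = 8, ENTRY_BYTES = 8, GROUP_SHIFT = 24 };

struct toy_extent { uint64_t logical; unsigned count; };

/*
 * Dry run: walk the extents exactly as a packer would, but only add up
 * the bytes they would occupy.  Nothing is modified.
 */
static size_t toy_mock(const struct toy_extent *segs, size_t n)
{
	uint64_t group = UINT64_MAX; /* no group open yet */
	size_t bytes = 0;

	for (size_t i = 0; i < n; i++) {
		uint64_t g = segs[i].logical >> GROUP_SHIFT;
		if (g != group) { /* extent starts a new group */
			bytes += GROUP_BYTES;
			group = g;
		}
		bytes += ENTRY_BYTES;
	}
	return bytes;
}

/* Returns 1 if the extents fit, 0 if the caller should split and retry */
static int toy_fits(const struct toy_extent *segs, size_t n, size_t free_bytes)
{
	return toy_mock(segs, n) <= free_bytes;
}
```

Running the same walk twice, once to measure and once to pack, keeps 
the size calculation from drifting out of step with the packer, at the 
cost of walking the segs vector a second time.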

Continuing on in the write-specific code, we truncate the dleaf at the 
point just before the io extent and append the new set of extents from 
the segs list, which for write includes all the extents above the io 
extent to the end of the leaf.  (Once we get really good at this we 
will just memmove the tail of the leaf instead of adding it to the segs 
vector then packing the segs back in one at a time, which code will be 
a little tricky.)
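
A minimal sketch of the truncate and append step, assuming a toy leaf 
that is just a sorted array of extents.  The toy_leaf layout and the 
name toy_rewrite are hypothetical, and the real dleaf is more involved 
than this.

```c
#include <assert.h>
#include <stdint.h>

enum { LEAF_MAX = 16 }; /* arbitrary toy capacity */

struct toy_extent { uint64_t logical; unsigned count; };
struct toy_leaf { unsigned used; struct toy_extent entry[LEAF_MAX]; };

/*
 * Keep only the leading extents that end at or before the io extent,
 * then append the segs vector, which is expected to hold everything from
 * the first overlapping extent to the end of the leaf.  Returns -1 and
 * leaves the leaf untouched if the result would not fit.
 */
static int toy_rewrite(struct toy_leaf *leaf, uint64_t io_start,
	const struct toy_extent *segs, unsigned n)
{
	unsigned keep = 0;

	while (keep < leaf->used &&
	       leaf->entry[keep].logical + leaf->entry[keep].count <= io_start)
		keep++;
	if (keep + n > LEAF_MAX)
		return -1; /* caller would split the leaf and retry */
	leaf->used = keep; /* truncate just before the io extent */
	for (unsigned i = 0; i < n; i++)
		leaf->entry[leaf->used++] = segs[i];
	return 0;
}
```

The segs vector has to be a separate copy, not pointers into the leaf, 
because the append overwrites the entries it was built from.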

Then we do the actual IO, a loop across the segs list that differs only 
slightly between read and write, mainly in that buffers mapping to 
empty segments for read are zero filled here, and of course the io 
direction differs.  In kernel we would be setting up a bio here, which 
can do the whole job in one transfer.  In userspace we would use 
preadv/pwritev if Linux had them, but Linux does not, so we use our 
trusty diskread/diskwrite routines instead.  (We are not really aiming 
for performance in this userspace code, just correctness, and code that 
will be fast in kernel.  Still, if preadv/pwritev had been available I 
would have used them.)
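
The shape of that IO loop, as a sketch only.  The io_seg struct, the 
512 byte block size and the function name seg_io are assumptions, and 
plain POSIX pread/pwrite stand in for the diskread/diskwrite routines, 
whose signatures are not shown here.

```c
#define _XOPEN_SOURCE 700 /* for pread and pwrite */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { TOY_BLOCK = 512 }; /* assumed block size */

struct io_seg { uint64_t physical; unsigned count; int hole; };

/*
 * Transfer each seg in turn between the device and one contiguous data
 * area.  For read, segs marked as holes are zero filled instead of being
 * read; for write, holes should already have been allocated.
 */
static int seg_io(int fd, const struct io_seg *segs, unsigned n,
	unsigned char *data, int write)
{
	for (unsigned i = 0; i < n; i++) {
		size_t bytes = (size_t)segs[i].count * TOY_BLOCK;
		off_t where = (off_t)segs[i].physical * TOY_BLOCK;

		if (segs[i].hole) {
			if (write)
				return -1; /* unallocated seg on the write path */
			memset(data, 0, bytes);
		} else {
			ssize_t done = write ?
				pwrite(fd, data, bytes, where) :
				pread(fd, data, bytes, where);
			if (done != (ssize_t)bytes) {
				perror("seg_io");
				return -1;
			}
		}
		data += bytes; /* next seg lands right after this one */
	}
	return 0;
}
```

One system call per seg is the userspace analogue of the loop described 
above; a vectored interface or a bio would collapse the loop into a 
single submission.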

With this new extent code enabled, I broke truncate, which still thinks 
it is dealing with block pointers.  It will do the wrong thing if the 
truncate lands in the middle of an extent.  Probably nobody will notice 
even with Fuse, because truncate to zero should still work just fine.

So sigh, this is done.  Nearly.

Extents are nothing more than a performance hack, but Tux3 needs to 
benchmark well in order to thrive in the filesystem jungle, and it will 
be helpful if it benchmarks well right from the day it lands in kernel.  
Plus, I would rather not do the versioning work twice, once for 
pointers and again for extents.  With the arrival of extents we also 
gained a nice api for dleaf editing, which will be a theme to build on 
when the even trickier versioning code starts to land.

Speaking of performance, extents actually reduce it for files that are 
only one block long.  We will have to get busy and do some optimization 
for the one block file case.  I think it is quite optimizable and so we 
should eventually get it to the point where the overhead vs single 
block IO is not noticeable.  It will not be a huge difference anyway, 
but as it stands I think it could be measured and might put us a little 
behind a non-extent filesystem like Ext3 for some loads, until we fix 
it.  For two block files and larger, the extent code is a clear winner 
by any measure: cpu, cache or disk space.

Regards,

Daniel
