[Tux3] Design note: The get_segs interface

Tue Dec 16 16:07:18 PST 2008

The handling of data file extents in filemap.c has up till now been 
monolithic and far from crystal clear.  The new get_segs interface 
imposes some structure in a way that maps pretty well to both kernel 
and userspace.  This interface will play a central role in atomic 
commit, which is the "Christmas Project".

The job of get_segs is to take a logical range of file blocks and fill 
in a vector of physical extents that specify how the logical range 
maps to physical blocks.  For write, get_segs allocates blocks as 
necessary, and for read, indicates regions that map to holes in a 
sparse file.

A lot of the work get_segs has to do is similar or identical for read 
and write.  It probes a btree to find a dleaf containing extent 
entries, finds the right place in the dleaf, then scans through the 
existing extents, adding segments that are found to the list of output 
segments, or creating new ones as necessary.

Gaps found between existing segments are treated differently for read 
and write: read adds them as holes to the output segment list while 
write allocates new regions of blocks to fill the gap.
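
The scan described above can be sketched roughly as follows.  This is 
not the code in filemap.c: a sorted flat array stands in for the dleaf, 
and the names map_range, struct extent, struct seg, SEG_HOLE, SEG_NEW 
and the toy balloc are all made up for illustration.  It also leaves 
out the btree probe and the leaf update that write needs afterwards.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

/* An existing mapping, standing in for a dleaf entry (hypothetical layout). */
struct extent { block_t logical, physical; unsigned count; };

enum { SEG_HOLE = 1, SEG_NEW = 2 };	/* hypothetical segment states */

struct seg { block_t block; unsigned count; int state; };

/* Toy allocator: hands out consecutive blocks from a counter. */
static block_t next_free = 1000;

static block_t balloc(unsigned count)
{
	block_t block = next_free;
	next_free += count;
	return block;
}

/*
 * Map logical range [start, start + count) onto segs[].  The extents
 * must be sorted and non-overlapping.  Gaps become holes for read
 * (create == 0) and fresh allocations for write (create != 0).
 * Returns the number of segments filled, stopping early at max_segs.
 */
static int map_range(const struct extent *ext, int nr, block_t start,
		     unsigned count, int create, struct seg *segs, int max_segs)
{
	block_t pos = start, limit = start + count;
	int i = 0, n = 0;

	assert(max_segs > 0);
	while (pos < limit && n < max_segs) {
		block_t end;
		unsigned len;

		/* Skip extents that end at or before the current position. */
		while (i < nr && ext[i].logical + ext[i].count <= pos)
			i++;
		if (i < nr && ext[i].logical <= pos) {
			/* Inside an existing extent: clip it to the range. */
			end = ext[i].logical + ext[i].count;
			len = (unsigned)((end < limit ? end : limit) - pos);
			segs[n++] = (struct seg){
				ext[i].physical + (pos - ext[i].logical), len, 0 };
		} else {
			/* Gap up to the next extent or the end of the range. */
			end = (i < nr && ext[i].logical < limit) ? ext[i].logical : limit;
			len = (unsigned)(end - pos);
			if (create)
				segs[n++] = (struct seg){ balloc(len), len, SEG_NEW };
			else
				segs[n++] = (struct seg){ 0, len, SEG_HOLE };
		}
		pos += len;
	}
	return n;
}
```

The only place read and write differ in this sketch is the gap branch, 
which is the point of the paragraph above.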

After completing a list of output segments, read is done, but write 
needs to update the data btree leaf with its changes.  At this point, 
it may discover that the new list of segments will not fit in the 
existing leaf, so it splits the leaf.  (While writing this post, I 
noticed a gigantic bug in filemap.c: as currently implemented, get_segs 
just repeats the whole scan and allocate, thus leaking all the blocks 
it allocated on the first attempt... moral of the story is, writing 
posts about computer code is good!)

After a list of physical extents is retrieved from get_segs, the caller 
does the necessary IO.  For userspace, this means calling 
diskread/diskwrite in a loop across segments, while for kernel, the 
block IO library fills in buffer details or attaches pages to a bio for 
asynchronous transfer to disk.  Holes on read are handled by filling 
the cached block with zeros.
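
For the userspace case, the caller's loop might look something like 
the sketch below.  The in-memory disk array and toy_diskread are 
stand-ins I made up so the example runs on its own (the real diskread 
is not reproduced here), and struct seg, SEG_HOLE and read_segs are 
likewise assumed names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t block_t;

enum { BLOCK_SIZE = 16, SEG_HOLE = 1 };	/* toy block size, hypothetical flag */

struct seg { block_t block; unsigned count; int state; };

/* In-memory stand-in for the device. */
static unsigned char disk[64 * BLOCK_SIZE];

/* Stand-in for a disk read: copy len bytes starting at the given block. */
static int toy_diskread(void *data, unsigned len, block_t block)
{
	if (block * BLOCK_SIZE + len > sizeof disk)
		return -1;
	memcpy(data, disk + block * BLOCK_SIZE, len);
	return 0;
}

/* Read each segment into buf in logical order, zero-filling holes. */
static int read_segs(const struct seg *segs, int nsegs, unsigned char *buf)
{
	assert(nsegs >= 0);
	for (int i = 0; i < nsegs; i++) {
		unsigned len = segs[i].count * BLOCK_SIZE;

		if (segs[i].state == SEG_HOLE)
			memset(buf, 0, len);	/* hole: nothing on disk to read */
		else if (toy_diskread(buf, len, segs[i].block))
			return -1;
		buf += len;
	}
	return 0;
}
```

Each segment costs one transfer no matter how many blocks it covers, 
and a hole costs no IO at all.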

This strategy works pretty well for traditional write-in-place behavior.  
But we also need to support redirect-on-write, that is, reallocate any 
existing extents to new physical positions to avoid overwriting parts 
of an existing, consistent filesystem image on disk.  Redirect is 
required for snapshotting, and to support stronger-than-POSIX semantics 
for atomic update of file data, along the lines of Ext3's data=journal 
mode.

To support redirect, get_segs has to allocate new extents not only for 
holes, but also to replace existing extents.  The replaced extents may 
have to be freed, but not immediately: those blocks must not be 
reallocated until after the current delta has fully arrived on disk.  
(When we have versioning, existing blocks are freed only when they 
become unused by any version.)  So get_segs will need a way of 
reporting the list of freed blocks, which it does not have yet.  But it 
also does not need it right now, because we do not have versioning at 
the moment, and a weaker form of atomic commit similar to Ext3's 
data=ordered mode will serve our immediate needs.
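
To make the deferred free idea concrete, here is a minimal sketch of 
one way it could work.  None of this exists in get_segs today, as noted 
above; struct deferred, redirect_extent, commit_delta and the toy 
allocator are all hypothetical, and versioning is ignored entirely.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

struct extent { block_t block; unsigned count; };

enum { MAX_DEFERRED = 16 };

/* Extents replaced during the current delta; not yet reusable. */
struct deferred { struct extent list[MAX_DEFERRED]; int count; };

static block_t next_free = 1000;	/* toy bump allocator */
static unsigned freed_blocks;		/* toy count of blocks really freed */

/* Redirect one extent: give it a new home and remember the old one. */
static int redirect_extent(struct extent *ext, struct deferred *defer)
{
	assert(ext->count > 0);
	if (defer->count == MAX_DEFERRED)
		return -1;
	defer->list[defer->count++] = *ext;	/* old blocks stay allocated */
	ext->block = next_free;
	next_free += ext->count;
	return 0;
}

/* Call only once the delta is fully on disk: old blocks may now be reused. */
static void commit_delta(struct deferred *defer)
{
	for (int i = 0; i < defer->count; i++)
		freed_blocks += defer->list[i].count;
	defer->count = 0;
}
```

The important property is just the ordering: nothing on the deferred 
list goes back to the allocator before commit_delta runs.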

Just looking at the code (again, writing about this helps), get_segs 
could easily be refactored so that gaps are returned as holes for 
write, just like read, and the existing update_tree code can be pulled 
out into a separate function that calls get_segs, then does the 
write-specific processing.

In spirit, get_segs resembles the traditional ->get_block interface
used by the block IO library, but has an interface better suited to the 
problem rather than staying wedded to an ancient existing structure 
(buffer_head) as ->get_block does.  It works well as a way of gluing 
the traditional ->get_block interface onto our model, as can be seen 
from the implementation of tux3_get_block in filemap.c.

For mapping big regions with complex physical allocation patterns, 
get_segs provides a much more efficient interface than the existing 
block-at-a-time arrangement using buffer_heads, which was somewhat 
painfully extended over time to handle multiple contiguous blocks in 
certain circumstances.

The get_segs interface allows us to invert the traditional block library 
approach of calling ->get_block multiple times to build up a 
multi-block IO transfer.  Instead, we retrieve the physical mapping for 
a big logical region in one call to get_segs, with just one probe into 
the btree.
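
A toy comparison of the two calling patterns, just to illustrate the 
probe count.  The mapping here is invented (runs of eight blocks, each 
run placed at a different physical offset), the probe is only a 
counter, and toy_get_segs and toy_get_block are made-up names, not the 
real get_segs or tux3_get_block from filemap.c.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t block_t;

struct seg { block_t block; unsigned count; };

enum { RUN = 8 };	/* toy layout: each run of 8 logical blocks is contiguous */

static int probes;	/* counts simulated btree probes */

/* Map [start, start + count) with a single probe, one segment per run. */
static int toy_get_segs(block_t start, unsigned count, struct seg *segs,
			int max_segs)
{
	block_t pos = start, limit = start + count;
	int n = 0;

	probes++;
	while (pos < limit && n < max_segs) {
		block_t run_end = (pos / RUN + 1) * RUN;
		unsigned len = (unsigned)((run_end < limit ? run_end : limit) - pos);

		segs[n++] = (struct seg){ 1000 * (pos / RUN + 1) + pos % RUN, len };
		pos += len;
	}
	return n;
}

/* get_block style wrapper: one block, one segment, one probe per call. */
static block_t toy_get_block(block_t logical)
{
	struct seg seg;
	int n = toy_get_segs(logical, 1, &seg, 1);

	assert(n == 1);
	(void)n;
	return seg.block;
}
```

Mapping 32 blocks through toy_get_block costs 32 probes in this toy, 
against one probe for a single toy_get_segs call over the same range.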

The current implementation of get_segs has a few deficiencies, still 
being worked on:

  - Existing extents that overlap the beginning or end of the logical
    region are not handled correctly on write.

  - The inner loop on extents always stops after the first extent,
    which was to keep it compatible with Hirofumi's functional
    implementation of tux3_get_block.  Now the original loop can be
    restored, and we just tell get_segs to return one segment at a
    time.

In fact, the segments I talk about here are really just extents.
When the interface settles down and proves itself out, the natural
thing to do is rename it as get_extents, which could potentially
evolve into a new, improved block library interface better suited
to modern filesystems.

Regards,

Daniel
