[Tux3] Design note: The get_segs interface
phillips at phunq.net
Tue Dec 16 16:07:18 PST 2008

The handling of data file extents in filemap.c has up till now been
monolithic and far from crystal clear. The new get_segs interface
imposes some structure in a way that maps pretty well to both kernel
and userspace. This interface will play a central role in atomic
commit, which is the "Christmas Project".

The job of get_segs is to take a logical range of file blocks and fill
in a vector of physical extents that specify how the logical range
maps to physical blocks. For write, get_segs allocates blocks as
necessary, and for read, indicates regions that map to holes in the
file.

A lot of the work get_segs has to do is similar or identical for read
and write. It probes a btree to find a dleaf containing extent
entries, finds the right place in the dleaf, then scans through the
existing extents, adding segments that are found to the list of output
segments, or creating new ones as necessary.

Gaps found between existing segments are treated differently for read
and write: read adds them as holes to the output segment list while
write allocates new regions of blocks to fill the gap.
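
The scan described above can be sketched in C roughly as follows. This is
only an illustration under assumptions: the struct extent, struct seg and
enum seg_state types, the map_segs name, and the toy alloc_blocks bump
allocator are all hypothetical and are not the definitions in filemap.c.
A sorted array stands in for the btree probe and dleaf lookup, and the
leaf update that write needs afterwards is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t block_t;

/* Hypothetical types for illustration, not the ones in filemap.c */
enum seg_state { SEG_MAPPED, SEG_HOLE, SEG_NEW };

struct extent { block_t logical, physical; unsigned count; };
struct seg { block_t physical; unsigned count; enum seg_state state; };

/* Toy bump allocator standing in for the real block allocator */
static block_t next_free = 1000;

static block_t alloc_blocks(unsigned count)
{
	block_t start = next_free;
	next_free += count;
	return start;
}

/*
 * Describe logical range [start, start + count) as output segments.
 * ext[] must be sorted by logical address and non-overlapping.
 * Gaps become holes on read and fresh allocations on write.
 * Returns the number of segments, or -1 if segs[] is too small.
 */
static int map_segs(const struct extent *ext, size_t nr_ext,
		    block_t start, unsigned count, int write,
		    struct seg *segs, size_t max_segs)
{
	block_t pos = start, end = start + count;
	size_t n = 0;

	assert(segs != NULL);
	/* The extra pass (i == nr_ext) handles a gap after the last extent */
	for (size_t i = 0; i <= nr_ext && pos < end; i++) {
		block_t ext_start = i < nr_ext ? ext[i].logical : end;
		block_t ext_end = i < nr_ext ? ext[i].logical + ext[i].count : end;

		if (ext_end <= pos)
			continue;	/* extent lies entirely before the range */
		if (ext_start > end)
			ext_start = end;
		if (ext_start > pos) {	/* gap before this extent */
			unsigned gap = (unsigned)(ext_start - pos);

			if (n == max_segs)
				return -1;
			if (write)
				segs[n++] = (struct seg){ alloc_blocks(gap), gap, SEG_NEW };
			else
				segs[n++] = (struct seg){ 0, gap, SEG_HOLE };
			pos = ext_start;
		}
		if (i < nr_ext && pos < end) {	/* part of extent inside the range */
			block_t stop = ext_end < end ? ext_end : end;

			if (n == max_segs)
				return -1;
			segs[n++] = (struct seg){ ext[i].physical + (pos - ext[i].logical),
						  (unsigned)(stop - pos), SEG_MAPPED };
			pos = stop;
		}
	}
	return (int)n;
}
```

With extents covering logical blocks 2-4 and 8-9, a read of blocks 0-11
comes back as hole, mapped, hole, mapped, hole, while a write of blocks
3-8 returns the tail of the first extent, a newly allocated three-block
segment, and the first block of the second extent.
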

After completing a list of output segments, read is done, but write
needs to update the data btree leaf with its changes. At this point,
it may discover that the new list of segments will not fit in the
existing leaf, so it splits the leaf. (While writing this post, I
noticed a gigantic bug in filemap.c: as currently implemented, get_segs
just repeats the whole scan and allocate, thus leaking all the blocks
it allocated on the first attempt... moral of the story is, writing
posts about computer code is good!)

After a list of physical extents is retrieved from get_segs, the caller
does the necessary IO. For userspace, this means calling
diskread/diskwrite in a loop across segments, while for kernel, the
block IO library fills in buffer details or attaches pages to a bio for
asynchronous transfer to disk. Holes on read are handled by filling
the cached block with zeroes.
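
For the userspace case, the caller's loop might look something like the
sketch below. Everything here is an assumption made for illustration: the
in-memory disk array, the toy_diskread helper (a stand-in, not the real
diskread signature), the simplified struct seg, and the read_segs name.
It only shows the shape of the loop, one transfer per segment, with holes
zero-filled instead of read.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLOCK_SIZE 16u
#define DISK_BLOCKS 64u

/* Toy in-memory "disk" so the sketch needs no real device */
static unsigned char disk[DISK_BLOCKS][BLOCK_SIZE];

/* Simplified, hypothetical segment: a physical run or a hole */
struct seg { unsigned physical, count; int hole; };

/* Hypothetical stand-in for a userspace disk read helper */
static int toy_diskread(void *buf, unsigned block, unsigned count)
{
	if (block >= DISK_BLOCKS || count > DISK_BLOCKS - block)
		return -1;
	memcpy(buf, disk[block], (size_t)count * BLOCK_SIZE);
	return 0;
}

/*
 * Fill buf from a list of segments, one transfer per segment.
 * Holes are filled with zeroes rather than read from disk.
 * Returns the number of blocks placed in buf, or -1 on IO error.
 */
static int read_segs(const struct seg *segs, size_t nr, unsigned char *buf)
{
	unsigned done = 0;

	assert(segs != NULL && buf != NULL);
	for (size_t i = 0; i < nr; i++) {
		unsigned char *dst = buf + (size_t)done * BLOCK_SIZE;

		if (segs[i].hole)
			memset(dst, 0, (size_t)segs[i].count * BLOCK_SIZE);
		else if (toy_diskread(dst, segs[i].physical, segs[i].count) < 0)
			return -1;
		done += segs[i].count;
	}
	return (int)done;
}
```

A write loop would have the same shape, with a write helper in place of
the read helper and no hole case, since write fills gaps with newly
allocated blocks.
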

This strategy works pretty well for traditional write-in-place behavior.
But we also need to support redirect-on-write, that is, reallocate any
existing extents to new physical positions to avoid overwriting parts
of an existing, consistent filesystem image on disk. Redirect is
required for snapshotting, and to support stronger-than-POSIX semantics
for atomic update of file data, along the lines of Ext3's data=journal
mode.

To support redirect, get_segs has to allocate new extents not only for
holes, but also to replace existing extents. The replaced extents may
have to be freed, but not immediately: those blocks must not be
reallocated until after the current delta has fully arrived on disk.
(When we have versioning, existing blocks are freed only when they
become unused by any version.) So get_segs will need a way of
reporting the list of freed blocks, which it does not have yet. But it
also does not need it right now, because we do not have versioning at
the moment, and a weaker form of atomic commit similar to Ext3's
data=ordered mode will serve our immediate needs.
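
One way the deferred free could work is sketched below. This is not an
existing Tux3 interface: struct defree, redirect_extent, commit_delta and
the first-fit bitmap allocator are all hypothetical names invented for
the illustration. The point it shows is only the ordering constraint,
namely that replaced blocks stay marked in use until the delta commits.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NR_BLOCKS 32u
#define MAX_DEFREE 8

static unsigned char in_use[NR_BLOCKS];	/* toy allocation map */

struct extent { unsigned physical, count; };

/* Hypothetical per-delta list of extents to free after commit */
struct defree { struct extent list[MAX_DEFREE]; size_t nr; };

/* First-fit allocation of a contiguous run; returns start or -1 */
static int alloc_blocks(unsigned count)
{
	for (unsigned start = 0; start + count <= NR_BLOCKS; start++) {
		unsigned i = 0;

		while (i < count && !in_use[start + i])
			i++;
		if (i == count) {
			memset(in_use + start, 1, count);
			return (int)start;
		}
	}
	return -1;
}

/*
 * Redirect an existing extent to newly allocated blocks. The old blocks
 * are only recorded for later, not freed, so they cannot be reallocated
 * before the current delta is safely on disk.
 */
static int redirect_extent(struct extent *ext, struct defree *defree)
{
	int fresh;

	assert(ext != NULL && defree != NULL);
	if (defree->nr == MAX_DEFREE)
		return -1;
	fresh = alloc_blocks(ext->count);
	if (fresh < 0)
		return -1;
	defree->list[defree->nr++] = *ext;	/* remember the old location */
	ext->physical = (unsigned)fresh;
	return 0;
}

/* After the delta reaches disk: release everything recorded during it */
static unsigned commit_delta(struct defree *defree)
{
	unsigned freed = 0;

	for (size_t i = 0; i < defree->nr; i++) {
		memset(in_use + defree->list[i].physical, 0, defree->list[i].count);
		freed += defree->list[i].count;
	}
	defree->nr = 0;
	return freed;
}
```

With versioning, the commit step would additionally have to check that no
other version still references the old blocks, which this sketch does not
attempt.
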

Just looking at the code (again, writing about this helps), get_segs
could easily be refactored so that gaps are returned as holes for
write, just like read, and the existing update_tree code can be pulled
out into a separate function that calls get_segs, then does the

In spirit, get_segs resembles the traditional ->get_block interface
used by the block IO library, but has an interface better suited to the
problem rather than staying wedded to an ancient existing structure
(buffer_head) as ->get_block does. It works well as a way of gluing
the traditional ->get_block interface onto our model, as can be seen
from the implementation of tux3_get_block in filemap.c.

For mapping big regions with complex physical allocation patterns,
get_segs provides a much more efficient interface than the existing
block-at-a-time arrangement using buffer_heads, which was somewhat
painfully extended over time to handle multiple contiguous blocks in

The get_segs interface allows us to invert the traditional block library
approach of calling ->get_block multiple times to build up a
multi-block IO transfer. Instead, we retrieve the physical mapping for
a big logical region in one call to get_segs, with just one probe into
the btree.

The current implementation of get_segs has a few deficiencies that are
still being worked on:

- Existing extents that overlap the beginning or end of the logical
  region are not handled correctly on write.

- The inner loop on extents always stops after the first extent,
  which was to keep it compatible with Hirofumi's functional
  implementation of tux3_get_block. Now the original loop can be
  restored, and we just tell get_segs to return one segment at a
  time.

In fact, the segments I talk about here are really just extents.
When the interface settles down and proves itself out, the natural
thing to do is rename it as get_extents, which could potentially
evolve into a new, improved block library interface better suited
to modern filesystems.