Design note: Metadata readahead
d.phillips at partner.samsung.com
Tue Feb 4 15:57:02 PST 2014
Here is a great cache optimization we can do that is really obvious and
has been available to us for a long time.
Our volmap cache caches all physical metadata (metadata not mapped into
inodes) and is really just a specialized page cache. Page cache for
mapping inode data and volume map cache are supposed to be exclusive -
any cached block in one will not be in the other. But nothing bad will
happen if we load a page map block into the volume map, as long as it
remains clean there. This gives us a natural mechanism for metadata
readahead. We can load any block we want into the volume map, without
knowing whether it is really metadata. If a metadata block read happens
to hit that block then we probably save a synchronous stall on read, and
maybe a seek too. Saving even one stall or seek is sufficient to justify
a fairly large number of speculative reads into the volume cache.
Using the volume map this way essentially provides us with a huge
physical disk cache, at least for metadata. We can extend the benefit to
data as well, by checking the volume map on every page cache miss with
the hope of hitting preloaded blocks and avoiding media reads. For now I
will focus only on the benefits for metadata.
We can increase the probability of hitting future metadata on a
speculative read by clustering metadata together in some known location.
Such a location can either be globally known, such as near the base of
an allocation group, or known by looking at higher pointers in the
filesystem tree. Globally known locations is the easy place to start.
Another good place to start is with itree index nodes. There will be
from zero to ten or so of these per allocation group, averaging perhaps
three or four. If the allocation goal for these is consistently the same
place, then some will be free blocks due to being released after delta
commit. We can identify these by consulting the bitmap to avoid wasting
bandwidth reading them, and also look into the volume map to avoid
reading any blocks already present.
Here is some new terminology to help make things clear. In an N level
btree, we normally number the levels 0 to N, where N is a leaf block.
Alternatively, we can number the levels starting from the deepest level,
using negative numbers, so that a btree node that points at leaf blocks
is at level -1 and the root node is level -N.
My immediate proposal is to read level -1 itree nodes speculatively. The
readahead code goes where we currently load btree nodes (probing a
cursor using Tux3 terminology). A simple approach is, whenever we read a
level -1 node, we do readahead for other nearby level -1 nodes. We know
all the physical locations of these from the level -2 node, which is
already in our btree cursor. "Nearby" might be defined as within a
megabyte or so, somewhat less than a typical hard disk cylinder. The
actual read can be optimized in various ways. In kernel, we want to
submit an asynchronous bio, and ideally we would like to do ahead of
time (as soon as the level -2 block lands in cache for example) instead
of stalling on a block read as we are just about to do. For user space,
we just bread the candidate blocks.
We should be able to measure a speedup even for a simple incarnation of
this strategy. We can go further: if we know that the region we are
reading from is fairly dense with metadata, we might as well preload all
blocks in the region not free in the bitmap and not present in the
volume cache. We can also take into account which allocation groups are
"hot" and do readahead in several of them.
Our fsck is currently seeky and slow. This readahead strategy can
accelerate it dramatically. This also changes the volume layout
equation: it becomes more advantageous to clump btree node blocks
together for readahead than to place them near children blocks. In some
cases we may be able to do both, for example, we could interleave groups
of itree leaf blocks with groups of directory blocks. Or we can place
all itree leaf blocks together just above the itree node blocks at the
base of the group, and end up with a group layout that looks a lot like
Ext4 - even though we still have the option of storing anything anywhere
(with better volume utilization, fewer limits, and simplicity of corner
case handling) our optimal layout might resemble traditional Ext2/3/4.
More information about the Tux3