Design note: Metadata readahead

Tue Feb 4 15:57:02 PST 2014

Here is a great cache optimization we can do that is really obvious and 
has been available to us for a long time.

Our volmap cache caches all physical metadata (metadata not mapped into 
inodes) and is really just a specialized page cache. Page cache for 
mapping inode data and volume map cache are supposed to be exclusive - 
any cached block in one will not be in the other. But nothing bad will 
happen if we load a page map block into the volume map, as long as it 
remains clean there. This gives us a natural mechanism for metadata 
readahead. We can load any block we want into the volume map, without 
knowing whether it is really metadata. If a metadata block read happens 
to hit that block then we probably save a synchronous stall on read, and 
maybe a seek too. Saving even one stall or seek is sufficient to justify 
a fairly large number of speculative reads into the volume cache.

Using the volume map this way essentially provides us with a huge 
physical disk cache, at least for metadata. We can extend the benefit to 
data as well, by checking the volume map on every page cache miss with 
the hope of hitting preloaded blocks and avoiding media reads. For now I 
will focus only on the benefits for metadata.

We can increase the probability of hitting future metadata on a 
speculative read by clustering metadata together in some known location. 
Such a location can either be globally known, such as near the base of 
an allocation group, or known by looking at higher pointers in the 
filesystem tree. Globally known locations is the easy place to start.

Another good place to start is with itree index nodes. There will be 
from zero to ten or so of these per allocation group, averaging perhaps 
three or four. If the allocation goal for these is consistently the same 
place, then some will be free blocks due to being released after delta 
commit. We can identify these by consulting the bitmap to avoid wasting 
bandwidth reading them, and also look into the volume map to avoid 
reading any blocks already present.

Here is some new terminology to help make things clear. In an N level 
btree, we normally number the levels 0 to N, where N is a leaf block. 
Alternatively, we can number the levels starting from the deepest level, 
using negative numbers, so that a btree node that points at leaf blocks 
is at level -1 and the root node is level -N.

My immediate proposal is to read level -1 itree nodes speculatively. The 
readahead code goes where we currently load btree nodes (probing a 
cursor using Tux3 terminology). A simple approach is, whenever we read a 
level -1 node, we do readahead for other nearby level -1 nodes. We know 
all the physical locations of these from the level -2 node, which is 
already in our btree cursor. "Nearby" might be defined as within a 
megabyte or so, somewhat less than a typical hard disk cylinder. The 
actual read can be optimized in various ways. In kernel, we want to 
submit an asynchronous bio, and ideally we would like to do ahead of 
time (as soon as the level -2 block lands in cache for example) instead 
of stalling on a block read as we are just about to do. For user space, 
we just bread the candidate blocks.

We should be able to measure a speedup even for a simple incarnation of 
this strategy. We can go further: if we know that the region we are 
reading from is fairly dense with metadata, we might as well preload all 
blocks in the region not free in the bitmap and not present in the 
volume cache. We can also take into account which allocation groups are 
"hot" and do readahead in several of them.

Our fsck is currently seeky and slow. This readahead strategy can 
accelerate it dramatically. This also changes the volume layout 
equation: it becomes more advantageous to clump btree node blocks 
together for readahead than to place them near children blocks. In some 
cases we may be able to do both, for example, we could interleave groups 
of itree leaf blocks with groups of directory blocks. Or we can place 
all itree leaf blocks together just above the itree node blocks at the 
base of the group, and end up with a group layout that looks a lot like 
Ext4 - even though we still have the option of storing anything anywhere 
(with better volume utilization, fewer limits, and simplicity of corner 
case handling) our optimal layout might resemble traditional Ext2/3/4.

Regards,

Daniel