Design note: Pseudocode for allocation heuristics
hirofumi at mail.parknet.co.jp
Sat Jul 13 03:57:10 PDT 2013
[Add inode allocation accelerate, as draft]
This is a long design note and not particularly well organized. In no
particular order, it touches on many of my thoughts on allocation
heuristics that have been developing since the Tux3 project began, or
indeed may originate with earlier work such as Tux2 or Ddsnap.
This note is way too long to read in detail in one pass. Please skip
forward to the pseudocode at the end, then return to the rationale below to
see what the code actually intends to accomplish. The pseudocode does not
implement all the ideas covered in the rationale, far from it, and perhaps
omits some essential detail. As a design document, this is just an initial
draft which hopefully will be maintained and improved. In the mean time,
work on a concrete implementation is about to start, so any major
objections need to surface now.
The time has arrived time for a specific proposal on allocation heuristics.
This is a challenging subject that tends to resist precise analysis because
of the complex behavior that may result from unpredictable allocation
loads. Today's proposal cuts away a lot of complexity by introducing the
design rule that allocation goal is mainly controlled by the choice of
inode number. In other words, we introduce an "artificial" tie between
inode number and allocation goal. It then becomes possible to present a pair
of simple algorithms, one for choosing inode numbers and one for satisfying
block allocation requests, that should work together to yield reasonable
layout and aging characteristics.
Our simple linear allocator
Currently, Tux3 uses a simple placeholder block allocation strategy: we
remember the last allocated block, and begin the next search for free space
just above that. We do the same thing for inode numbers. Further, we have
tweaked the order in which allocation calls are made so that metadata blocks
are allocated in a reasonable order for reading.
This works surprisingly well for a number of test loads we currently run,
however, there are also many examples where it performs poorly. When the
allocator reaches the end of a volume and wraps back to the beginning and
starts reusing blocks that were freed either by explicit deletes or by
copy-on-write, fragmentation will increase rapidly.
We use the same allocation strategy to choose inode numbers. With a 48 bit
inode number space available, the allocator effectively never wraps, but
also never reuses gaps in the inode table created by deletes. New files
created in an old directory will have inodes allocated far from the
original files, so that a mass operation on all files of the directory may
need to load more inode table blocks than would be necessary if inodes
inodes belonging to the same directory were packed together more
Our goal is improve our allocator to resist fragmentation as a Tux3 volume
ages under various loads. Designing a good and simple algorithm for this is
a challenging task. We can look to Ext3/4 for some guidance, which have very
good reputations for fragmentation resistance. However, the Tux3 problem is
dissimilar from the Ext3/4 problem because of the need to write updates
into freespace as opposed to overwriting existing data in place. Thus, Tux3
data always moves around as it is updated, which could cause increased
fragmentation without special treatment. On the other hand, this might also
present opportunities to improve layout as a volume ages.
Tux3 updates data in deltas, with the requirement that no blocks that will
be obsoleted and freed by a delta may be freed and possibly overwritten
until the entire delta transfer has completed. A delta can be as large as
all writable cache. This amount of data can potentially be "pinned" until
the delta is complete. The pinned data will often occupy the ideal
allocation targets that the new data would prefer to use. After all, we must
have carefully considered the question of where we should store that data in
the first place.
New data structures
* Free tag map: persistent array of allocation group usage counts to
accelerate global search for free blocks across a potentially large volume.
* Allocation tracker: in-memory record of most recent allocation position,
also inode of last allocation and logical position within last inode.
(An elaboration of current sb->lastblock. May eventually generalize to track
multiple parallel allocation behavior.)
Rationale (in random order)
* Object is to keep temporally related data and metadata together compactly
both initially and as a filesystem ages. This has the dual purpose of reducing
seeks on rotating media and reducing write multiplication due to false
sharing within flash media erase blocks.
* In general, optimizations that help reduce seeks will also either help
reduce write multiplication on flash media or do no harm. So for now we will
focus on optimizing for spinning media.
* Most inode or block allocations should simply search linearly for the
next available object if it is not too far away, because this is often the
best choice and it is efficient. To help this work well, we take care to
issue block allocation requests in an order that lays out the data and
metadata blocks in a good order for reading.
* Block allocation goal corresponds to inode number. This design rule may
be regarded as a temporary hack. It gives us a simple model for improving
behavior for random versus linear filesystem updates. Disadvantage is, may
be excessively rigid. Inode number selection is an initial guess based on
current and probable future allocation conditions, and may not correspond
well to actual future activity.
* We may improve on inum => goal later by providing an override mechanism
to re-specify the base goal for an inode. This will be a new inode attr, on
inode load the override goal will be cached in the inode so frontend does
not need to access the itable on each alloc to check for overrides.
Overrides would be mainly used on directories, implicitly overriding all
non-directory inodes within the directory because the directory goal is used
as the base for file extrapolation. Over time, the effect will be to migrate
the directory contents to the override goal location as updates are made.
Typically, contents would be clustered in just two or a small number of
regions at intermediate stages of such migration, an acceptable amount
of locality disturbance. Explicit defragmentation could be used to force a
* Nice property: when volume is initially nearly empty, we can make fairly
poor decisions about where future data should concentrate and not suffer
badly from them just by leaving a lot of slack space. When the volume starts
to fill up, we have more historical data to base layout decisions on, plus
we can change our mind via the goal override mechanism.
* In the interest of keeping the initial implementation work on smarter
heuristics small, we will initially only implement free tags for block
allocation, not inodes. Inode allocation use simple linear search based with
goals guided by heuristics described here, and other heuristics we will
discover from experience.
* We may eventually add bitmaps to map free elements in the itable. Whether
or not this would be a good thing is still an open question. Would the extra
cost of maintaining inode bitmaps in cache and possibly also durably on
media be more than balanced by improved scanning efficiency for free inodes?
Is the (modest) increase in metadata complexity justified? Inode bitmaps
would tidily solve one issue we have - frontend needs to reserve new inode
allocations, but only backend is allowed to update the itable. Currently
frontend keeps a list of newly allocated inodes which it consults prior to
any search within the itable, to avoid reallocating an already allocated
inode number. By front-ending the itable with an allocation bitmap, frontend
can mark the inode used in the bitmap, removing the requirement to search a
list on each allocation. We may find that the most efficient formulation is
to let the inode bitmap be a pure cache object where frontend loads inode
bitmap pages (or possibly finer granularity units) on demand by scanning the
inode table. On a linear load, in many cases only a single bitmap page would
be loaded. This would require scanning possibly quite a lot of the inode
table, but maybe that is good, it would be a kind of readahead in the inode
table. In contrast, a persistent implementation of inode free tags may be a
good solution. These free tags would track populations within regions of the
inode table, rather than being represented by durable bitmaps on media. Open
question, but leaning towards the volatile bitmap plus durable free tag map
* In comparison to the Ext* family we have a similar layout and aging
problem except that we have additional demands imposed by our nondestructive
data/metadata update design rule. It could be worse - a shared-tree snapshot
design like Btrfs can be expected to pin significantly more media blocks for
a longer time. We pin most freeable blocks only until delta completion, and
a much smaller number of metadata blocks until the next unify (rollup)
cycle. Nonetheless, the per-delta pinning requirement will often cause an ideal
goal choice to be unavailable. A simple rewrite of an entire file is a good
example - the entire file moves to a new position. However, the original
goal remains the same, so a following full rewrite has a good chance of
moving the file back to the original position. This behavior is expected to
be common, so we give it a name: flickering. If our heuristics are working
properly, we should see a lot of flickering. One danger is that flickering
could develop into data block interleaving between inodes. We may need to
implement specific heuristics to combat this.
* As a volume approaches full we should start to cap the delta size, to
reduce the demands imposed by flickering and to more efficiently consume the
remaining, probably fragmented blocks. We might also cap the maximum size of
a linear write for similar reasons. It remains to be determined if this is
* One way of thinking about our atomic commit strategy is that our journal
effectively lies all over the volume. The good news is, that is efficient
both for reading and writing. The bad news is, it requires leaving a lot of
slack space. As the volume fills up we need to leave less slack space, and
be mindful that this may cause harmful long range "flickering".
* Metadata blocks are the easiest to allocate because they can fit into
the smallest possible holes, but relatively more important to allocate
close to ideal positions because they may affect read seeking for
potentially large subtrees. So take care to leave some sufficient slack for
metadata blocks. Perhaps prefer to bounce some data allocation requests
away from regions that are nearly full, in order to leave some free blocks
for later metadata allocations.
* It is important to make a good initial selection of inode number because
for this initial algorithm, it sets the goal position of each inode or
* Bounce: check for some threshold amount of free space in the next block
group in a generating sequence that is randomized by inum, starting with
the original goal and progressively getting further away. After some limited
number of bounce stages, continue with a linear scan wrapping around all
block groups, to be sure of not overlooking any free space in the volume.
* Bouncing is our catch-all method of filling up available space while
trying to maintain reasonable allocation locality. If an allocation at a
particular goal region fails because of insufficient space or too many
fragments to fill an extent, then bounce to the next candidate goal and keep
trying. See bounce pattern.
* Bounce pattern is a generating function that produces a sequence of
allocation units (block group) to test for suitability as an allocation
goal. This pattern needs to vary depending on inode, to improve the
probability of densely filling available storage given unpredictable update
loads on nearby directories and regular files.
* No inode bounce for now because no itable free tags yet.
* One measure of our heuristic success will be average number of bounces
required to satisfy a goal search.
* Tie between inum and allocation goal provides an efficient test for
whether or not to use linear allocation: if in the same inode and logical
address of next request is contiguous with last, or if next request is at
the beginning of the immediately following inode, use linear allocation.
Otherwise, do a more expensive extrapolation.
See above rationale. Note how each allocator starts with simple test that
will hopefully choose the simple linear allocation path most often.
alloc inode (frontend)
if not a directory
if same parent dir
// continue linear alloc
if run > some limit
skip some inums // keep inums dense but realign sometimes
use last inum alloced + 1 as inum goal
extrapolate inum goal from dirent position and dir inum
else is a directory
// find a place the directory can grow
extrapolate dir inum goal using path depth and population pressure
alloc extent (backend)
if same or next inum
// continue linear alloc
if run > some limit
add some slack space // for rewrites
use last block alloced + 1 as block goal
search and if found
// compute new goal
extrapolate new block goal by logical position using inum as base
search and if found
* Inode allocation accelerate (cache of itree)
Because, to find free inum by traversing itree can be slow. And if
this traversal is slow, latency of frontend of inode creation will be
increased. So, this discuss a bit about acceleration of inode
Pros. and cons. of 3 ways cache strategy.
1) persistent inode bitmap (like ext4)
- makes faster frontend, but makes slower backend. Because,
backend has to maintain/flush the persistent inode bitmap.
2) in-core bitmap cache of itree (volatile inode bitmap)
- makes little faster frontend, no backend overhead, but memory
footprint is bigger.
- memory footprint can actually be same with (1). But penalty of
rebuilding the cache is greater than (1), because this has to
traverse itree again. So we may want to pin more memory than
(1), to archive same effect with (1). After all, this may be
bigger memory footprint than (1).
3) no cache (btree cursor cache only)
- if we add cursor cache of btree, inode allocation might be
enough fast. If this is true, we will not need to bother by
OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
More information about the Tux3