[Tux3] All about the free tree

Daniel Phillips phillips at phunq.net
Wed Aug 13 14:43:06 PDT 2008


The Tux3 free block map is a dynamically allocated tree, and this
introduces a recursion: what happens when we try to allocate a block
but the leaf of the tree that maps the block does not exist yet, has to
be allocated, and we want to allocate it on that same free map leaf
that does not yet exist?  Answer: infinite recursion ending with stack
overflow, unless the recursion is limited in some way.  I considered
two obvious approaches:

 * Do not record allocations of free tree leaves or nodes in the free
   tree itself, but log them using a logical log entry as already
   planned for the atomic commit mechanism.

 * Use bitmaps for now, not extents.  Preallocate the whole free map
   without using an allocator, deciding where to place the blocks by
   an ad hoc method.  Once fully preallocated, enter the btree block
   addresses into the free map "by hand" as ddsnap does.

Then I thought: since I am going to allocate with bitmaps at first
anyway, why not put the bitmaps in a file instead of a special purpose
btree.  Then the bitmap blocks can be accessed through the file cache
(aka page cache in kernel, or logical block cache as I have sometimes
called it).

The really cool thing about this strategy is, it unexpectedly solves
the above recursion.  Changes to the bitmap are made in the cache,
which is a simple logically indexed array.  This naturally defers the
actual allocations until the cache is flushed to disk.

Woohoo, this is nice.  Now what about when we go to extent based free
space mapping?  Ok, that is easy.  There will be a separate free map
btree with extents at the leaves.  That btree could be mapped into a
file just like an htree directory index, but that would not avoid the
recursion.  This is where the logical logging method described above
can be used.

(It is not yet clear to me whether it is best to map the free extent
btree into a file or have it separate, at the same level in the Tux3
structure as the inode table.)

An extent based free map is efficient only for large allocations,
which are possible only when free space is not fragmented.  Sometimes
free space will be so fragmented in a given region that it would be
more efficient to use a bitmap for that region.  So I plan to implement
both methods and combine them in Tux3.  There will be running
statistics to keep track of fragmentation per 128 MB region (the size
that can be mapped by a single 4K bitmap block).  If fragmentation
rises above some threshold then several blocks of the free extent map
will be replaced by a specially marked pointer that is the logical
block number of a bitmap block in the bitmap file.  Likewise, if it
falls below some threshold across several regions then some bitmap
blocks will be replaced by a single extent leaf.

What do we mean by "large allocations"?  First consider a maximally
fragmented situation where every second block in an 8 block region is
free.  That can be represented in one byte with a bitmap, but requires
four 12 byte extents (8 byte pointer + count, 4 byte logical index
overhead per the dleaf design) for a total of 48 bytes, plus some per
leaf overhead.  A factor of 50 difference in storage efficiency, and
that is huge.

Now where is the crossover point where extents become more efficient
that a bitmap?  If every 128th block is free in a 512 block region,
that also requires 48 bytes to represent with extents, but 64 bytes
with bitmaps.  The crossover point is somewhere between average run
size of 64 and 128 free blocks in a given region.  The same logic
applies to runs of used blocks.  By keeping statistics of run size per
128 MB region we know that a region can be represented more efficiently
by extents if the run size is well above 128 and by a bitmap if it is
well below 64.  In between it does not matter much, which provides a
lazy band to keep the representation from changing too frequently.

Now it is time to implement the bitmap version of Tux3 free map.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3



More information about the Tux3 mailing list