Design note: Simplified implementation of free block tags

Daniel Phillips daniel.raymond.phillips at gmail.com
Mon Jun 24 22:59:23 PDT 2013


Free tags

Free tags in Tux3 will perforrm a similar function to Ext2's
"unallocated block" counts. In Ext2, an unallocated block count is a
16 bit field in the group descriptor for each block group. Tux3 does
not have group descriptors, but will use for a similar purpose a table
of "free tags", where the table is just a large linear array at some
offset in a special inode.

Accounting data provided by  free tags is required by several
different allocators in Tux3 in order to scale effectively to large
allocation spaces. Our allocation algorithms need to be able to locate
available free objects in volumes theoretically as large as one
exabyte, and in practice, up to several terabytes for personal
workstation usage. We need to track three kinds of free objects:

     Free Blocks
     Free Inode numbers
     Free directory records

Each of these kinds of objects has its own unique allocation
characteristics. For example with free directory records we are
primarily interested in the availability of records of a given size,
while for free blocks we are mainly interested in the total free block
count in some given region of a volume. We will attempt to view these
three kinds of allocation space as similar in some important respects,
and handle them with similar data structures and algorithms.

For now, we will just concentrate on prototyping sufficient
functionality for the first of these, free blocks, to enable us to
proceed with work on high level block allocation algorithms.

Initial free block tracking

Our initial implementation of free block tracking will use a simple
persistent table of 16 bit counts of the allocated blocks in a single
4K block bitmap block, starting at offset zero in the table.

Note: the size of a "bitmap block" is probably going to change for
non-4K volumes. For now it is one filesystem block. But this does not
scale nicely as block size increases. We will eventually change to a
fixed size bitmaps that do not change size as filesystem block size
changes, but have not yet done so. For now, each free tag map element
will always cover one block's worth of bitmap bits.

Note: we will be storing the allocated block count, not the free block
count, so that a block full of zeroes is interpreted as maximum
availability. The relationship is simply:

     free_blocks = 2**(12 + 3) - allocated blocks.

As with block bitmaps, we need to consider the possibility of
allocation recursion in this structure, that is, the case where
allocating a new block for the free block map itself causes a free
block tag to be updated. We will deal with this recursion in the same
way as the bitmap: we will log changes to the free block map instead
of directly writing out changed free block map blocks per delta.
Eventually, dirty free block map blocks will be written out as part of
a unify (rollup), just like bitmap blocks.

As an initial implementation, we will simply increment a free block
tag every time a bit changes from zero to one in the respective bitmap
block (a block is allocated), and decrement it every time a bit
changes from one to zero (a block is freed).

Future Optimization: lazy updates

It is possible that the simplistic update strategy above may slow down
writes measurably by generating additional CPU work updating tags,
increasing the size of per-delta logs, and increasing the number of
blocks needing to be flushed in a unify (rollup). If this turns out to
be the case, then it is thought that such overhead can be reduced by
lazily updating the tag maps. This proposition may be tested
experimentally, given the simple implementation described here to test
it against.

Future Optimization: one byte tags

It is possible that exact 16 bit block counts are higher precision
than is actually required by effective volume layout optimization
algorithms. An approximate representation that can be stored in a
single byte might do just as well, and cut down the size of the
mapping tables by a factor of two. This can be tested experimentally,
later.

Future Optimization: hierarchical maps

For very large volumes, we may find that scanning the free tag map
becomes inefficient. Then it may be worth mapping the tag map itself,
with a higher level tag map so that we can rapidly locate free space
in extremely large volumes. Whether such an optimization is actually
needed for realistic volume sizes is an open question that we can
test, given an initial, simple implementation.



More information about the Tux3 mailing list