Design note: Allocation Groups

Daniel Phillips d.phillips at
Sat Jan 25 22:51:03 PST 2014

Tux3 has borrowed yet another tried and true design concept from Ext4: 
group descriptors, and as usual, we will do our best to improve this 
proven technique. Some means of accelerating free space searches has 
been planned for a long time, because our placeholder linear block 
allocator has obvious scalability issues. Originally, I intended to fix 
this issue with an approach called "free tags" that adds allocation 
statistics into otherwise unused fields of bitmap table btree nodes. 
While this would have enabled O(log N) free space searches, it would 
also have introduced some issues, including coupling our asynchronous 
frontend to backend metadata in a fairly nasty way. No doubt, we could 
have made it work, but the end result would surely have been less than 
satisfactory in terms of design simplicity.

Ted Ts'o talked me out of the free tags idea last year at the New 
Orleans Linuxcon after I described the new Shardmap directory index 
design, which includes it. To his credit, Ted immediately disliked the 
proposal, and I countered that a simple linear allocation map would 
probably work just as well. Later, I realized that a linear map would 
actually work much better and be cleaner, and I also realized that I had 
just reinvented Ext4 group descriptors.

A Tux3 group is some power-of-two number of blocks, independent of the 
filesystem block size. Our default is the same as Ext4: 2**15 blocks per 
group, being the number of blocks covered by a bitmap that fits in one 
4K block. Due to our 15 bit group allocation count, that is the largest 
group size we support. So a Tux3 block group covers at most 128 MB of 
volume space, and has a minimum size yet to be determined, perhaps as 
small as 128 bytes of bitmap, or 4 MB of volume space assuming 4K block 
size. Because allocation group size is our granularity of allocation 
accounting, we suspect that smaller group size will improve volume 
layout, but this remains to be seen experimentally. We may also discover 
that group size larger than 2**15 blocks is actually desirable and thus 
be compelled to use larger allocation counters, but this seems unlikely.
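The size arithmetic above can be checked in a few lines. This is only a 
sketch; the 4K block size and the speculative 128-byte minimum bitmap are 
the assumptions stated in the text:

```python
BLOCK_SIZE = 4096  # bytes: the 4K filesystem block size assumed above

# Default group: the blocks covered by one full bitmap block.
blocks_per_group = BLOCK_SIZE * 8               # 8 bits per byte -> 2**15 blocks
group_coverage = blocks_per_group * BLOCK_SIZE  # 128 MB of volume space

# Speculative minimum group: a 128-byte bitmap.
min_blocks = 128 * 8                            # 1024 blocks
min_coverage = min_blocks * BLOCK_SIZE          # 4 MB of volume space

print(blocks_per_group, group_coverage >> 20, min_coverage >> 20)
# prints: 32768 128 4
```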

Unlike Ext4, Tux3 allocation groups impose no layout restrictions. I 
originally proposed some, for example, a bitmap block at some fixed 
offset in every group or a superblock in every group, but Hirofumi 
talked me out of it. As a result, a Tux3 volume continues to present a 
single, clean allocation space where we can allocate anything anywhere. 
Instead of fixed allocations, we have the concept of group affinity, 
where we attempt to allocate blocks near some group to which they 
nominally belong. A major motivation for group affinity is a future 
feature: online fsck, performed incrementally, group by group. To make 
this practical, we need to record some additional per group metadata 
with overhead that depends on the frequency of cross group block 
pointers. (Note: Larger group size reduces cross group block pointer 
frequency, but smaller group size improves layout efficiency. Which is 
more important? It might be best to resolve this tradeoff by introducing 
a separate group check size.)

In contrast to Ext4's group descriptors, which are 32 or 64 bytes each, 
a Tux3 group descriptor is just two bytes. This includes a 15 bit group 
allocation count and one bit of state reserved for future use 
(incremental fsck). Being 16 or 32 times smaller than Ext4 group 
descriptors, we expect a measurable advantage in allocation search 
performance. To get there, we omit a number of fields that Ext4 uses. We 
do not need Ext4's pointers to block bitmap, inode bitmap or inode table 
blocks because these are not per-group in Tux3 (we use an internal file 
for the first and a btree for the other two). We do not record inode and 
directory counts because we expect them to correlate roughly to block 
allocation density. For the time being, we do not implement per group 
checksums. Should it turn out that we actually need some of these 
fields, the Tux3 way of doing things is to introduce an additional 
table, which we can do without changing the existing one at all. By 
dividing group descriptors across separate files we optimize for the 
important case of long range search when large regions of high 
allocation density exist, while performing about the same for other 
operations like positioning inodes and directories.
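As a sketch of how such a two-byte descriptor might be packed, consider 
the following. The little-endian layout and bit positions are illustrative 
assumptions, not Tux3's actual on-disk format:

```python
import struct

def pack_desc(count, state):
    # 15-bit allocation count in the low bits, one state bit on top.
    # Layout is an illustrative assumption, not the on-disk format.
    assert 0 <= count < (1 << 15) and state in (0, 1)
    return struct.pack('<H', count | (state << 15))

def unpack_desc(raw):
    word, = struct.unpack('<H', raw)
    return word & 0x7fff, word >> 15

# A full 4K block holds 2048 such two-byte descriptors.
assert len(pack_desc(12345, 1)) == 2
assert unpack_desc(pack_desc(12345, 1)) == (12345, 1)
```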

Atomic commit

Adding new metadata to a filesystem can potentially be a big job due to 
the atomic update requirement among other things. However, this added 
only 64 new lines to the tux3 code base, because of our habit of 
introducing new metadata in the form of system files, taking advantage 
of infrastructure already available for handling files. The atomic 
commit requirements for this new metadata are the same as for the 
existing bitmap file, and we were able to apply the same techniques. 
Like bitmaps, changed group descriptors do not need to be updated per 
delta, but can be reconstructed in dirty cache from allocation records 
during log replay, and only need to be written to media in periodic 
unify cycles. Also like the bitmap file, the group map file is subject 
to a subtle recursive update issue, which is solved in the same way.


Adding new metadata is also worrisome from the efficiency point of view. 
Performance being one of the main features of Tux3, we would like to 
keep it that way. It would be rather unpleasant to learn that our 
excellent performance results were only ever due to omitting essential 
functionality! So the goal for group map support was to introduce nearly 
zero new overhead, either CPU or media transfer. It appears as though 
that was achieved. Because of the large number of groups we cover with a 
single group map block, new media overhead amounts to one block per 
unify cycle, compared to something like a gigabyte of write traffic per 
unify under heavy load. Group count accesses were optimized by pinning 
the most recently accessed group map block in memory, which avoids a 
block read per access. The same technique works for the bitmap file, 
which we have not been doing so far, so in the end we will have a net 
efficiency improvement.
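The pinning optimization amounts to a one-slot cache in front of the 
block read path. A minimal sketch follows; the class and names are 
hypothetical, as is the two-byte descriptor layout:

```python
DESCS_PER_BLOCK = 4096 // 2  # 2-byte descriptors per 4K group map block

class GroupMapCache:
    """One-slot cache: pin the most recently accessed group map block."""

    def __init__(self, read_block):
        self.read_block = read_block  # callable: block index -> bytes
        self.pinned_index = None
        self.pinned_data = None
        self.reads = 0                # count media reads for illustration

    def group_count(self, group):
        index = group // DESCS_PER_BLOCK
        if index != self.pinned_index:        # miss: read and re-pin
            self.pinned_data = self.read_block(index)
            self.pinned_index = index
            self.reads += 1
        offset = (group % DESCS_PER_BLOCK) * 2
        word = int.from_bytes(self.pinned_data[offset:offset + 2], 'little')
        return word & 0x7fff                  # mask off the state bit
```

Repeated lookups within the same window of 2048 groups then cost no 
further block reads, which is the common case for linear allocation.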


For small volumes, group accounting is pure overhead with no benefits. 
However, with volume coverage of 128 MB per bitmap block, a volume does 
not need to grow very large before the overhead of scanning bitmap 
blocks becomes significant. In the common, linear allocation case we do 
not see that cost, because the next block we check is usually free; 
however, as a volume ages we will scan already allocated regions more 
often, and at some point bitmap scanning will dominate allocation costs. 
In fact, the complexity of linear scanning is O(N^2) over N allocation 
operations, because the cost of each allocation increases by a linear 
amount. The entire reason for having group counts is to cap this linear 
growth at some acceptable value, so that we rarely scan more than one 
group per allocation. Instead, we scan the group map to find an 
allocation group with a satisfactory amount of free space.
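The group map scan itself is simple. Here is a sketch of the idea, not 
Tux3's actual allocator; the function name and the start-at-goal policy 
(reflecting group affinity) are assumptions:

```python
BLOCKS_PER_GROUP = 1 << 15  # default group size: 2**15 blocks

def find_group(counts, needed, goal=0):
    """Scan per-group allocation counts for a group with at least
    `needed` free blocks, starting at the affinity (goal) group."""
    for i in range(len(counts)):
        group = (goal + i) % len(counts)
        if BLOCKS_PER_GROUP - counts[group] >= needed:
            return group
    return None  # no group has enough free space

# Groups 0 and 1 nearly full; group 2 has plenty of room.
counts = [BLOCKS_PER_GROUP, BLOCKS_PER_GROUP - 4, 100]
assert find_group(counts, needed=16, goal=0) == 2
```

Only after a group is chosen does the allocator touch that group's bitmap 
blocks, which is what caps the linear scanning cost.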

The group map covers 2K bitmap blocks (256 GB of volume space) per 4K 
group map block, so a volume must be very large indeed before group map 
scanning overhead begins to ramp up on its own O(N^2) complexity curve. 
For example, a 4 TB volume needs only 16 4K group map blocks. However, 
as progress marches on, larger disks will arrive and users will create 
larger Tux3 volumes. Fifteen years from now your home media server will 
have a Petabyte disk drive in it, which if entirely allocated to a Tux3 
volume, will need a 16 MB group map file, and Tux3 will sink to its knees 
if it needs to scan a large part of it per allocation. So the time will 
come when we must add yet another level to the mapping hierarchy, which 
we should plan for now, because as we all know, time tends to elapse 
surprisingly quickly.
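The coverage figures above follow directly from the two-byte descriptor 
size; a quick arithmetic check, under the 4K block size assumption:

```python
BLOCK_SIZE = 4096
DESC_SIZE = 2
GROUP_COVERAGE = (1 << 15) * BLOCK_SIZE         # 128 MB of volume per group

groups_per_map_block = BLOCK_SIZE // DESC_SIZE  # 2048 groups (2K bitmap blocks)
map_block_coverage = groups_per_map_block * GROUP_COVERAGE  # 256 GB

map_blocks_4tb = (4 << 40) // map_block_coverage            # 16 blocks
map_bytes_1pb = ((1 << 50) // GROUP_COVERAGE) * DESC_SIZE   # 16 MB

print(map_block_coverage >> 30, map_blocks_4tb, map_bytes_1pb >> 20)
# prints: 256 16 16
```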

Adding two more mapping levels will support O(log N) free space 
searching across our full 1 Exabyte design size. If each level maps a 
full block of the lower level, we have 4M 4 byte counts at the second 
level, mapped by 4K 8 byte counts at the third level. Additional map 
levels increase the size of a unify modestly, and per allocation CPU 
cost also increases slightly. At the current Ext4 limit of 16 TB we have 
only 
64 blocks of first level counts, not enough to benefit from additional 
mapping levels, so for now we will not invest implementation effort into 
a design feature that adds overhead without measurable benefit. Instead, 
we will take the opportunity to plan for forward compatibility.
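Working the same arithmetic at the 1 EB design size gives the level 
populations; this is a rough check under the rule that each level maps a 
full block of the level below it, with the count sizes stated above:

```python
BLOCK_SIZE = 4096
DESC_SIZE = 2
GROUP_COVERAGE = (1 << 15) * BLOCK_SIZE            # 128 MB per group

groups_1eb = (1 << 60) // GROUP_COVERAGE           # 2**33 groups at 1 EB
level1_blocks = groups_1eb * DESC_SIZE // BLOCK_SIZE   # 4M group map blocks

# One 4-byte count per level-1 block, one 8-byte count per level-2 block.
level2_counts = level1_blocks                      # 4M counts, 16 MB total
level2_blocks = level2_counts * 4 // BLOCK_SIZE    # 4096 blocks
level3_counts = level2_blocks                      # 4K counts, a few blocks

print(level1_blocks >> 20, level2_counts >> 20, level3_counts)
# prints: 4 4 4096
```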

Forward Compatibility

Consider a volume that is updated by a newer Tux3 version implementing 
higher level mapping layers, then remounted by an older Tux3 version 
that makes changes to level one counts, then once again remounted by the 
newer Tux3 version. Such a scenario is not at all unrealistic whenever 
users have the option to boot different kernel versions. Forward 
compatibility means that the volume does not become inconsistent 
according to either the newer or older Tux3 version.

Typically, the older version needs to anticipate addition of specific 
new data structures and invalidate them if found, while the newer 
version must tolerate such invalidation. In this case, it is 
straightforward: we reserve inode numbers for the additional levels but 
do not create the inodes and on mount, truncate any that are found to 
exist. Later Tux3 versions that know how to use this metadata can 
automatically repopulate it.

While we are at it, we should do the same for all our reserved inodes, 
because it costs roughly nothing and provides opportunities for forward 
compatibility in areas we know nothing about today. In general, forward 
compatibility simplifies adding new features without inconveniencing the 
user. And forward compatibility encourages deferring big development 
projects until actually needed, so we can develop with a simpler, 
cleaner code base at early stages and focus our effort better. We will 
therefore make a habit of identifying forward compatibility 
opportunities as they arise. Directory indexing is another obvious one, 
where for now we will just invalidate directory index metadata if 
detected, so that future Tux3 versions fall back to linear search. This 
strategy worked well back in HTree development days and taught me the 
value of good forward compatibility design.


