Design note: Allocation Groups
Daniel Phillips
d.phillips at partner.samsung.com
Sat Jan 25 22:51:03 PST 2014
Tux3 has borrowed yet another tried and true design concept from Ext4:
group descriptors, and as usual, we will do our best to improve this
proven technique. Some means of accelerating free space searches has
been planned for a long time, because our placeholder linear block
allocator has obvious scalability issues. Originally, I intended to fix
this issue with an approach called "free tags" that adds allocation
statistics into otherwise unused fields of bitmap table btree nodes.
While this would have enabled O(log N) free space searches, it would
also have introduced some issues, including coupling our asynchronous
frontend to backend metadata in a fairly nasty way. No doubt, we could
have made it work, but the end result would surely have been less than
satisfactory in terms of design simplicity.
Ted Ts'o talked me out of the free tags idea last year at the New
Orleans Linuxcon after I described the new Shardmap directory index
design, which includes it. To his credit, Ted immediately disliked the
proposal, and I countered that a simple linear allocation map would
probably work just as well. Later, I realized that a linear map would
actually work much better and be cleaner, and I also realized that I had
just reinvented Ext4 group descriptors.
A Tux3 group is some power-of-two number of blocks, independent of the
filesystem block size. Our default is the same as Ext4: 2**15 blocks per
group, being the number of blocks covered by a bitmap that fits in one
4K block. Due to our 15 bit group allocation count, that is the largest
group size we support. So a Tux3 block group covers at most 128 MB of
volume space, and has a minimum size yet to be determined, perhaps as
small as 128 bytes of bitmap, or 4 MB of volume space assuming 4K block
size. Because allocation group size is our granularity of allocation
accounting, we suspect that smaller group size will improve volume
layout, but this remains to be seen experimentally. We may also discover
that group size larger than 2**15 blocks is actually desirable and thus
be compelled to use larger allocation counters, but this seems unlikely.
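For concreteness, here is a rough sketch of the group geometry
arithmetic, assuming the default 2**15 blocks per group and 4K block
size; the constant and helper names are illustrative, not actual Tux3
identifiers:

    /* Sketch of the default group geometry (names invented for
     * illustration, not taken from the Tux3 source). */

    #include <stdint.h>

    #define GROUP_BITS      15                      /* 2**15 blocks per group */
    #define GROUP_BLOCKS    (1ULL << GROUP_BITS)    /* 32768 blocks */
    #define BLOCK_SIZE      4096                    /* assumed block size */

    /* Which group does a block nominally belong to? */
    static inline uint64_t block_to_group(uint64_t block)
    {
            return block >> GROUP_BITS;
    }

    /* Volume space covered by one group: 32768 * 4096 = 128 MB. */
    static inline uint64_t group_coverage(void)
    {
            return GROUP_BLOCKS * BLOCK_SIZE;
    }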
Unlike Ext4, Tux3 allocation groups impose no layout restrictions. I
originally proposed some, for example, a bitmap block at some fixed
offset in every group or a superblock in every group, but Hirofumi
talked me out of it. As a result, a Tux3 volume continues to present a
single, clean allocation space where we can allocate anything anywhere.
Instead of fixed allocations, we have the concept of group affinity,
where we attempt to allocate blocks near some group to which they
nominally belong. A major motivation for group affinity is a future
feature: online fsck, performed incrementally, group by group. To make
this practical, we need to record some additional per group metadata
with overhead that depends on the frequency of cross group block
pointers. (Note: Larger group size reduces cross group block pointer
frequency, but smaller group size improves layout efficiency. Which is
more important? It might be best to resolve this tradeoff by introducing
a separate group check size.)
In contrast to Ext4's group descriptors, which are 32 or 64 bytes each,
a Tux3 group descriptor is just two bytes. This includes a 15 bit group
allocation count and one bit of state reserved for future use
(incremental fsck). Being 16 or 32 times smaller than Ext4 group
descriptors, we expect a measurable advantage in allocation search
performance. To get there, we omit a number of fields that Ext4 uses. We
do not need Ext4's pointers to block bitmap, inode bitmap or inode table
blocks because these are not per-group in Tux3 (we use an internal file
for the first and a btree for the other two). We do not record inode and
directory counts because we expect them to correlate roughly to block
allocation density. For the time being, we do not implement per group
checksums. Should it turn out that we actually need some of these
fields, the Tux3 way of doing things is to introduce an additional
table, which we can do without changing the existing one at all. By
dividing group descriptors across separate files we optimize for the
important case of long range search when large regions of high
allocation density exist, while performing about the same for other
operations like positioning inodes and directories.
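To make the two byte format concrete, here is a rough sketch of how
such a descriptor might be packed and unpacked. The helper names are
invented for illustration, and how a completely full 2**15 block group
is represented is a detail of the real format not covered here:

    /* Sketch of a two byte group descriptor: low 15 bits hold the
     * group allocation count, the top bit is state reserved for
     * future use (incremental fsck). Names are illustrative only. */

    #include <stdint.h>

    #define GROUP_COUNT_MASK 0x7fffu

    static inline unsigned groupdesc_count(uint16_t desc)
    {
            return desc & GROUP_COUNT_MASK;  /* 15 bit allocation count */
    }

    static inline unsigned groupdesc_state(uint16_t desc)
    {
            return desc >> 15;               /* reserved state bit */
    }

    static inline uint16_t groupdesc_pack(unsigned count, unsigned state)
    {
            return (uint16_t)((state << 15) | (count & GROUP_COUNT_MASK));
    }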
Atomic commit
Adding new metadata to a filesystem can potentially be a big job due to
the atomic update requirement among other things. However, this added
only 64 new lines to the Tux3 code base, because of our habit of
introducing new metadata in the form of system files, taking advantage
of infrastructure already available for handling files. The atomic
commit requirements for this new metadata are the same as for the
existing bitmap file, and we were able to apply the same techniques.
Like bitmaps, changed group descriptors do not need to be updated per
delta, but can be reconstructed in dirty cache from allocation records
during log replay, and only need to be written to media in periodic
unify cycles. Also like the bitmap file, the group map file is subject
to a subtle recursive update issue, which is solved in the same way.
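As an illustration of the replay idea, here is a rough sketch of
rebuilding group counts in dirty cache from logged allocation extents.
The record layout and helper names are invented and do not reflect the
actual Tux3 log format:

    /* Sketch of group count reconstruction during log replay
     * (record layout and names invented for illustration). */

    #include <stdint.h>

    struct alloc_record { uint64_t start; uint32_t blocks; int is_free; };

    extern uint16_t *group_count;   /* in-core counts, already in dirty cache */

    static void replay_alloc_record(const struct alloc_record *rec)
    {
            uint64_t block = rec->start, left = rec->blocks;

            while (left) {
                    uint64_t group = block >> 15;
                    /* blocks remaining before the next group boundary */
                    uint64_t in_group = (1ULL << 15) - (block & ((1ULL << 15) - 1));
                    uint64_t step = left < in_group ? left : in_group;

                    if (rec->is_free)
                            group_count[group] -= step;
                    else
                            group_count[group] += step;
                    block += step;
                    left -= step;
            }
    }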
Efficiency
Adding new metadata is also worrisome from the efficiency point of view.
Performance being one of the main features of Tux3, we would like to
keep it that way. It would be rather unpleasant to learn that our
excellent performance results were only ever due to omitting essential
functionality! So the goal for group map support was to introduce nearly
zero new overhead, either CPU or media transfer. It appears as though
that was achieved. Because of the large number of groups we cover with a
single group map block, new media overhead amounts to one block per
unify cycle, compared to something like a gigabyte of write traffic per
unify under heavy load. Group count accesses were optimized by pinning
the most recently accessed group map block in memory, which avoids a
block read per access. The same technique works for the bitmap file,
though we have not applied it there so far, so in the end we will have
a net efficiency improvement.
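The pinning optimization amounts to a one block cache, roughly along
these lines; the structure and helper names are invented for
illustration, not Tux3 internals:

    /* Sketch of a one block cache for group map lookups. */

    #include <stdint.h>

    struct groupmap_cache {
            uint64_t blocknr;   /* which group map block is pinned */
            void *data;         /* pinned block contents */
    };

    extern void *read_groupmap_block(uint64_t blocknr);  /* assumed helper */

    static void *groupmap_lookup(struct groupmap_cache *cache, uint64_t blocknr)
    {
            if (!cache->data || cache->blocknr != blocknr) {
                    /* real code would also unpin the old block here */
                    cache->data = read_groupmap_block(blocknr);
                    cache->blocknr = blocknr;
            }
            return cache->data;
    }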
Scaling
For small volumes, group accounting is pure overhead with no benefits.
However, with volume coverage of 128 MB per bitmap block, a volume does
not need to grow very large before the overhead of scanning bitmap
blocks becomes significant. In the common, linear allocation case we do
not see that cost, because the next block we check is usually free;
however, as a volume ages we will scan already allocated regions more
often, and at some point bitmap scanning will dominate allocation costs.
In fact, the complexity of linear scanning is O(N^2) over N allocation
operations, because the cost of each allocation increases by a linear
amount. The entire reason for having group counts is to cap this linear
growth at some acceptable value, so that we rarely scan more than one
group per allocation. Instead, we scan the group map to find an
allocation group with a satisfactory amount of free space.
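In rough outline, and with invented names, the group map search looks
something like this: starting from a goal group, walk the two byte
counts until some group has enough free blocks to be worth scanning.

    /* Sketch of the group map search (names and layout illustrative). */

    #include <stdint.h>

    #define GROUP_BLOCKS (1u << 15)

    static int64_t find_group(const uint16_t *group_count, uint64_t groups,
                              uint64_t goal, unsigned needed)
    {
            for (uint64_t i = 0; i < groups; i++) {
                    uint64_t group = (goal + i) % groups;
                    unsigned used = group_count[group] & 0x7fff;

                    if (GROUP_BLOCKS - used >= needed)
                            return group;   /* scan this group's bitmap next */
            }
            return -1;   /* no group has enough free blocks */
    }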
The group map covers 2K bitmap blocks (256 GB of volume space) per 4K
group map block, so a volume must be very large indeed before group map
scanning overhead begins to ramp up on its own O(N^2) complexity curve.
For example, a 4 TB volume needs only 16 4K group map blocks. However,
as progress marches on, larger disks will arrive and users will create
larger Tux3 volumes. Fifteen years from now your home media server will
have a petabyte disk drive in it, which, if entirely allocated to a Tux3
volume, will need a 16 MB group map file, and Tux3 will sink to its knees
if it needs to scan a large part of it per allocation. So the time will
come when we must add yet another level to the mapping hierarchy, which
we should plan for now, because as we all know, time tends to elapse
surprisingly quickly.
Adding two more mapping levels will support O(log N) free space
searching across our full 1 Exabyte design size. If each level maps a
full block of the lower level, we have 4M 4 byte counts at the second
level, mapped by 1K 8 byte counts at the third level. Additional map
levels increase the size of a unify modestly, and per allocation CPU
cost also increases slightly. At the current Ext4 volume size limit of 16 TB we only have
64 blocks of first level counts, not enough to benefit from additional
mapping levels, so for now we will not invest implementation effort into
a design feature that adds overhead without measurable benefit. Instead,
we will take the opportunity to plan for forward compatibility.
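To show what the eventual O(log N) search would look like, here is a
rough sketch of descending a hierarchy of free space totals until a
single group is reached. The level layout, fanouts and names are
invented for illustration and are not a committed design:

    /* Sketch of hierarchical free space search. free_totals[level][i]
     * holds the total free blocks under entry i of that level;
     * fanout[level] is how many level+1 entries each entry covers.
     * Level 0 is the top, level depth-1 has one entry per group. */

    #include <stdint.h>

    static int64_t find_group_hier(const uint64_t *const *free_totals,
                                   const unsigned *fanout, int depth,
                                   unsigned top_entries, uint64_t needed)
    {
            uint64_t index = 0;             /* first entry to scan */
            uint64_t span = top_entries;    /* entries to scan at this level */

            for (int level = 0; level < depth; level++) {
                    uint64_t found = UINT64_MAX;

                    for (uint64_t i = 0; i < span; i++) {
                            if (free_totals[level][index + i] >= needed) {
                                    found = index + i;
                                    break;
                            }
                    }
                    if (found == UINT64_MAX)
                            return -1;      /* no region has enough free space */
                    if (level == depth - 1)
                            return found;   /* group number */
                    index = found * fanout[level];  /* descend to its children */
                    span = fanout[level];
            }
            return -1;
    }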
Forward Compatibility
Consider a volume that is updated by a newer Tux3 version implementing
higher level mapping layers, then remounted by an older Tux3 version
that makes changes to level one counts, then once again remounted by the
newer Tux3 version. Such a scenario is not at all unrealistic whenever
users have the option to boot different kernel versions. Forward
compatibility means that the volume does not become inconsistent
according to either the newer or older Tux3 version.
Typically, the older version needs to anticipate addition of specific
new data structures and invalidate them if found, while the newer
version must tolerate such invalidation. In this case, it is
straightforward: we reserve inode numbers for the additional levels but
do not create the inodes, and on mount we truncate any of them that are
found to exist. Later Tux3 versions that know how to use this metadata can
automatically repopulate it.
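In rough outline, with invented inode numbers and helpers, the mount
time invalidation could look like this:

    /* Sketch of invalidating reserved inodes that this filesystem
     * version does not understand, so a newer version knows their
     * contents must be rebuilt. Inode numbers and helpers are
     * invented for illustration, not Tux3 interfaces. */

    #include <stdint.h>

    struct inode;                           /* opaque for this sketch */

    extern struct inode *open_reserved_inode(uint64_t ino);  /* NULL if absent */
    extern void truncate_inode(struct inode *inode);
    extern int version_understands(uint64_t ino);

    #define FIRST_RESERVED_INO 16           /* assumed reserved range */
    #define LAST_RESERVED_INO  63

    static void invalidate_unknown_reserved(void)
    {
            for (uint64_t ino = FIRST_RESERVED_INO; ino <= LAST_RESERVED_INO; ino++) {
                    if (version_understands(ino))
                            continue;
                    struct inode *inode = open_reserved_inode(ino);
                    if (inode)
                            truncate_inode(inode);  /* newer version repopulates */
            }
    }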
While we are at it, we should do the same for all our reserved inodes,
because it costs roughly nothing and provides opportunities for forward
compatibility in areas we know nothing about today. In general, forward
compatibility simplifies adding new features without inconveniencing the
user. And forward compatibility encourages deferring big development
projects until actually needed, so we can develop with a simpler,
cleaner code base at early stages and focus our effort better. We will
therefore make a habit of identifying forward compatibility
opportunities as they arise. Directory indexing is another obvious one,
where for now we will just invalidate directory index metadata if
detected, so that future Tux3 versions fall back to linear search. This
strategy worked well back in HTree development days and taught me the
value of good forward compatibility design.
Regards,
Daniel