Tux3 Report: Meet Shardmap, the designated successor of HTree
d.phillips at partner.samsung.com
Tue Jun 18 19:31:45 PDT 2013
From time to time, one may fortunate enough to be blessed with a
discovery in computer science that succeeds at improving all four of
performance, scalability, reliability and simplicity. Of these normally
conflicting goals, simplicity is usually the most elusive. It is
therefore with considerable satisfaction that I present the results of
our recent development work in directory indexing technology, which
addresses some long-standing and vexing scalability problems exhibited
by HTree, my previous contribution to the art of directory indexing.
This new approach, Shardmap, will not only enhance Tux3 scalability, but
provides an upgrade path for Ext4 and Lustre as well. Shardmap is also
likely to be interesting for high performance database design. Best of
all, Shardmap is considerably simpler than the technology we expect it
The most interesting thing about Shardmap is that it remained
undiscovered for so long. I expect that you will agree that this is
particularly impressive, considering how obvious Shardmap is in
retrospect. I can only speculate that the reason for not seeing this
obvious solution is that we never asked the right question. The question
should have been: how do we fix this write multiplication issue? Instead
we spent ten years asking: what should be do about this cache
thrashing?. It turns out that an answer to the former is also an answer
to the latter.
Now let us proceed without further ado to a brief tour of Shardmap,
starting with the technology we expect it to replace.
The Problem with HTree
Occasionally we see LKML reports of performance issues in HTree at high
scale, usually from people running scalability benchmarks. Lustre users
have encountered these issues in real life. I always tended to shy away
from those discussions because, frankly, I did not see any satisfactory
answer, other than that HTree works perfectly well at the scale it was
designed for and at which it is normally used. Recently I did learn the
right answer: HTree is unfixable, and this is true of any media backed
B-Tree index. Let me reiterate: contrary to popular opinion, a media
backed B-Tree is an abysmally poor choice of information structure for
any randomly updated indexing load.
But how can this be, doesn't everybody use B-Trees in just this way?
Yes, and everybody is making a big mistake. Let me explain. The big
issue is write multiplication. Any index that groups entries together in
blocks will tend to have nearly every block dirty under a random update
load. How do we transfer all those dirty blocks to cache incrementally,
efficiently and atomically? We don't, it just cannot be done. In
practice, we end up writing out most index blocks multiple times due to
just a few small changes. For example, at the end of a mass update
create we may find that each block has been written hundreds of times.
Media transfer latency therefore dominates the operation.
This obvious issue somehow escaped our attention over the entire time
HTree has been in service. We have occasionally misattributed degraded
HTree performance to inode table thrashing. To be sure, thrashing at
high scale is a known problem with Tree, but it is not the biggest
problem. That would be write multiplication. To fix this, we need to
step back and adopt a completely different approach.
Dawning of the Light
I am kind of whacking myself on the forehead about this. For an entire
decade I thought that HTree could be fixed by incremental improvements
and consequently devoted considerable energy to that effort, the high
water mark of which was my PHTree post earlier this year:
The PHTree design is a respectable if uninspired piece of work that
fixes all the known issues with HTree except for write multiplication,
which I expected to be pretty easy. Far from it. The issue is
fundamental to the nature of B-Trees. Though not hitherto recognized in
the Linux file system community, academics recognized this issue some
time ago and have been busy hunting for a solution. During one of our
sushi meetings in the wilds of Mountain View, Kent Overstreet of BCache
fame pointed me at this work:
Such attempts generally fail to get anywhere close to the efficiency
levels we have become accustomed to with Ext4 and its ilk. But it got me
thinking along productive lines. (Thank you Kent!) One day the answer
just hit me like a slow rolling thunderbolt: instead of committing the
actual B-Tree to disk we should leave it dirty in cache and just log the
updates to it. This is obviously write-efficient and ACID friendly. It
is also a poor solution because it sacrifices recovery latency. In the
event of a crash we need to read the entire log to reconstruct the dirty
B-Tree, which could take several minutes. During this time, even though
the raw directory entries are immediately available, the index is
unavailable. At the scales we are considering, unindexed directory
access is roughly the same as no access at all.
Birth of Shardmap
Then a faster moving thunderbolt hit me: why not write out the log in
many small pieces, each covering a part of the hash range? That way, to
access a single uncached entry we only suffer the latency of loading one
small piece of the log. With this improvement, the index becomes
available immediately on restart.
Eureka! Shardmap was born in the form of a secondary hash index on a
B-Tree that would kick in under heavy load, and be merged back into the
primary B-Tree when the load eases up. Shardmap would be a "frontend"
index on the B-Tree, an concept similar to others we have already
employed in Tux3. This all seemed like a great idea and I immediately
began to implement a prototype to get an idea of performance.
A few days into my prototype I was hit by a third and final blinding
flash when I noticed that the primary B-Tree index is not needed at all:
the in-cache hash tables together with the hash shards logged to media
constitute a perfectly fine index all on their own. In fact, when I
investigated further, I found this arrangement to be superior to B-Tree
approaches in practically every way. (B-Trees still manage to eke out an
advantage in certain light out of cache loads, which I will not detail
here.) Shardmap was suddenly elevated from its humble status as
temporary secondary index to the glorious role of saving all of us from
HTree's scalability issues. Probably.
Shardmap is a simple and obvious idea, but is that not always the case
in retrospect? BSD folks came very close to discovering the same thing
back around the time I was inventing HTree:
At the time I recognized the hash table approach as promising, but
deeply flawed because of its need to load an entire directory just to
service a single access. Accordingly, HTree continued to be developed
and went on to rule the world.
I wish I had thought about that harder. All that remained to do to fix
the BSD Dirhash approach was log the hash table to media efficiently and
the result would have been Shardmap, many years ago. Well. Better late
The remainder of this post takes a deep dive into the proposition that
Shardmap is a suitable replacement for HTree, potentially suitable for
use in Tux3, Ext4, and Lustre.
A Shardmap index consists of a scalable number of index shards, starting
at one for the smallest indexed directories and increasing to 4096 for a
billion file directory. The maximum size of each shard also increases
over this range from 64K to four megabytes. Each shard contains index
entries for some subset of the hash key range. Each shard entry maps a
hash key to the logical block number of a directory entry block known to
contain a name that hashes to that key.
Each shard is represented as an unsorted fifo on disk and a small hash
table in memory. To search for a name, we use some high bits of the hash
to look up a cached shard hash in memory. If the shard is not hashed in
cache, then we load the corresponding shard fifo from media and convert
it to into a hash table. We walk this list of hash collisions, searching
each referenced directory block for a match on the name.
Clearly, hash collisions must be rare in order to avoid searching
multiple, potentially out of cache directory entry blocks. Therefore, we
compute and store our hash keys at a higher precision than our targeted
As a directory grows, we scale the shardmap in two ways: 1) Rehash a
cached shard to a larger number of hash buckets and 2) Reshard a stored
shard fifo to divide it into multiple, smaller shards. These operations
are staggered to avoid latency spikes. The reshard operation imposes a
modest degree of write multiplication on the Shardmap design,
asymptotically approaching a factor of two. This is far better than the
factor of several hundred we see with HTree.
The key ideas of Shardmap are: 1) the representation of directory data
is not the same on media as it is in cache. On media we have fifos, but
in cache we have hash tables. 2) Updating a fifo is cache efficient.
Only the tail block of the fifo needs to be present in cache. The cache
footprint of the media image of a shardmap is therefore just one disk
block per shard. 3) A small fifo on media is easily loaded and converted
to an efficient hash table shard on demand. Once in cache, index updates
are performed by updating the cached hash table and appending the same
entries to the final block of the shard fifo.
To record deletes durably, Shardmap appends negative fifo entries to
shard fifos. From time to time, a shard fifo containing delete entries
will be compacted and rewritten in its entirety to prevent unbounded
growth. This cleanup operation actually turns out to be the most complex
bit of code in Shardmap, and it is not very complex.
Like any other kind of Tux3 update, Shardmap updates are required to be
ACID. The Tux3 block forking mechanism makes this easy: each delta
transition effectively makes all dirty blocks of a directory read-only.
While transferring a previous delta to disk, directory entries may be
created in or removed from directory entry blocks on their way to disk,
so those blocks will be forked. The tail block of a shard fifo may also
be forked, and so may directory entry free maps. Under a pure create or
delete load, the additional cache load caused by page forking will
normally not be much more than the tail blocks of shard fifos. Other
loads will perform about as you would expect them to.
The cache footprint of an actively updated Shardmap index is necessarily
the entire index, true not only of Shardmap, but any randomly updated
index that groups entries together in blocks. If we contemplate running
a create test on a billion files, we must provide enough cache to
accommodate the entire index or we will thrash, it is as simple as that.
For a billion files we will need about eight gigabytes of cache. That is
actually not too bad, and reflects Shardmap's compact hash table design.
Less cache than that will force more disk accesses, and the test will
consequently run slower but correctly.
Shardmap's appetite for shard cache grows to extreme levels as
directories grow to billions of files. This is not unexpected, however
it does mean that we need to take some special care with our kernel
cache design. A shard hashtable in kernel will be an expanding array of
pages just as it is in userspace, however we will not have the virtual
memory hardware available to help us out here. On 32 bit hardware we
will be using highmem for the cache, with attendant inefficient
kmap/unmap operations and that will suck, but it will still work better
than HTree for gigantic directories. For smaller directories we can
adopt some other cache management strategy in order to avoid performance
Comparison to HTree
Like HTree, Shardmap uses tables of fixed size keys that are hashes of
names in order to rapidly locate some directory block containing
standard directory entries.
Unlike HTree, Shardmap has one index entry per directory entry. HTree
uses one index entry per directory entry block, and is unique in that
regard. This is one of the key design details that has made HTree so
hard to beat all this time. It also sounds like a big advantage over
Shardmap in terms of index size, but actually it is a tie because of
slack space normally found in B-Tree blocks.
Like HTree, a Shardmap index is mapped logically into directory file
data blocks (logical metadata in Tux3 parlance). This takes advantage of
the physical/logical layering of classic Unix filesystem design. In
concrete terms, it reuses the same cache operations for the index as are
already implemented for ordinary files and avoids complicating the Tux3
physical layer at which files and the inode table are defined. It also
provides a degree of modularity that is aesthetically pleasing in itself.
Unlike HTree, a Shardmap index is not interleaved with directory data.
We place the index data strictly above the directory entry blocks
because the index is relatively larger than an HTree index, roughly 20%
of the directory file. At first we intended to place the Shardmap index
at a very high logical address, but that plan as discarded when we
noticed that this adds several levels to the page cache radix tree,
which slows down all directory block accesses significantly (we measured
6 nanoseconds per radix tree level).
Our refined layout scheme places the Shardmap index a short distance
above the currently used directory blocks and relocates it higher as the
directory expands, so we incur about the same radix tree overhead as
would be required by a simple unindexed directory. This relocation
does not impose new overhead because we already must relocate the index
shards as a directory expands, to break them up and limit reload latency.
Unlike HTree, Shardmap needs to keep track of holes in directory entry
blocks created by deletions. HTree finesses this detail away by creating
each new entry at a particular place in the btree corresponding to its
hash; deleted entry records are simply forgotten in the hope that they
will eventually be compressed away during a block split operation.
Shardmap employs a free record map for this purpose, with one byte per
directory entry block that indicates the largest directory record
available in the directory block. To avoid churning this map, it is
updated lazily - the actual largest free record is never larger than the
size stored in the map, but may be smaller. If so, a failed search for
free space in the block will update the free map entry to the actual
largest size. Conversely, on delete the map is updated to be an
overestimate of the largest record size. The intention of these
heuristics is to reduce the amount of accounting data that needs to be
written out under sparse update loads.
At scales of billions of entries, even scanning a byte per directory
entry block is too expensive, so Shardmap goes on to map the free map
with an additional map layer where each byte gives the size of the
largest free directory record covered by the associated free map block.
Three levels of this structure maps 2**(12*3) = 2**36 directory blocks,
which should be sufficient for the foreseeable future.
Shardmap uses the elegant Siphash general purpose hash function
obligingly provided by Aumasson and Bernstein:
Siphash is a wonderful creation that is easily parametrized in terms of
mixing rounds to the degree required by the application. Shardmap is not
as demanding in this regard as HTree, which does not implement delete
coalescing and is therefore sensitive to slight hash distribution
anomalies. Shardmap is able to tolerate relatively fewer mixing rounds
in the interest of higher performance.
Siphash as implemented by Shardmap uses 2 rounds per 8 bytes versus
HTree's default Halfmd4 which uses 24 more complex rounds per 32 bytes,
a significant performance win for Shardmap:
Further, the Siphash dispersal pattern is specifically optimized for
hash table applications. Our recommendation is that existing HTree users
add Siphash as the default hash function in the interest of improved
With the Shardmap userspace prototype we were able to obtain some early
performance numbers, which I will describe in general terms with precise
details to follow. Roughly speaking, both create and delete run at
around 150 nanoseconds per operation, and do not appear to degrade
significantly as directory size increases into the hundreds of millions.
As expected, lookup operations are faster, roughly 100 nanoseconds each.
This is with a modestly specced workstation and a consumer grade SSD, so
spinning disk seek performance is not yet being measured.
In general, performance numbers obtained so far match HTree at modest
scale, definitively dominate at higher scales in the millions of files,
and continue on up into scales orders of magnitude higher than HTree can
Directory traversal is in file order for Shardmap, compared to hash
order for HTree, which entails a computationally intensive procedure of
sorting and caching intermediate search results required for correct
directory cookie semantics, a clear win for Shardmap. The effect of
avoiding HTree's inode table thrashing behaviour has not yet been
measured (this must wait for a kernel implementation) however it is safe
to assume that this will be a clear win for Shardmap as well.
A nearly complete userspace prototype including unit tests is now
available in the Tux3 repository:
We encourage interested observers to compile and run this standalone
code in order to verify our performance claims. It is also worth seeing
for yourself just how simple this code is.
The prototype currently lacks the following two important elements:
* The reshard operation. So far not tested, so we cannot speak to
the performance impact of reshard just yet, other than from
estimates that indicate it is small.
* Free record mapping. Likewise, this will add some additional,
modest update overhead that we have not yet measured, only
To be useful, Shardmap needs a kernel implementation, which we plan to
defer until after merging with mainline. Scalable directories are
certainly nice, but not essential for evaluating the fitness of Tux3 as
a filesystem in the context of personal or light server use. I will
comment further on the design of the kernel implementation in a later post.
Shardmap is a new directory index design created expressly for the Tux3
filesystem, but holds promise in application areas beyond that as an
upgrade to existing filesystems and probably also being applicable to
database design. Shardmap arguably constitutes a breakthrough in
performance, scalability and simplicity in the arcane field of directory
indexing, solving a number of well known and notoriously difficult
performance problems as it does.
We expect a kernel implementation of Shardmap to land some time in the
next few months. In the mean time, we are now satisfied with this key
aspect of the Tux3 design and will turn our attention back to the
handful of remaining issues that need to be addressed before offering
Tux3 for mainline kernel merge.
 This arguably indicates a flaw in the classic radix tree design,
possibly correctable to work better with sparse files.
 Maybe we should allow kernel modules to own and operate their own
virtual address tables, that is another story.
More information about the Tux3