Tux3 Report: Meet Shardmap, the designated successor of HTree

Tue Jun 18 19:31:45 PDT 2013

Greetings all,

 From time to time, one may fortunate enough to be blessed with a 
discovery in computer science that succeeds at improving all four of 
performance, scalability, reliability and simplicity. Of these normally 
conflicting goals, simplicity is usually the most elusive. It is 
therefore with considerable satisfaction that I present the results of 
our recent development work in directory indexing technology, which 
addresses some long-standing and vexing scalability problems exhibited 
by HTree, my previous contribution to the art of directory indexing. 
This new approach, Shardmap, will not only enhance Tux3 scalability, but 
provides an upgrade path for Ext4 and Lustre as well. Shardmap is also 
likely to be interesting for high performance database design. Best of 
all, Shardmap is considerably simpler than the technology we expect it 
to replace.

The most interesting thing about Shardmap is that it remained 
undiscovered for so long. I expect that you will agree that this is 
particularly impressive, considering how obvious Shardmap is in 
retrospect. I can only speculate that the reason for not seeing this 
obvious solution is that we never asked the right question. The question 
should have been: how do we fix this write multiplication issue? Instead 
we spent ten years asking: what should be do about this cache 
thrashing?. It turns out that an answer to the former is also an answer 
to the latter.

Now let us proceed without further ado to a brief tour of Shardmap, 
starting with the technology we expect it to replace.

The Problem with HTree

Occasionally we see LKML reports of performance issues in HTree at high 
scale, usually from people running scalability benchmarks. Lustre users 
have encountered these issues in real life. I always tended to shy away 
from those discussions because, frankly, I did not see any satisfactory 
answer, other than that HTree works perfectly well at the scale it was 
designed for and at which it is normally used. Recently I did learn the 
right answer: HTree is unfixable, and this is true of any media backed 
B-Tree index. Let me reiterate: contrary to popular opinion, a media 
backed B-Tree is an abysmally poor choice of information structure for 
any randomly updated indexing load.

But how can this be, doesn't everybody use B-Trees in just this way? 
Yes, and everybody is making a big mistake. Let me explain. The big 
issue is write multiplication. Any index that groups entries together in 
blocks will tend to have nearly every block dirty under a random update 
load. How do we transfer all those dirty blocks to cache incrementally, 
efficiently and atomically? We don't, it just cannot be done. In 
practice, we end up writing out most index blocks multiple times due to 
just a few small changes. For example, at the end of a mass update 
create we may find that each block has been written hundreds of times. 
Media transfer latency therefore dominates the operation.

This obvious issue somehow escaped our attention over the entire time 
HTree has been in service. We have occasionally misattributed degraded 
HTree performance to inode table thrashing. To be sure, thrashing at 
high scale is a known problem with Tree, but it is not the biggest 
problem. That would be write multiplication. To fix this, we need to 
step back and adopt a completely different approach.

Dawning of the Light

I am kind of whacking myself on the forehead about this. For an entire 
decade I thought that HTree could be fixed by incremental improvements 
and consequently devoted considerable energy to that effort, the high 
water mark of which was my PHTree post earlier this year:

    http://phunq.net/pipermail/tux3/2013-January/000026.html

The PHTree design is a respectable if uninspired piece of work that 
fixes all the known issues with HTree except for write multiplication, 
which I expected to be pretty easy. Far from it. The issue is 
fundamental to the nature of B-Trees. Though not hitherto recognized in 
the Linux file system community, academics recognized this issue some 
time ago and have been busy hunting for a solution. During one of our 
sushi meetings in the wilds of Mountain View, Kent Overstreet of BCache 
fame pointed me at this work:

    http://www.tokutek.com/2012/12/fractal-tree-indexing-overview/

Such attempts generally fail to get anywhere close to the efficiency 
levels we have become accustomed to with Ext4 and its ilk. But it got me 
thinking along productive lines. (Thank you Kent!) One day the answer 
just hit me like a slow rolling thunderbolt: instead of committing the 
actual B-Tree to disk we should leave it dirty in cache and just log the 
updates to it. This is obviously write-efficient and ACID friendly. It 
is also a poor solution because it sacrifices recovery latency. In the 
event of a crash we need to read the entire log to reconstruct the dirty 
B-Tree, which could take several minutes. During this time, even though 
the raw directory entries are immediately available, the index is 
unavailable. At the scales we are considering, unindexed directory 
access is roughly the same as no access at all.

Birth of Shardmap

Then a faster moving thunderbolt hit me: why not write out the log in 
many small pieces, each covering a part of the hash range? That way, to 
access a single uncached entry we only suffer the latency of loading one 
small piece of the log. With this improvement, the index becomes 
available immediately on restart.

Eureka! Shardmap was born in the form of a secondary hash index on a 
B-Tree that would kick in under heavy load, and be merged back into the 
primary B-Tree when the load eases up. Shardmap would be a "frontend" 
index on the B-Tree, an concept similar to others we have already 
employed in Tux3. This all seemed like a great idea and I immediately 
began to implement a prototype to get an idea of performance.

A few days into my prototype I was hit by a third and final blinding 
flash when I noticed that the primary B-Tree index is not needed at all: 
the in-cache hash tables together with the hash shards logged to media 
constitute a perfectly fine index all on their own. In fact, when I 
investigated further, I found this arrangement to be superior to B-Tree 
approaches in practically every way. (B-Trees still manage to eke out an 
advantage in certain light out of cache loads, which I will not detail 
here.) Shardmap was suddenly elevated from its humble status as 
temporary secondary index to the glorious role of saving all of us from 
HTree's scalability issues. Probably.

Shardmap is a simple and obvious idea, but is that not always the case 
in retrospect? BSD folks came very close to discovering the same thing 
back around the time I was inventing HTree:

    http://en.wikipedia.org/wiki/Dirhash

At the time I recognized the hash table approach as promising, but 
deeply flawed because of its need to load an entire directory just to 
service a single access. Accordingly, HTree continued to be developed 
and went on to rule the world.

I wish I had thought about that harder. All that remained to do to fix 
the BSD Dirhash approach was log the hash table to media efficiently and 
the result would have been Shardmap, many years ago. Well. Better late 
than never.

The remainder of this post takes a deep dive into the proposition that 
Shardmap is a suitable replacement for HTree, potentially suitable for 
use in Tux3, Ext4, and Lustre.

Design

A Shardmap index consists of a scalable number of index shards, starting 
at one for the smallest indexed directories and increasing to 4096 for a 
billion file directory. The maximum size of each shard also increases 
over this range from 64K to four megabytes. Each shard contains index 
entries for some subset of the hash key range. Each shard entry maps a 
hash key to the logical block number of a directory entry block known to 
contain a name that hashes to that key.

Each shard is represented as an unsorted fifo on disk and a small hash 
table in memory. To search for a name, we use some high bits of the hash 
to look up a cached shard hash in memory. If the shard is not hashed in 
cache, then we load the corresponding shard fifo from media and convert 
it to into a hash table. We walk this list of hash collisions, searching 
each referenced directory block for a match on the name.

Clearly, hash collisions must be rare in order to avoid searching 
multiple, potentially out of cache directory entry blocks. Therefore, we 
compute and store our hash keys at a higher precision than our targeted 
scalability range.

As a directory grows, we scale the shardmap in two ways: 1) Rehash a 
cached shard to a larger number of hash buckets and 2) Reshard a stored 
shard fifo to divide it into multiple, smaller shards. These operations 
are staggered to avoid latency spikes. The reshard operation imposes a 
modest degree of write multiplication on the Shardmap design, 
asymptotically approaching a factor of two. This is far better than the 
factor of several hundred we see with HTree.

The key ideas of Shardmap are: 1) the representation of directory data 
is not the same on media as it is in cache. On media we have fifos, but 
in cache we have hash tables. 2) Updating a fifo is cache efficient. 
Only the tail block of the fifo needs to be present in cache. The cache 
footprint of the media image of a shardmap is therefore just one disk 
block per shard. 3) A small fifo on media is easily loaded and converted 
to an efficient hash table shard on demand. Once in cache, index updates 
are performed by updating the cached hash table and appending the same 
entries to the final block of the shard fifo.

To record deletes durably, Shardmap appends negative fifo entries to 
shard fifos. From time to time, a shard fifo containing delete entries 
will be compacted and rewritten in its entirety to prevent unbounded 
growth. This cleanup operation actually turns out to be the most complex 
bit of code in Shardmap, and it is not very complex.

Like any other kind of Tux3 update, Shardmap updates are required to be 
ACID. The Tux3 block forking mechanism makes this easy: each delta 
transition effectively makes all dirty blocks of a directory read-only. 
While transferring a previous delta to disk, directory entries may be 
created in or removed from directory entry blocks on their way to disk, 
so those blocks will be forked. The tail block of a shard fifo may also 
be forked, and so may directory entry free maps. Under a pure create or 
delete load, the additional cache load caused by page forking will 
normally not be much more than the tail blocks of shard fifos. Other 
loads will perform about as you would expect them to.

The cache footprint of an actively updated Shardmap index is necessarily 
the entire index, true not only of Shardmap, but any randomly updated 
index that groups entries together in blocks. If we contemplate running 
a create test on a billion files, we must provide enough cache to 
accommodate the entire index or we will thrash, it is as simple as that. 
For a billion files we will need about eight gigabytes of cache. That is 
actually not too bad, and reflects Shardmap's compact hash table design. 
Less cache than that will force more disk accesses, and the test will 
consequently run slower but correctly.

Shardmap's appetite for shard cache grows to extreme levels as 
directories grow to billions of files. This is not unexpected, however 
it does mean that we need to take some special care with our kernel 
cache design. A shard hashtable in kernel will be an expanding array of 
pages just as it is in userspace, however we will not have the virtual 
memory hardware available to help us out here[2]. On 32 bit hardware we 
will be using highmem for the cache, with attendant inefficient 
kmap/unmap operations and that will suck, but it will still work better 
than HTree for gigantic directories. For smaller directories we can 
adopt some other cache management strategy in order to avoid performance 
regressions.

Comparison to HTree

Like HTree, Shardmap uses tables of fixed size keys that are hashes of 
names in order to rapidly locate some directory block containing 
standard directory entries.

Unlike HTree, Shardmap has one index entry per directory entry. HTree 
uses one index entry per directory entry block, and is unique in that 
regard. This is one of the key design details that has made HTree so 
hard to beat all this time. It also sounds like a big advantage over 
Shardmap in terms of index size, but actually it is a tie because of 
slack space normally found in B-Tree blocks.

Like HTree, a Shardmap index is mapped logically into directory file 
data blocks (logical metadata in Tux3 parlance). This takes advantage of 
the physical/logical layering of classic Unix filesystem design. In 
concrete terms, it reuses the same cache operations for the index as are 
already implemented for ordinary files and avoids complicating the Tux3 
physical layer at which files and the inode table are defined. It also 
provides a degree of modularity that is aesthetically pleasing in itself.

Unlike HTree, a Shardmap index is not interleaved with directory data. 
We place the index data strictly above the directory entry blocks 
because the index is relatively larger than an HTree index, roughly 20% 
of the directory file. At first we intended to place the Shardmap index 
at a very high logical address, but that plan as discarded when we 
noticed that this adds several levels to the page cache radix tree, 
which slows down all directory block accesses significantly (we measured 
6 nanoseconds per radix tree level).

Our refined layout scheme places the Shardmap index a short distance 
above the currently used directory blocks and relocates it higher as the 
directory expands, so we incur about the same radix tree overhead as 
would be required by a simple unindexed directory[1]. This relocation 
does not impose new overhead because we already must relocate the index 
shards as a directory expands, to break them up and limit reload latency.

Unlike HTree, Shardmap needs to keep track of holes in directory entry 
blocks created by deletions. HTree finesses this detail away by creating 
each new entry at a particular place in the btree corresponding to its 
hash; deleted entry records are simply forgotten in the hope that they 
will eventually be compressed away during a block split operation.

Shardmap employs a free record map for this purpose, with one byte per
directory entry block that indicates the largest directory record 
available in the directory block. To avoid churning this map, it is 
updated lazily - the actual largest free record is never larger than the 
size stored in the map, but may be smaller. If so, a failed search for 
free space in the block will update the free map entry to the actual 
largest size. Conversely, on delete the map is updated to be an 
overestimate of the largest record size. The intention of these
heuristics is to reduce the amount of accounting data that needs to be 
written out under sparse update loads.

At scales of billions of entries, even scanning a byte per directory 
entry block is too expensive, so Shardmap goes on to map the free map 
with an additional map layer where each byte gives the size of the 
largest free directory record covered by the associated free map block. 
Three levels of this structure maps 2**(12*3) = 2**36 directory blocks, 
which should be sufficient for the foreseeable future.

Shardmap uses the elegant Siphash general purpose hash function 
obligingly provided by Aumasson and Bernstein:

    https://131002.net/siphash/

Siphash is a wonderful creation that is easily parametrized in terms of 
mixing rounds to the degree required by the application. Shardmap is not 
as demanding in this regard as HTree, which does not implement delete 
coalescing and is therefore sensitive to slight hash distribution 
anomalies. Shardmap is able to tolerate relatively fewer mixing rounds 
in the interest of higher performance.

Siphash as implemented by Shardmap uses 2 rounds per 8 bytes versus 
HTree's default Halfmd4 which uses 24 more complex rounds per 32 bytes, 
a significant performance win for Shardmap:

    http://lxr.free-electrons.com/source/lib/halfmd4.c#L25

https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c#L60

Further, the Siphash dispersal pattern is specifically optimized for 
hash table applications. Our recommendation is that existing HTree users 
add Siphash as the default hash function in the interest of improved 
performance.

Performance

With the Shardmap userspace prototype we were able to obtain some early
performance numbers, which I will describe in general terms with precise 
details to follow. Roughly speaking, both create and delete run at 
around 150 nanoseconds per operation, and do not appear to degrade 
significantly as directory size increases into the hundreds of millions. 
As expected, lookup operations are faster, roughly 100 nanoseconds each. 
This is with a modestly specced workstation and a consumer grade SSD, so 
spinning disk seek performance is not yet being measured.

In general, performance numbers obtained so far match HTree at modest 
scale, definitively dominate at higher scales in the millions of files, 
and continue on up into scales orders of magnitude higher than HTree can 
even attempt.

Directory traversal is in file order for Shardmap, compared to hash 
order for HTree, which entails a computationally intensive procedure of 
sorting and caching intermediate search results required for correct 
directory cookie semantics, a clear win for Shardmap. The effect of 
avoiding HTree's inode table thrashing behaviour has not yet been 
measured (this must wait for a kernel implementation) however it is safe 
to assume that this will be a clear win for Shardmap as well.

Implementation

A nearly complete userspace prototype including unit tests is now 
available in the Tux3 repository:

    https://github.com/OGAWAHirofumi/tux3/blob/master/user/devel/shard.c

We encourage interested observers to compile and run this standalone 
code in order to verify our performance claims. It is also worth seeing 
for yourself just how simple this code is.

The prototype currently lacks the following two important elements:

    * The reshard operation. So far not tested, so we cannot speak to
      the performance impact of reshard just yet, other than from
      estimates that indicate it is small.

    * Free record mapping. Likewise, this will add some additional,
      modest update overhead that we have not yet measured, only
      estimated.

To be useful, Shardmap needs a kernel implementation, which we plan to 
defer until after merging with mainline. Scalable directories are 
certainly nice, but not essential for evaluating the fitness of Tux3 as 
a filesystem in the context of personal or light server use. I will 
comment further on the design of the kernel implementation in a later post.

Summary

Shardmap is a new directory index design created expressly for the Tux3
filesystem, but holds promise in application areas beyond that as an 
upgrade to existing filesystems and probably also being applicable to 
database design. Shardmap arguably constitutes a breakthrough in 
performance, scalability and simplicity in the arcane field of directory 
indexing, solving a number of well known and notoriously difficult 
performance problems as it does.

We expect a kernel implementation of Shardmap to land some time in the 
next few months. In the mean time, we are now satisfied with this key 
aspect of the Tux3 design and will turn our attention back to the 
handful of remaining issues that need to be addressed before offering 
Tux3 for mainline kernel merge.

Regards,

Daniel

[1] This arguably indicates a flaw in the classic radix tree design, 
possibly correctable to work better with sparse files.

[2] Maybe we should allow kernel modules to own and operate their own 
virtual address tables, that is another story.