Tux3 update - Shardmap
daniel at phunq.net
Mon Oct 9 22:57:38 PDT 2017

This is just a quick note to let everybody know that Tux3 development is
active once again. There are a small number of areas we need to focus
on, to make Tux3 ready for general (experimental!) use. One of those is
directory scaling, that is, Shardmap.

Shardmap was conceived as a new-generation replacement for HTree, and as
such, is aimed not just at Tux3, but Ext4 too. After all these years,
HTree still remains arguably the best-performing directory index for
moderate loads up to a few million files, but from there on it tends to
hit thrashing issues. Shardmap is designed to fix those and scale up
into the billions of files, while exhibiting similar performance at
moderate scales.

Shardmap is a scalable hash table design, as opposed to HTree, which is
a hash-indexed btree. HTree thrashing comes from two main sources:

1) Traversing a directory in hash order causes the inode table to be
traversed in random order, so at high scale the working set tends to
become the entire inode table.

2) A basic issue with all btrees: a heavy random update load (e.g.,
creates) will dirty almost every btree leaf, forcing almost the entire
index to be written per commit.

HTree stores directory entries in the btree leaves, with no separate
index structure. That is one of the things that makes it really fast.
Shardmap breaks away from this by storing directory entries in
Ext2-style directory entry blocks, and has a separate index higher up in
the directory file.

To make this work well, the Shardmap index is very compact, just 8 bytes
per directory entry. HTree has far less index overhead than that,
approaching zero bytes per directory entry (!) but it also has a
significant amount of empty space in each leaf block, about 25%, due to
its btree splitting strategy. So actually, Shardmap directories should
turn out to be about the same total size as HTree. But Shardmap has a
huge advantage because of storing its entries roughly in creation order:
traversing the directory via readdir tends to access the inode table in
storage order. The goal here is to have multiple cache hits to the same
inode table block, close together in time. This is about as
cache-friendly as you can get.

To address the second issue, Shardmap updates its hash shards using a
logging style, per shard. That is, it only dirties the tail block of
each shard. Compared to a btree, that would be a small fraction of dirty
blocks per commit under heavy create/update load. Shardmap's dirty cache
footprint therefore promises to be less than 1% of HTree's.

Shardmap's in-cache hash lookup should perform nearly identically to
HTree's binary search, with similarly low CPU consumption.

Shardmap uses a fundamentally different approach to cache than HTree.
With HTree, directory blocks are simply mapped into page cache, where
both reads and updates are done. Shardmap splits the cache objects into
two parts: in-memory shards, each of which is a small hash table, and
media shards, which are append-only regions of the directory file, from
which in-memory shards can be reconstructed as needed. So Shardmap uses
one format that is well suited to media, in combination with another
format that is well suited to memory. As a consequence, we must manage
eviction of shards explicitly rather than relying on the default page
cache eviction mechanism as HTree does. Fortunately, this is a
well-known kernel meme - no new ground to break here!

One of the things that makes HTree simple is that it does not need to
implement free space management within a directory file. HTree always
knows where new entries will be created: in the leaf block indexed by
the name hash. And HTree is grow-only - a directory file never shrinks.
In practice, this has not proved to be a problem, because large
directories tend to stay large, and the user has at least one effective
way to shrink a directory: just delete it. Shardmap, too, will be
grow-only, but unlike HTree, it must track free space in order to reuse
empty records in directory blocks left there by deletes. It turns out
that a very simple free space management technique is possible, which I
will not detail here, except to say that the space overhead is less than
0.1% of the directory size. A special shout-out here goes to Ted Ts'o for
hating my initial free space management proposal, an opinion he
expressed during a noisy party at LinuxCon New Orleans a few years back.
Ted was right, and I accordingly came up with something simpler, more
efficient, and friendlier to Ext4's way of doing things.

As opposed to HTree, which loads its index into cache one directory
block at a time, Shardmap will load a full hash shard per cache miss,
typically 2-4MB. Is that good or bad? On spinning disk, reading 4MB
instead of 4KB costs less than the associated seek, so it is clearly a
good idea, but what about SSD? My tentative conclusion is: this
resembles read-ahead, in that loads exist where the extra blocks read
are just pure overhead, but on average, at least some of them will be
accessed in the near future. So with random reads in a cold Shardmap
directory, the index cache will become fully populated with fewer read
operations than HTree, by a factor of 100 or so.

Like HTree, Shardmap scales down well. A directory where all entries fit
in a single block does not have any index at all, and consumes just a
single block on disk. When Shardmap overflows one block, the directory
becomes a sparse file a few hundred blocks in size, but with only 4
blocks actually allocated. This is only one block more than HTree. The
additional block is the free space map. Past that, Shardmap and HTree
directories will have similar sizes, although the Shardmap directory
size will appear to be larger because the Shardmap index is stored at a
high logical offset in the sparse directory file. The directory entry
blocks themselves are stored in the bottom part of the directory file,
much like a traditional Ext2 directory, which makes for fast,
cache-friendly full directory operations such as ls and rm -r. By
storing and traversing directory entries in a strictly stable order,
Shardmap avoids some very arcane things we had to do to implement POSIX
readdir semantics accurately in HTree. As a result, Shardmap's finished
implementation will end up shorter and simpler than HTree's, with
obviously correct POSIX semantics.

Shardmap was prototyped about five years ago and its basic
characteristics were assessed, with encouraging results. Since then, it
has just been
hibernating as a prototype in the Tux3 repository. Now, active work has
resumed, and remaining design details have been settled. I hope to
report fairly soon that the prototype has progressed to a kernel
implementation with performance measurements. With a proper directory
index, we will get rid of O(n**2) metadata performance at high scale,
and it will be time to present some new benchmarks.