Tux3 update - Shardmap
daniel at phunq.net
Mon Oct 9 22:57:38 PDT 2017

This is just a quick note to let everybody know that Tux3 development is
active once again. There are a small number of areas we need to focus
on, to make Tux3 ready for general (experimental!) use. One of those is
directory scaling, that is, Shardmap.

Shardmap was conceived as a new-generation replacement for HTree, and as
such, is aimed not just at Tux3, but Ext4 too. After all these years,
HTree still remains arguably the best-performing directory index for
moderate loads up to a few million files, but from there on it tends to
hit thrashing issues. Shardmap is designed to fix those and scale up
into the billions of files, while exhibiting similar performance at
moderate scales.

Shardmap is a scalable hash table design, as opposed to HTree, which is
a hash-indexed btree. HTree thrashing comes from two main sources:

1) Traversing a directory in hash order causes the inode table to be
traversed in random order, so at high scale the working set tends to
become the entire inode table.

2) A basic issue with all btrees: a heavy random update load (e.g.,
creates) will dirty almost every btree leaf, forcing almost the entire
index to be written per commit.

HTree stores directory entries in the btree leaves, with no separate
index structure. That is one of the things that makes it really fast.
Shardmap breaks away from this by storing directory entries in
Ext2-style directory entry blocks, and has a separate index higher up in
the directory file.

To make this work well, the Shardmap index is very compact, just 8 bytes
per directory entry. HTree has far less index overhead than that,
approaching zero bytes per directory entry (!) but it also has a
significant amount of empty space in each leaf block, about 25%, due to
its btree splitting strategy. So actually, Shardmap directories should
turn out to be about the same total size as HTree. But Shardmap has a
huge advantage because of storing its entries roughly in creation order:
traversing the directory via readdir tends to access the inode table in
storage order. The goal here is to have multiple cache hits to the same
inode table block, close together in time. This is about as
cache-friendly as you can get.

To address the second issue, Shardmap updates its hash shards using a
logging style, per shard. That is, it only dirties the tail block of
each shard. Compared to a btree, that would be a small fraction of dirty
blocks per commit under heavy create/update load. Shardmap's dirty cache
footprint therefore promises to be less than 1% of HTree's.

Shardmap's in-cache hash lookup should perform nearly identically to
HTree's binary search, with similarly low CPU consumption.

Shardmap uses a fundamentally different approach to cache than HTree.
With HTree, directory blocks are simply mapped into page cache, where
both reads and updates are done. Shardmap splits the cache objects into
two parts: in-memory shards, each of which is a small hash table, and
media shards, which are append-only regions of the directory file, from
which in-memory shards can be reconstructed as needed. So Shardmap uses
one format that is well suited to media, in combination with another
format that is well suited to memory. As a consequence, we must manage
eviction of shards explicitly rather than relying on the default page
cache eviction mechanism as HTree does. Fortunately, this is a
well-known kernel meme - no new ground to break here!

One of the things that makes HTree simple is that it does not need to
implement free space management within a directory file. HTree always
knows where new entries will be created: in the leaf block indexed by
the name hash. And HTree is grow-only - a directory file never shrinks.
In practice, this has not proved to be a problem, because large
directories tend to stay large, and the user has at least one effective
way to shrink a directory: just delete it. Shardmap, too, will be
grow-only, but unlike HTree, it must track free space in order to reuse
empty records in directory blocks left there by deletes. It turns out
that a very simple free space management technique is possible, which I
will not detail here, except to say that the space overhead is less than
0.1% of the directory size. A special shout-out here goes to Ted Ts'o for
hating my initial free space management proposal, an opinion he
expressed during a noisy party at LinuxCon New Orleans a few years back.
Ted was right, and I accordingly came up with something simpler, more
efficient, and friendlier to Ext4's way of doing things.

As opposed to HTree, which loads its index into cache one directory
block at a time, Shardmap will load a full hash shard per cache miss,
typically 2-4MB. Is that good or bad? On spinning disk, reading 4MB
instead of 4KB costs less than the associated seek, so it is clearly a
good idea, but what about SSD? My tentative conclusion is: this
resembles read-ahead, in that loads exist where the extra blocks read
are just pure overhead, but on average, at least some of them will be
accessed in the near future. So with random reads in a cold Shardmap
directory, the index cache will become fully populated with fewer read
operations than HTree, by a factor of 100 or so.

Like HTree, Shardmap scales down well. A directory where all entries fit
in a single block does not have any index at all, and consumes just a
single block on disk. When Shardmap overflows one block, the directory
becomes a sparse file a few hundred blocks in size, but with only 4
blocks actually allocated. This is only one block more than HTree. The
additional block is the free space map. Past that, Shardmap and HTree
directories will have similar sizes, although the Shardmap directory
size will appear to be larger because the Shardmap index is stored at a
high logical offset in the sparse directory file. The directory entry
blocks themselves are stored in the bottom part of the directory file,
much like a traditional Ext2 directory, which makes for fast,
cache-friendly full directory operations such as ls and rm -r. By
storing and traversing directory entries in a strictly stable order,
Shardmap avoids some very arcane things we had to do to implement POSIX
readdir semantics accurately in HTree. As a result, Shardmap's finished
implementation will end up shorter and simpler than HTree's, with
obviously correct POSIX semantics.

Shardmap was prototyped about five years ago and its basic
characteristics were assessed, with encouraging results. Since then, it
has just been
hibernating as a prototype in the Tux3 repository. Now, active work has
resumed, and remaining design details have been settled. I hope to
report fairly soon that the prototype has progressed to a kernel
implementation with performance measurements. With a proper directory
index, we will get rid of O(n**2) metadata performance at high scale,
and it will be time to present some new benchmarks.