Tux3 update - Shardmap

Raymond Jennings shentino at gmail.com
Tue Oct 10 01:04:39 PDT 2017

*a big series of whooping cheers*

Please please please oh please get this into the kernel ASAP.  I'm
very eager to migrate my system over.

Honestly, the WAFL-style approach and especially the atime tree are the most
promising parts, and performance is apparently a big win.  Plus it's a
next-generation filesystem that I would very much like to see take the place
of btrfs.

That, and the separation between front and back ends, really strikes me as a
huge performance boost.

*bangs pom-poms together*

Keep going keep going I really really wanna use this in practice.

On Mon, Oct 9, 2017 at 10:57 PM, Daniel Phillips <daniel at phunq.net> wrote:
> Hi all,
>
> This is just a quick note to let everybody know that Tux3 development is
> active once again. There are a small number of areas we need to focus on, to
> make Tux3 ready for general (experimental!) use. One of those is directory
> scaling, that is, Shardmap.
>
> Shardmap was conceived as a new-generation replacement for HTree, and as
> such, is aimed not just at Tux3, but Ext4 too. After all these years, Htree
> still remains arguably the best-performing directory index for moderate
> loads up to a few million files, but from there on it tends to hit thrashing
> issues. Shardmap is designed to fix those and scale up into the billions of
> files, while exhibiting similar performance at lower scale.
>
> Shardmap is a scalable hash table design, as opposed to HTree, which is a
> hash-indexed btree. HTree thrashing comes from two main sources:
>
>  1) Traversing a directory in hash order causes the inode table to be
>     traversed in random order, so at high scale, the working set tends to
>     become the entire inode table
>
>  2) A basic issue with all btrees: a heavy random update load (e.g.,
>     creates) will dirty almost every btree leaf, forcing almost the entire
>     index to be written per commit
>
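The first source can be made concrete with a toy simulation (a sketch with invented parameters, not Tux3 code): with inodes allocated in creation order, hash-order traversal visits inode-table blocks essentially at random, while creation-order traversal visits them sequentially and reuses each cached block.

```python
# Toy model of thrashing source 1 (illustrative only, not Tux3 code).
# Assumed parameters: 64 inodes per inode-table block, an 8-block LRU cache.
import hashlib
from collections import OrderedDict

ENTRIES_PER_BLOCK = 64
names = [f"file{i}" for i in range(10_000)]
inode_of = {name: i for i, name in enumerate(names)}  # inodes in creation order

def cache_misses(order, cache_blocks=8):
    """Count inode-table block misses for lookups done in the given order."""
    cache, misses = OrderedDict(), 0
    for name in order:
        block = inode_of[name] // ENTRIES_PER_BLOCK
        if block in cache:
            cache.move_to_end(block)        # hit: refresh LRU position
        else:
            misses += 1                     # miss: fetch the block
            cache[block] = True
            if len(cache) > cache_blocks:
                cache.popitem(last=False)   # evict least recently used
    return misses

creation_order = names
hash_order = sorted(names, key=lambda n: hashlib.md5(n.encode()).digest())

# Creation order misses once per 64 entries; hash order misses nearly always.
assert cache_misses(creation_order) == 157          # ceil(10000 / 64)
assert cache_misses(hash_order) > 10 * cache_misses(creation_order)
```

With a cache much smaller than the inode table, hash-order traversal makes nearly every lookup a miss, which is exactly the "working set becomes the entire inode table" effect described above.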
> HTree stores directory entries in the btree leaves, with no separate index
> structure. That is one of the things that makes it really fast. Shardmap
> breaks away from this by storing directory entries in Ext2-style directory
> entry blocks, and has a separate index higher up in the directory file.
>
> To make this work well, the Shardmap index is very compact, just 8 bytes per
> directory entry. HTree has far less index overhead than that, approaching
> zero bytes per directory entry (!) but it also has a significant amount of
> empty space in each leaf block, about 25%, due to its btree splitting
> strategy. So actually, Shardmap directories should turn out to be about the
> same total size as HTree. But Shardmap has a huge advantage because of
> storing its entries roughly in creation order: traversing the directory via
> readdir tends to access the inode table in storage order. The goal here is
> to have multiple cache hits to the same inode table block, close together in
> time. This is about as cache-friendly as you can get.
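As a rough illustration of how much fits in 8 bytes, here is one hypothetical way to pack a shard entry (the field split is my invention for the sketch, not the actual Shardmap on-disk format): some hash bits plus the block number of the directory entry block holding the name.

```python
# Hypothetical 8-byte shard entry (field widths invented for illustration):
# high 31 bits hold hash bits, low 33 bits hold the entry-block number.
import struct

HASH_BITS, BLOCK_BITS = 31, 33

def pack_entry(hashval, block):
    assert hashval < 2**HASH_BITS and block < 2**BLOCK_BITS
    return struct.pack("<Q", (hashval << BLOCK_BITS) | block)

def unpack_entry(raw):
    word, = struct.unpack("<Q", raw)
    return word >> BLOCK_BITS, word & (2**BLOCK_BITS - 1)

entry = pack_entry(0x12345678, 42)
assert len(entry) == 8                           # exactly 8 bytes per entry
assert unpack_entry(entry) == (0x12345678, 42)   # round-trips losslessly
```

A lookup would hash the name, probe the shard for entries with matching hash bits, and read only those directory entry blocks to resolve collisions.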
>
> To address the second issue, Shardmap updates its hash shards using a
> logging style, per shard. That is, it only dirties the tail block of each
> shard. Compared to a btree, that would be a small fraction of dirty blocks
> per commit under heavy create/update load. Shardmap's dirty cache footprint
> therefore promises to be less than 1% of HTree's.
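A minimal sketch of the logging style (a simplified model, not the Tux3 implementation): each shard appends fixed-size entries to its own log, so a burst of creates dirties only the shard's tail block.

```python
# Simplified model of per-shard logging (not Tux3 code): appends only ever
# touch the tail block, keeping the dirty set per commit tiny.
BLOCK_SIZE = 4096
ENTRY_SIZE = 8

class Shard:
    def __init__(self):
        self.blocks = [bytearray()]   # this shard's media log, block by block
        self.dirty = set()            # block indexes dirtied since last commit

    def append(self, entry):
        assert len(entry) == ENTRY_SIZE
        if len(self.blocks[-1]) + ENTRY_SIZE > BLOCK_SIZE:
            self.blocks.append(bytearray())       # tail full: new tail block
        self.blocks[-1] += entry
        self.dirty.add(len(self.blocks) - 1)      # only the tail goes dirty

shard = Shard()
for i in range(1000):
    shard.append(i.to_bytes(ENTRY_SIZE, "little"))

# 1000 creates dirtied just two blocks (the filled tail plus its successor),
# where a btree under random load could dirty a leaf per insert.
assert len(shard.dirty) == 2
```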
>
> Shardmap's in-cache hash lookup should perform nearly identically to HTree's
> binary search, with similarly low CPU consumption.
>
> Shardmap uses a fundamentally different approach to cache than HTree. With
> HTree, directory blocks are simply mapped into page cache, where both reads
> and updates are done. Shardmap splits the cache objects into two parts:
> in-memory shards, each of which is a small hash table, and media shards,
> which are append-only regions of the directory file, from which in-memory
> shards can be reconstructed as needed. So Shardmap uses one format that is
> well suited to media, in combination with another format that is well suited
> to memory. As a consequence, we must manage eviction of shards explicitly
> rather than relying on the default page cache eviction mechanism as HTree
> does. Fortunately, this is a well-known kernel pattern - no new ground to
> break here!
>
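The two-format split can be sketched as follows (again illustrative: the entry layout and helper names are invented): the media shard is an append-only byte region, and the in-memory shard is a small hash table that can be rebuilt from that region after eviction.

```python
# Illustrative two-format split (invented layout, not the Tux3 format):
# media shard = append-only region well suited to disk; in-memory shard =
# hash table rebuilt by replaying that region on a cache miss.
import struct
from collections import defaultdict

def append_media_entry(region, hashval, block):
    region += struct.pack("<II", hashval, block)   # 8 bytes per entry

def rebuild_in_memory_shard(region):
    """Replay the append-only media shard into an in-memory hash table."""
    table = defaultdict(list)
    for offset in range(0, len(region), 8):
        hashval, block = struct.unpack_from("<II", region, offset)
        table[hashval].append(block)   # keep all candidate blocks per hash
    return table

region = bytearray()
append_media_entry(region, 0xBEEF, 7)
append_media_entry(region, 0xCAFE, 3)
append_media_entry(region, 0xBEEF, 9)   # hash collision: both blocks kept

shard = rebuild_in_memory_shard(bytes(region))
assert shard[0xBEEF] == [7, 9]
assert shard[0xCAFE] == [3]
```

Because the media form is append-only and the memory form is disposable, eviction can simply drop an in-memory shard and rebuild it later, rather than writing it back.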
> One of the things that makes HTree simple is, it does not need to implement
> free space management within a directory file. HTree always knows where new
> entries will be created: in the leaf block indexed by the name hash. And
> HTree is grow-only - a directory file never shrinks. In practice, this has
> not proved to be a problem, because large directories tend to stay large,
> and the user has at least one effective way to shrink a directory: just
> delete it. Shardmap, too, will be grow-only, but unlike HTree, it must track
> free space in order to reuse empty records in directory blocks left there by
> deletes. It turns out that a very simple free space management technique is
> possible, which I will not detail here, except to say that the space
> overhead is less than 0.1% of the directory size. A special shout-out here
> goes to Ted Ts'o for hating my initial free space management proposal, an
> opinion he expressed during a noisy party at LinuxCon New Orleans a few
> years back. Ted was right, and I accordingly came up with something simpler,
> more efficient and friendlier to Ext4's way of doing things.
>
> As opposed to HTree, which loads its index into cache one directory block at
> a time, Shardmap will load a full hash shard per cache miss, typically
> 2-4MB. Is that good or bad? On spinning disk, reading 4MB instead of 4KB
> costs less than the associated seek, so it is clearly a good idea, but what
> about SSD? My tentative conclusion is: this resembles read-ahead, in that
> loads exist where the extra blocks read are just pure overhead, but on
> average, at least some of them will be accessed in the near future. So with
> random reads in a cold Shardmap directory, the index cache will become fully
> populated with fewer read operations than HTree, by a factor of 100 or so.
>
> Like HTree, Shardmap scales down well. A directory where all entries fit in
> a single block does not have any index at all, and consumes just a single
> block on disk. When Shardmap overflows one block, the directory becomes a
> sparse file a few hundred blocks in size, but with only 4 blocks actually
> allocated. This is only one block more than HTree. The additional block is
> the free space map. Past that, Shardmap and HTree directories will have
> similar sizes, although the Shardmap directory size will appear to be larger
> because the Shardmap index is stored at a high logical offset in the sparse
> directory file. The directory entry blocks themselves are stored in the
> bottom part of the directory file, much like a traditional Ext2 directory,
> which makes for fast, cache-friendly full directory operations such as ls
> and rm -r. By storing and traversing directory entries in a strictly stable
> order, Shardmap avoids some very arcane things we had to do to implement
> Posix readdir semantics accurately in HTree. As a result, Shardmap's
> finished implementation will end up shorter and simpler than HTree's, with
> obviously correct Posix semantics.
>
> Shardmap was prototyped about five years ago and basic characteristics were
> assessed, which were encouraging. Since then, it has just been hibernating
> as a prototype in the Tux3 repository. Now, active work has resumed, and
> remaining design details have been settled. I hope to report fairly soon
> that the prototype has progressed to a kernel implementation with
> performance measurements. With a proper directory index, we will get rid of
> O(n**2) metadata performance at high scale, and it will be time to present
> some new benchmarks.
>
> Regards,
> Daniel
> _______________________________________________
> Tux3 mailing list
> Tux3 at phunq.net
> http://phunq.net/mailman/listinfo/tux3
