Progress report?
Daniel Phillips
daniel at phunq.net
Wed Apr 4 21:48:18 PDT 2018
On 2018-04-03 12:30 AM, Raymond Jennings wrote:
> Are you guys close to getting merged into mainline?
>
> I think it's high time that btrfs got a healthy dose of competition
>
Hi Raymond,
For the time being we will continue to develop out-of-tree while
tracking Linus's latest mainline kernel.
Currently, I am busy fixing Tux3's lack of directory indexing, which
becomes a performance bottleneck at more than a few hundred files per
directory. We need to fix this before seriously putting Tux3 up
against other general purpose file systems.
We could have gone with a hash-keyed B-tree indexing scheme like
everybody else, but I felt we would be better off with a completely new
approach based on scalable hash tables. I actually prototyped Shardmap
back in 2012, to the point where I convinced myself that the technology
was capable of meeting or beating B-tree performance at all scales,
while not needing a huge hack to work around the basically impossible
problem of doing readdir in hash order.
Evolving that prototype into usable code has kept me busy for a few
months now. Problem number one: a hash table does not scale naturally
the way a B-tree does; instead, the entire table needs to be expanded
as the directory grows. A simple-minded implementation would impose
huge latency on whichever create happens to trigger the expand.
Instead, Shardmap expands the hash table one shard at a time, where
the latency of expanding a single shard is just a couple of
milliseconds, so the expansion appears completely smooth to the user.
The state of this incremental reshard, as I call it, needs to be
recorded in the directory file so that the reshard will continue
exactly where it left off if the directory is reopened. After some
effort, that settled down to a simple design where the index is
represented as one or two "tiers" of hash tables, depending on
whether a reshard is in progress or not.
The lower tier merges incrementally into the upper tier until it
disappears, so that the entire hash index moves higher up in the
directory file over time, making room for a nice linear array of
directory entry blocks below it.
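
To make that concrete, here is a rough userspace sketch of the
two-tier scheme. This is not the actual Shardmap code: entries are
reduced to bare hash keys, and the names (tier, shard, reshard_step
and so on) are only illustrative. The point is just that each step
migrates a single shard, and that lookup checks both tiers while the
lower one drains.

#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

struct shard { unsigned count, limit; uint64_t *keys; };
struct tier  { unsigned shards; struct shard *shard; }; /* power-of-two shard count */

struct index {
        struct tier hi, lo;     /* lo.shards == 0 when no reshard is in progress */
        unsigned migrated;      /* lower-tier shards already merged into hi */
};

static struct shard *pick(struct tier *t, uint64_t hash)
{
        return &t->shard[hash & (t->shards - 1)];
}

static void shard_add(struct shard *s, uint64_t key)
{
        if (s->count == s->limit) {
                s->limit = s->limit ? 2 * s->limit : 8;
                s->keys = realloc(s->keys, s->limit * sizeof *s->keys);
        }
        s->keys[s->count++] = key;
}

/* Migrate exactly one lower-tier shard into the upper tier.  Doing this
 * from the create path spreads the reshard over many operations, so no
 * single create pays for rebuilding the whole table. */
static void reshard_step(struct index *ix)
{
        if (!ix->lo.shards)
                return;                          /* no reshard in progress */
        struct shard *old = &ix->lo.shard[ix->migrated];
        for (unsigned i = 0; i < old->count; i++)
                shard_add(pick(&ix->hi, old->keys[i]), old->keys[i]);
        free(old->keys);
        old->keys = NULL;
        old->count = 0;
        if (++ix->migrated == ix->lo.shards) {   /* lower tier disappears */
                free(ix->lo.shard);
                ix->lo.shards = 0;
                ix->migrated = 0;
        }
}

static bool tier_contains(struct tier *t, uint64_t hash)
{
        struct shard *s = pick(t, hash);
        for (unsigned i = 0; i < s->count; i++)
                if (s->keys[i] == hash)
                        return true;
        return false;
}

/* Lookup consults the upper tier first, then the draining lower tier. */
static bool index_contains(struct index *ix, uint64_t hash)
{
        if (tier_contains(&ix->hi, hash))
                return true;
        return ix->lo.shards && tier_contains(&ix->lo, hash);
}

The worst-case latency a create can see is the cost of moving one
shard, which is exactly the "couple of milliseconds" figure above.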
This linear array of directory entry blocks is one of the main points of
Shardmap. It means that readdir can use a simple logical directory
address for readdir position, which is really the only way to comply
accurately with Posix readdir semantics that were originally defined
with simple linear directory layout in mind. Linear directory layout
also gives the fastest and most cache-efficient readdir, so you can walk
through an arbitrarily large Shardmap directory at essentially media
transfer speed. Finally, we avoid an issue that HTree has, where walking
the directory in hash order means that the inode table is accessed in
random order, causing increased cache pressure and (in the case of
delete) increased write multiplication.
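
For illustration, here is what readdir looks like over that linear
layout, again only a sketch with an invented record format rather
than the real on-disk code. The position is nothing more than
block * blocksize + offset, so seekdir and a reopened stream resume
at a stable logical address instead of a hash-order cookie.

#include <stdint.h>

/* Hypothetical on-disk entry record; not the actual Tux3 layout. */
struct dirent_rec {
        uint32_t inum;          /* 0 marks a deleted (reusable) slot */
        uint8_t  reclen;        /* bytes from this record to the next */
        uint8_t  namelen;
        char     name[];
};

/*
 * Walk entries starting at logical position pos = block * blocksize + offset.
 * emit() returns nonzero to stop early; the returned position points at the
 * next unread record, so a reopened directory stream resumes exactly where
 * it left off.  Records are assumed never to cross a block boundary.
 */
static uint64_t dir_walk(const char *blocks, unsigned nblocks, unsigned blocksize,
                         uint64_t pos, int (*emit)(const struct dirent_rec *))
{
        while (pos / blocksize < nblocks) {
                unsigned offset = pos % blocksize;
                const struct dirent_rec *rec = (const struct dirent_rec *)
                        (blocks + (pos / blocksize) * blocksize + offset);
                if (blocksize - offset < sizeof *rec || !rec->reclen) {
                        pos = (pos / blocksize + 1) * blocksize;  /* next block */
                        continue;
                }
                if (rec->inum && emit(rec))
                        return pos + rec->reclen;
                pos += rec->reclen;
        }
        return pos;
}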
Our nice linear array of directory entry blocks brings up hard problem
number two: how to keep track of free space in directory entry blocks
due to deleted entries? HTree does not have that problem because it
always creates a new entry in the B-tree leaf that corresponds to the
entry's hash, and splits that block to create space if necessary. So
Shardmap needs something like a malloc, but because Shardmap competes
with HTree in performance, the cost of this has to be nearly zero. My
solution is a new algorithm called Bigmap, which records the largest
free entry in each block at a cost of just one byte per block.
Searching and updating add so little extra overhead that it is hard
to measure.
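
Here is roughly what that bookkeeping amounts to, as a sketch rather
than the real Bigmap code. Only the single byte per block is the
essential part; the granule-sized size classes, the search rover and
the rescan-on-update are just simplifications for illustration.

#include <stdint.h>

/*
 * One byte of free-space state per directory entry block: biggest[i] is
 * the size (in granules, an assumption here) of the largest free slot
 * in block i.  Create scans for a block that fits; after any create or
 * delete, the affected block's byte is refreshed.
 */
struct bigmap {
        unsigned blocks;
        unsigned rover;         /* where the last successful search ended */
        uint8_t  *biggest;
};

/* Find a block able to hold an entry needing `need` granules; -1 if none,
 * in which case the caller extends the directory by one block. */
static int bigmap_find(struct bigmap *map, uint8_t need)
{
        for (unsigned n = 0; n < map->blocks; n++) {
                unsigned i = (map->rover + n) % map->blocks;
                if (map->biggest[i] >= need)
                        return map->rover = i;
        }
        return -1;
}

/* After creating or deleting entries in block i, the caller rescans just
 * that one block and records its new largest free slot here. */
static void bigmap_update(struct bigmap *map, unsigned i, uint8_t biggest)
{
        map->biggest[i] = biggest;
}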
Putting this all together, we got our reward: a directory index that
scales efficiently to the billion file range while also handling smaller
directories at least as efficiently as current B-tree schemes. Because a
file system directory is really just a kind of specialized key-value
store, we decided to compare Shardmap performance to standalone
databases, and we found Shardmap outperforming them at create, delete
and lookup for both small and large data sets. This is by way of
gaining confidence that we did not overlook some even better way to
do things.
Please excuse me for going into this perhaps a little more deeply than I
originally intended, but this should give you some idea where we are
right now, and why we prioritized the current development work ahead of
putting Tux3 up for LKML review once again. There is still more work to
do on the Shardmap front: this code must now be ported from userspace to
kernel, work currently in progress. After that, there are some
outstanding issues to take care of with seek optimization on spinning
disk. That will bring us to the point where we are ready to make our
case for mainline merge, without needing to explain away cases where we
do not currently come out on top in file system benchmarks.
Regards,
Daniel