Progress report?
Daniel Phillips
daniel at phunq.net
Wed Apr 4 21:48:18 PDT 2018
On 2018-04-03 12:30 AM, Raymond Jennings wrote:
> Are you guys close to getting merged into mainline?
>
> I think it's high time that btrfs got a healthy dose of competition
>
Hi Raymond,
For the time being we will continue to develop out-of-tree while
tracking Linus's latest mainline kernel.
Currently, I am busy fixing Tux3's lack of directory indexing, which
becomes a performance bottleneck at more than a few hundred files per
directory. We need to fix this before seriously putting Tux3 up
against other general purpose file systems.
We could have gone with a hash-keyed B-tree indexing scheme like
everybody else, but I felt we would be better off with a completely new
approach based on scalable hash tables. I actually prototyped Shardmap
back in 2012, to the point where I convinced myself that the technology
was capable of meeting or beating B-tree performance at all scales,
while not needing a huge hack to work around the basically impossible
problem of doing readdir in hash order.
Evolving that prototype into usable code has kept me busy for a few
months now. Problem number one: a hash table does not scale naturally
the way a B-tree does; instead, the entire table needs to be expanded
as the directory grows. A simple-minded implementation would impose
huge latency on whichever create happens to trigger the expand.
Instead, Shardmap expands the hash table one shard at a time, where
the latency of expanding a single shard is just a couple of
milliseconds, so the expansion appears completely smooth to the user.
The state of this incremental reshard, as I call it, needs to be
recorded in the directory file so that the reshard will continue
exactly where it left off if the directory is reopened. After some
effort, that settled down to a simple design where the index is
represented as one or two "tiers" of hash tables, depending on
whether a reshard is in progress or not.
The lower tier merges incrementally into the upper tier until it
disappears, so that the entire hash index moves higher up in the
directory file over time, making room for a nice linear array of
directory entry blocks below it.
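
To make that concrete, here is a rough userspace sketch of the
two-tier scheme. This is not the actual Shardmap code: entries are
reduced to bare hash keys, and the names (tier, shard, reshard_step
and so on) are only illustrative. The point is just that each step
migrates a single shard, and that lookup checks both tiers while the
lower one drains.

#include <stdint.h>
#include <stdlib.h>
#include <stdbool.h>

struct shard { unsigned count, limit; uint64_t *keys; };
struct tier  { unsigned shards; struct shard *shard; }; /* power-of-two shard count */

struct index {
        struct tier hi, lo;     /* lo.shards == 0 when no reshard is in progress */
        unsigned migrated;      /* lower-tier shards already merged into hi */
};

static struct shard *pick(struct tier *t, uint64_t hash)
{
        return &t->shard[hash & (t->shards - 1)];
}

static void shard_add(struct shard *s, uint64_t key)
{
        if (s->count == s->limit) {
                s->limit = s->limit ? 2 * s->limit : 8;
                s->keys = realloc(s->keys, s->limit * sizeof *s->keys);
        }
        s->keys[s->count++] = key;
}

/* Migrate exactly one lower-tier shard into the upper tier.  Doing this
 * from the create path spreads the reshard over many operations, so no
 * single create pays for rebuilding the whole table. */
static void reshard_step(struct index *ix)
{
        if (!ix->lo.shards)
                return;                          /* no reshard in progress */
        struct shard *old = &ix->lo.shard[ix->migrated];
        for (unsigned i = 0; i < old->count; i++)
                shard_add(pick(&ix->hi, old->keys[i]), old->keys[i]);
        free(old->keys);
        old->keys = NULL;
        old->count = 0;
        if (++ix->migrated == ix->lo.shards) {   /* lower tier disappears */
                free(ix->lo.shard);
                ix->lo.shards = 0;
                ix->migrated = 0;
        }
}

static bool tier_contains(struct tier *t, uint64_t hash)
{
        struct shard *s = pick(t, hash);
        for (unsigned i = 0; i < s->count; i++)
                if (s->keys[i] == hash)
                        return true;
        return false;
}

/* Lookup consults the upper tier first, then the draining lower tier. */
static bool index_contains(struct index *ix, uint64_t hash)
{
        if (tier_contains(&ix->hi, hash))
                return true;
        return ix->lo.shards && tier_contains(&ix->lo, hash);
}

The worst-case latency a create can see is the cost of moving one
shard, which is exactly the "couple of milliseconds" figure above.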
This linear array of directory entry blocks is one of the main points of
Shardmap. It means that readdir can use a simple logical directory
address for readdir position, which is really the only way to comply
accurately with Posix readdir semantics that were originally defined
with simple linear directory layout in mind. Linear directory layout
also gives the fastest and most cache-efficient readdir, so you can walk
through an arbitrarily large Shardmap directory at essentially media
transfer speed. Finally, we avoid an issue that HTree has, where walking
the directory in hash order means that the inode table is accessed in
random order, causing increased cache pressure and (in the case of
delete) increased write multiplication.
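
For illustration, here is what readdir looks like over that linear
layout, again only a sketch with an invented record format rather
than the real on-disk code. The position is nothing more than
block * blocksize + offset, so seekdir and a reopened stream resume
at a stable logical address instead of a hash-order cookie.

#include <stdint.h>

/* Hypothetical on-disk entry record; not the actual Tux3 layout. */
struct dirent_rec {
        uint32_t inum;          /* 0 marks a deleted (reusable) slot */
        uint8_t  reclen;        /* bytes from this record to the next */
        uint8_t  namelen;
        char     name[];
};

/*
 * Walk entries starting at logical position pos = block * blocksize + offset.
 * emit() returns nonzero to stop early; the returned position points at the
 * next unread record, so a reopened directory stream resumes exactly where
 * it left off.  Records are assumed never to cross a block boundary.
 */
static uint64_t dir_walk(const char *blocks, unsigned nblocks, unsigned blocksize,
                         uint64_t pos, int (*emit)(const struct dirent_rec *))
{
        while (pos / blocksize < nblocks) {
                unsigned offset = pos % blocksize;
                const struct dirent_rec *rec = (const struct dirent_rec *)
                        (blocks + (pos / blocksize) * blocksize + offset);
                if (blocksize - offset < sizeof *rec || !rec->reclen) {
                        pos = (pos / blocksize + 1) * blocksize;  /* next block */
                        continue;
                }
                if (rec->inum && emit(rec))
                        return pos + rec->reclen;
                pos += rec->reclen;
        }
        return pos;
}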
Our nice linear array of directory entry blocks brings up hard problem
number two: how to keep track of free space in directory entry blocks
due to deleted entries? HTree does not have that problem because it
always creates a new entry in the B-tree leaf that corresponds to the
entry's hash, and splits that block to create space if necessary. So
Shardmap needs something like a malloc, but because Shardmap competes
with HTree in performance, the cost of this has to be nearly zero. My
solution is a new algorithm called Bigmap, which records the largest
free entry in each block at a cost of just one byte per block.
Searching and updating add so little extra overhead that it is hard
to measure.
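
Here is roughly what that bookkeeping amounts to, as a sketch rather
than the real Bigmap code. Only the single byte per block is the
essential part; the granule-sized size classes, the search rover and
the rescan-on-update are just simplifications for illustration.

#include <stdint.h>

/*
 * One byte of free-space state per directory entry block: biggest[i] is
 * the size (in granules, an assumption here) of the largest free slot
 * in block i.  Create scans for a block that fits; after any create or
 * delete, the affected block's byte is refreshed.
 */
struct bigmap {
        unsigned blocks;
        unsigned rover;         /* where the last successful search ended */
        uint8_t  *biggest;
};

/* Find a block able to hold an entry needing `need` granules; -1 if none,
 * in which case the caller extends the directory by one block. */
static int bigmap_find(struct bigmap *map, uint8_t need)
{
        for (unsigned n = 0; n < map->blocks; n++) {
                unsigned i = (map->rover + n) % map->blocks;
                if (map->biggest[i] >= need)
                        return map->rover = i;
        }
        return -1;
}

/* After creating or deleting entries in block i, the caller rescans just
 * that one block and records its new largest free slot here. */
static void bigmap_update(struct bigmap *map, unsigned i, uint8_t biggest)
{
        map->biggest[i] = biggest;
}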
Putting this all together, we got our reward: a directory index that
scales efficiently to the billion file range while also handling smaller
directories at least as efficiently as current B-tree schemes. Because a
file system directory is really just a kind of specialized key-value
store, we decided to compare Shardmap performance to standalone
databases, and we found Shardmap outperforming them at create, delete
and lookup for both small and large data sets. This is by way of
gaining confidence that we did not overlook some even better way to
do things.
Please excuse me for going into this perhaps a little more deeply than I
originally intended, but this should give you some idea where we are
right now, and why we prioritized the current development work ahead of
putting Tux3 up for LKML review once again. There is still more work to
do on the Shardmap front: this code must now be ported from userspace to
kernel, work currently in progress. After that, there are some
outstanding issues to take care of with seek optimization on spinning
disk. That will bring us to the point where we are ready to make our
case for mainline merge, without needing to explain away cases where we
do not currently come out on top in file system benchmarks.
Regards,
Daniel