[Tux3] Splitting in the middle considered harmful

Daniel Phillips phillips at phunq.net
Tue Aug 19 13:54:53 PDT 2008


A quick look at the results from the new btree unit tests showed all
nodes 50% full.  This is because the key insertion test just inserts
sequential keys, a good simulation of sequential file writing.  Each
leaf gets split in the middle then no new keys will ever be inserted
into the lower block of the split.  At least, not until versioning is
supported and even then the common case is to truncate to empty,
creating an entirely new file data attribute that will then exhibit
the same undesirable behavior.

So why go to all the effort of making our file index leaf format
really compact (thanks Shapor) if we are then going to throw away half
of it?  Splitting in the middle is not necessarily the right thing to
do.

The latest check-in now passes the target key to the split method
so it can make a more intelligent decision.  The btree unit test now
just checks to see if the new key is beyond the final key in the
(presumably full) block and splits off a new, empty block if so.  The
result for sequential key creation is now 100% full leaves, and the
behavior for random key insertions (random file writing, rare except
for database type loads) should be hardly affected.  This looks like
a big win for metadata compactness... twice as compact for two new
lines of new code.  Good deal.

This idea can be generalized somewhat.  Splitting should binsearch on
the key and split roughly where the key lands, with perhaps some
adjustment to accommodate large items that dominate some leaf.

The best place to split depends heavily on the role that a particular
btree plays.  For file index trees we pretty obviously need to cater
to the expected serial insertion behavior.  The inode table is not so
obvious; that depends to a large extent on what kind of inum selection
policy we come up with.  But we can expect a certain amount of serial
effect there too: we want to allocate inums for files created in the
same directory sequentially.  We will also see a lot more random
creation in the inode table as files are created in directories all
over the volume with different goal inums.  Inum targetting for inode
creation is the next thing on my list, so these ideas will develop as
that effort proceeds.

Interior btree nodes tend to suffer from the same problem.  For now I
will just let them split down to half size as they do, and later when
interior nodes get their own specialized methods, apply similar
heuristics to the split as for leaves.

There are cases where it really is best to split right in the middle,
for example, HTree which keys on a hash of the filename that is
designed to be as random as possible.  Splitting exactly in the middle
results in average 75% leaf fullness, not too bad.  To improve on that
we would need a more complex splitting algorithm along the lines of
b+tree, which splits two leaves into three.  Since the split method is
customized per tree type, this special requirement fits nicely into the
design framework.

Regards,

Daniel

_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3



More information about the Tux3 mailing list