[Tux3] Splitting in the middle considered harmful

Antonio Vargas windenntw at gmail.com
Thu Aug 21 02:44:10 PDT 2008

On Tue, Aug 19, 2008 at 9:54 PM, Daniel Phillips <phillips at phunq.net> wrote:

> A quick look at the results from the new btree unit tests showed all
> nodes 50% full.  This is because the key insertion test just inserts
> sequential keys, a good simulation of sequential file writing.  Each
> leaf gets split in the middle then no new keys will ever be inserted
> into the lower block of the split.  At least, not until versioning is
> supported and even then the common case is to truncate to empty,
> creating an entirely new file data attribute that will then exhibit
> the same undesirable behavior.
> So why go to all the effort of making our file index leaf format
> really compact (thanks Shapor) if we are then going to throw away half
> of it?  Splitting in the middle is not necessarily the right thing to
> do.
> The latest check-in now passes the target key to the split method
> so it can make a more intelligent decision.  The btree unit test now
> just checks to see if the new key is beyond the final key in the
> (presumably full) block and splits off a new, empty block if so.  The
> result for sequential key creation is now 100% full leaves, and the
> behavior for random key insertions (random file writing, rare except
> for database type loads) should be hardly affected.  This looks like
> a big win for metadata compactness... twice as compact for two new
> lines of new code.  Good deal.

But would this mean that when later adding to that btree block, it would
split into two ~50% filled blocks? An useful trick some databases do is to
split a 100% block at the far end of the tree into a 90/10 split, so that
later small modifications on the 90% block don't need an immediate split
into 50/50.

> This idea can be generalized somewhat.  Splitting should binsearch on
> the key and split roughly where the key lands, with perhaps some
> adjustment to accommodate large items that dominate some leaf.
> The best place to split depends heavily on the role that a particular
> btree plays.  For file index trees we pretty obviously need to cater
> to the expected serial insertion behavior.  The inode table is not so
> obvious; that depends to a large extent on what kind of inum selection
> policy we come up with.  But we can expect a certain amount of serial
> effect there too: we want to allocate inums for files created in the
> same directory sequentially.  We will also see a lot more random
> creation in the inode table as files are created in directories all
> over the volume with different goal inums.  Inum targetting for inode
> creation is the next thing on my list, so these ideas will develop as
> that effort proceeds.
> Interior btree nodes tend to suffer from the same problem.  For now I
> will just let them split down to half size as they do, and later when
> interior nodes get their own specialized methods, apply similar
> heuristics to the split as for leaves.
> There are cases where it really is best to split right in the middle,
> for example, HTree which keys on a hash of the filename that is
> designed to be as random as possible.  Splitting exactly in the middle
> results in average 75% leaf fullness, not too bad.  To improve on that
> we would need a more complex splitting algorithm along the lines of
> b+tree, which splits two leaves into three.  Since the split method is
> customized per tree type, this special requirement fits nicely into the
> design framework.

The though process described here seems to be:

1. split 1 block into 2
2. enhancement: split 2 blocks into 3
3. enhancement: split 3 blocks into 4

If you iterate this process, you end up with the reiserfs4 way:

When need to flush a whole lot of dirty blocks which are or are not 100%
full in memory: compact all of them at the same time into the minimum amount
of blocks, then write that to disk.

> Regards,
> Daniel
> _______________________________________________
> Tux3 mailing list
> Tux3 at tux3.org
> http://tux3.org/cgi-bin/mailman/listinfo/tux3

Greetz, Antonio Vargas aka winden of rgba^ntw^bg

windenntw at gmail.com

Every day, every year
you have to work
you have to study
you have to scene.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://phunq.net/pipermail/tux3/attachments/20080821/ee68498f/attachment-0001.html>
-------------- next part --------------
Tux3 mailing list
Tux3 at tux3.org

More information about the Tux3 mailing list