[Tux3] Splitting in the middle considered harmful

Daniel Phillips phillips at phunq.net
Thu Aug 21 03:05:20 PDT 2008


On Thursday 21 August 2008 02:44, Antonio Vargas wrote:
> > The latest check-in now passes the target key to the split method
> > so it can make a more intelligent decision.  The btree unit test now
> > just checks to see if the new key is beyond the final key in the
> > (presumably full) block and splits off a new, empty block if so.  The
> > result for sequential key creation is now 100% full leaves, and the
> > behavior for random key insertions (random file writing, rare except
> > for database type loads) should be hardly affected.  This looks like
> > a big win for metadata compactness... twice as compact for two new
> > lines of code.  Good deal.
> >
> 
> But would this mean that when later adding to that btree block, it would
> split into two ~50% filled blocks? A useful trick some databases do is to
> split a 100% block at the far end of the tree into a 90/10 split, so that
> later small modifications on the 90% block don't need an immediate split
> into 50/50.

It might indeed be a good thing to do.  But splitting into 50/50 isn't
so bad either, just for a few randomly accessed blocks.  I suppose a
database load like:

  1) dd if=/dev/zero of=mydb count=huge
  2) dbapp mydb

would create lots of 100% full blocks in step 1, then turn them all
into 50% full blocks in step 2.  But when the first snapshot comes
along, the 50% full blocks are going to start to fill up again with
new versioned pointers, so I don't think this is so bad.  If we have
50% full index blocks just for a few rare db type files, that is still
better than the way it was, where all the common sequential files had
50% full blocks.

95/5 anyone?
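
For concreteness, here is a minimal sketch of the kind of key-aware
split point discussed above.  The bleaf layout and the leaf_split_point
name are illustrative assumptions, not the actual Tux3 split method;
the last line is where a 90/10 or 95/5 policy would replace the classic
middle split:

  typedef unsigned long long tuxkey_t;

  struct bleaf {
          unsigned count;          /* keys currently in the leaf */
          tuxkey_t keys[];         /* sorted key array (assumed layout) */
  };

  /*
   * Decide how many keys stay behind when a full leaf must split to
   * admit newkey.  Sequential appends land beyond the last key, so the
   * old leaf stays 100% full and the new leaf starts empty; everything
   * else falls back to a middle split.
   */
  static unsigned leaf_split_point(struct bleaf *leaf, tuxkey_t newkey)
  {
          if (newkey > leaf->keys[leaf->count - 1])
                  return leaf->count;  /* split off an empty leaf */
          return leaf->count / 2;      /* 50/50; count * 9 / 10 gives 90/10 */
  }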
 
> > There are cases where it really is best to split right in the middle,
> > for example, HTree which keys on a hash of the filename that is
> > designed to be as random as possible.  Splitting exactly in the middle
> > results in average 75% leaf fullness (each leaf drifts between 50% just
> > after a split and 100% just before the next), not too bad.  To improve
> > on that
> > we would need a more complex splitting algorithm along the lines of
> > b+tree, which splits two leaves into three.  Since the split method is
> > customized per tree type, this special requirement fits nicely into the
> > design framework.
> >
> 
> The thought process described here seems to be:
> 
> 1. split 1 block into 2
> 2. enhancement: split 2 blocks into 3
> 3. enhancement: split 3 blocks into 4
> 
> If you iterate this process, you end up with the reiserfs4 way:
> 
> When you need to flush a whole lot of dirty blocks, which may or may not be
> 100% full in memory: compact all of them at the same time into the minimum
> number of blocks, then write that to disk.

Right, it is a good argument for the importance of deferred allocation,
which Tux3 does right from the start, and the cool thing is, it needed
very little code.  I will try to take care to mark the places that have
to be elaborated to go to b+tree or b++tree etc. allocation as you
describe.
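
As a rough sketch of that compaction idea, the loop below repacks a
list of dirty, partially full leaves into as few blocks as possible
before writeback.  The dirty_leaf structure, leaf_merge and
compact_dirty_leaves are all hypothetical names; this only illustrates
the reiserfs4-style repacking from the thread, not Tux3 code:

  struct dirty_leaf {
          struct dirty_leaf *next;
          unsigned count, max;             /* entries used / capacity */
  };

  /*
   * Move n entries from the front of src to the tail of dst.  A real
   * leaf would memmove the entries too; the sketch only tracks counts.
   */
  static void leaf_merge(struct dirty_leaf *dst, struct dirty_leaf *src,
                         unsigned n)
  {
          dst->count += n;
          src->count -= n;
  }

  /* Repack dirty leaves so all but the last come out 100% full. */
  static void compact_dirty_leaves(struct dirty_leaf *dst)
  {
          while (dst && dst->next) {
                  struct dirty_leaf *src = dst->next;
                  unsigned room = dst->max - dst->count;
                  unsigned move = room < src->count ? room : src->count;

                  leaf_merge(dst, src, move);
                  if (src->count)
                          dst = src;              /* dst full; move on */
                  else
                          dst->next = src->next;  /* src emptied; unlink */
          }
  }

With deferred allocation the emptied leaves never get disk addresses at
all, which is exactly the win being argued for here.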

A similar thing has to happen in version delete, which will dirty masses
of blocks that have to be compacted and written back to disk.  This has
been working pretty well in ddsnap for quite some time.

Regards,

Daniel
