[Tux3] Splitting in the middle considered harmful
Antonio Vargas
windenntw at gmail.com
Thu Aug 21 02:44:10 PDT 2008
On Tue, Aug 19, 2008 at 9:54 PM, Daniel Phillips <phillips at phunq.net> wrote:
> A quick look at the results from the new btree unit tests showed all
> nodes 50% full. This is because the key insertion test just inserts
> sequential keys, a good simulation of sequential file writing. Each
> leaf gets split in the middle then no new keys will ever be inserted
> into the lower block of the split. At least, not until versioning is
> supported and even then the common case is to truncate to empty,
> creating an entirely new file data attribute that will then exhibit
> the same undesirable behavior.
>
> So why go to all the effort of making our file index leaf format
> really compact (thanks Shapor) if we are then going to throw away half
> of it? Splitting in the middle is not necessarily the right thing to
> do.
>
> The latest check-in now passes the target key to the split method
> so it can make a more intelligent decision. The btree unit test now
> just checks to see if the new key is beyond the final key in the
> (presumably full) block and splits off a new, empty block if so. The
> result for sequential key creation is now 100% full leaves, and the
> behavior for random key insertions (random file writing, rare except
> for database type loads) should be hardly affected. This looks like
> a big win for metadata compactness... twice as compact for two new
> lines of new code. Good deal.
>
But would this mean that when later adding to that btree block, it would
split into two ~50% filled blocks? An useful trick some databases do is to
split a 100% block at the far end of the tree into a 90/10 split, so that
later small modifications on the 90% block don't need an immediate split
into 50/50.
>
> This idea can be generalized somewhat. Splitting should binsearch on
> the key and split roughly where the key lands, with perhaps some
> adjustment to accommodate large items that dominate some leaf.
>
> The best place to split depends heavily on the role that a particular
> btree plays. For file index trees we pretty obviously need to cater
> to the expected serial insertion behavior. The inode table is not so
> obvious; that depends to a large extent on what kind of inum selection
> policy we come up with. But we can expect a certain amount of serial
> effect there too: we want to allocate inums for files created in the
> same directory sequentially. We will also see a lot more random
> creation in the inode table as files are created in directories all
> over the volume with different goal inums. Inum targetting for inode
> creation is the next thing on my list, so these ideas will develop as
> that effort proceeds.
>
> Interior btree nodes tend to suffer from the same problem. For now I
> will just let them split down to half size as they do, and later when
> interior nodes get their own specialized methods, apply similar
> heuristics to the split as for leaves.
>
> There are cases where it really is best to split right in the middle,
> for example, HTree which keys on a hash of the filename that is
> designed to be as random as possible. Splitting exactly in the middle
> results in average 75% leaf fullness, not too bad. To improve on that
> we would need a more complex splitting algorithm along the lines of
> b+tree, which splits two leaves into three. Since the split method is
> customized per tree type, this special requirement fits nicely into the
> design framework.
>
The though process described here seems to be:
1. split 1 block into 2
2. enhancement: split 2 blocks into 3
3. enhancement: split 3 blocks into 4
If you iterate this process, you end up with the reiserfs4 way:
When need to flush a whole lot of dirty blocks which are or are not 100%
full in memory: compact all of them at the same time into the minimum amount
of blocks, then write that to disk.
>
> Regards,
>
> Daniel
>
> _______________________________________________
> Tux3 mailing list
> Tux3 at tux3.org
> http://tux3.org/cgi-bin/mailman/listinfo/tux3
>
--
Greetz, Antonio Vargas aka winden of rgba^ntw^bg
http://winden.wordpress.com/
windenntw at gmail.com
Every day, every year
you have to work
you have to study
you have to scene.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://phunq.net/pipermail/tux3/attachments/20080821/ee68498f/attachment-0001.html>
-------------- next part --------------
_______________________________________________
Tux3 mailing list
Tux3 at tux3.org
http://tux3.org/cgi-bin/mailman/listinfo/tux3
More information about the Tux3
mailing list