<div dir="ltr"><br><br><div class="gmail_quote">On Tue, Aug 19, 2008 at 9:54 PM, Daniel Phillips <span dir="ltr"><<a href="mailto:phillips@phunq.net">phillips@phunq.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

A quick look at the results from the new btree unit tests showed all<br>

nodes 50% full.  This is because the key insertion test just inserts<br>

sequential keys, a good simulation of sequential file writing.  Each<br>

leaf gets split in the middle then no new keys will ever be inserted<br>

into the lower block of the split.  At least, not until versioning is<br>

supported and even then the common case is to truncate to empty,<br>

creating an entirely new file data attribute that will then exhibit<br>

the same undesirable behavior.<br>

<br>

So why go to all the effort of making our file index leaf format<br>

really compact (thanks Shapor) if we are then going to throw away half<br>

of it?  Splitting in the middle is not necessarily the right thing to<br>

do.<br>

<br>

The latest check-in now passes the target key to the split method<br>

so it can make a more intelligent decision.  The btree unit test now<br>

just checks to see if the new key is beyond the final key in the<br>

(presumably full) block and splits off a new, empty block if so.  The<br>

result for sequential key creation is now 100% full leaves, and the<br>

behavior for random key insertions (random file writing, rare except<br>

for database type loads) should be hardly affected.  This looks like<br>

a big win for metadata compactness... twice as compact for two new<br>

lines of new code.  Good deal.<br>

</blockquote><div><br>But would this mean that when later adding to that btree block, it would split into two ~50% filled blocks? An useful trick some databases do is to split a 100% block at the far end of the tree into a 90/10 split, so that later small modifications on the 90% block don't need an immediate split into 50/50.<br>

 </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>

This idea can be generalized somewhat.  Splitting should binsearch on<br>

the key and split roughly where the key lands, with perhaps some<br>

adjustment to accommodate large items that dominate some leaf.<br>

<br>

The best place to split depends heavily on the role that a particular<br>

btree plays.  For file index trees we pretty obviously need to cater<br>

to the expected serial insertion behavior.  The inode table is not so<br>

obvious; that depends to a large extent on what kind of inum selection<br>

policy we come up with.  But we can expect a certain amount of serial<br>

effect there too: we want to allocate inums for files created in the<br>

same directory sequentially.  We will also see a lot more random<br>

creation in the inode table as files are created in directories all<br>

over the volume with different goal inums.  Inum targetting for inode<br>

creation is the next thing on my list, so these ideas will develop as<br>

that effort proceeds.<br>

<br>

Interior btree nodes tend to suffer from the same problem.  For now I<br>

will just let them split down to half size as they do, and later when<br>

interior nodes get their own specialized methods, apply similar<br>

heuristics to the split as for leaves.<br>

<br>

There are cases where it really is best to split right in the middle,<br>

for example, HTree which keys on a hash of the filename that is<br>

designed to be as random as possible.  Splitting exactly in the middle<br>

results in average 75% leaf fullness, not too bad.  To improve on that<br>

we would need a more complex splitting algorithm along the lines of<br>

b+tree, which splits two leaves into three.  Since the split method is<br>

customized per tree type, this special requirement fits nicely into the<br>

design framework.<br>

</blockquote><div><br>The though process described here seems to be:<br><br>1. split 1 block into 2<br>2. enhancement: split 2 blocks into 3<br>3. enhancement: split 3 blocks into 4<br><br>If you iterate this process, you end up with the reiserfs4 way:<br>

<br>When need to flush a whole lot of dirty blocks which are or are not 100% full in memory: compact all of them at the same time into the minimum amount of blocks, then write that to disk.<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">

<br>

Regards,<br>

<br>

Daniel<br>

<br>

_______________________________________________<br>

Tux3 mailing list<br>

<a href="mailto:Tux3@tux3.org">Tux3@tux3.org</a><br>

<a href="http://tux3.org/cgi-bin/mailman/listinfo/tux3" target="_blank">http://tux3.org/cgi-bin/mailman/listinfo/tux3</a><br>

</blockquote></div><br><br clear="all"><br>-- <br>Greetz, Antonio Vargas aka winden of rgba^ntw^bg<br><br><a href="http://winden.wordpress.com/">http://winden.wordpress.com/</a><br><a href="mailto:windenntw@gmail.com">windenntw@gmail.com</a><br>

<br>Every day, every year<br>you have to work<br>you have to study<br>you have to scene.<br>

</div>