<div dir="ltr"><br><br><div class="gmail_quote">On Tue, Aug 19, 2008 at 9:54 PM, Daniel Phillips <span dir="ltr"><<a href="mailto:phillips@phunq.net">phillips@phunq.net</a>></span> wrote:<br><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
A quick look at the results from the new btree unit tests showed all<br>
nodes 50% full. This is because the key insertion test just inserts<br>
sequential keys, a good simulation of sequential file writing. Each<br>
leaf gets split in the middle, after which no new keys will ever be inserted<br>
into the lower block of the split. At least, not until versioning is<br>
supported, and even then the common case is to truncate to empty,<br>
creating an entirely new file data attribute that will then exhibit<br>
the same undesirable behavior.<br>
<br>
So why go to all the effort of making our file index leaf format<br>
really compact (thanks Shapor) if we are then going to throw away half<br>
of it? Splitting in the middle is not necessarily the right thing to<br>
do.<br>
<br>
The latest check-in now passes the target key to the split method<br>
so it can make a more intelligent decision. The btree unit test now<br>
just checks to see if the new key is beyond the final key in the<br>
(presumably full) block and splits off a new, empty block if so. The<br>
result for sequential key creation is now 100% full leaves, and the<br>
behavior for random key insertions (random file writing, rare except<br>
for database type loads) should be hardly affected. This looks like<br>
a big win for metadata compactness... twice as compact for two new<br>
lines of code. Good deal.<br>
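The split-off-an-empty-block heuristic can be sketched as follows (a hypothetical model of the decision only; the actual Tux3 split method operates on block buffers and the names here are invented):

```python
def choose_split(keys, new_key):
    """Pick a split point for a full leaf, given the key being inserted.

    Hypothetical sketch: if the new key lies beyond the final key in the
    block -- the sequential-write case -- split off a new, empty block so
    the old leaf stays 100% full; otherwise fall back to a middle split.
    """
    if new_key > keys[-1]:
        return len(keys)     # new right block starts empty
    return len(keys) // 2    # classic 50/50 split

# Sequential insertion: the full leaf keeps every one of its keys
assert choose_split([10, 20, 30, 40], 50) == 4
# Insertion inside the existing run: middle split
assert choose_split([10, 20, 30, 40], 25) == 2
```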
</blockquote><div><br>But would this mean that when keys are later added to that btree block, it would split into two ~50% full blocks? A useful trick some databases use is to split a 100% full block at the far end of the tree into a 90/10 split, so that later small modifications to the 90% block don't force an immediate 50/50 split.<br>
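The 90/10 trick amounts to a different choice of split point on the append path (a hypothetical sketch with invented names, not any database's actual code):

```python
def choose_split_90_10(keys, new_key):
    """Variant of the split-point choice using the 90/10 trick.

    Hypothetical sketch: on an append at the far end of the tree, keep
    ~90% of the entries in the left block instead of detaching a fully
    empty one, leaving slack for later small out-of-order inserts.
    """
    if new_key > keys[-1]:
        return (len(keys) * 9) // 10  # left block keeps ~90%
    return len(keys) // 2             # otherwise 50/50

assert choose_split_90_10(list(range(100)), 1000) == 90
assert choose_split_90_10(list(range(100)), 42) == 50
```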
</div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;"><br>
This idea can be generalized somewhat. Splitting should binsearch on<br>
the key and split roughly where the key lands, with perhaps some<br>
adjustment to accommodate large items that dominate some leaf.<br>
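The generalized version might look like this (hypothetical sketch; a real split would also have to weigh variable item sizes rather than just count entries):

```python
from bisect import bisect_left

def split_near_key(keys, new_key, min_left=1):
    """Split roughly where the incoming key would land.

    Hypothetical sketch: binsearch for the insertion point and split
    there, clamped so the left block keeps at least min_left entries.
    An append past the last key yields an empty right block.
    """
    point = bisect_left(keys, new_key)
    return max(point, min_left)

assert split_near_key([10, 20, 30, 40], 25) == 2   # split where the key lands
assert split_near_key([10, 20, 30, 40], 50) == 4   # append: empty right block
assert split_near_key([10, 20, 30, 40], 5) == 1    # clamped at the low end
```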
<br>
The best place to split depends heavily on the role that a particular<br>
btree plays. For file index trees we pretty obviously need to cater<br>
to the expected serial insertion behavior. The inode table is not so<br>
obvious; that depends to a large extent on what kind of inum selection<br>
policy we come up with. But we can expect a certain amount of serial<br>
effect there too: we want to allocate inums for files created in the<br>
same directory sequentially. We will also see a lot more random<br>
creation in the inode table as files are created in directories all<br>
over the volume with different goal inums. Inum targeting for inode<br>
creation is the next thing on my list, so these ideas will develop as<br>
that effort proceeds.<br>
<br>
Interior btree nodes tend to suffer from the same problem. For now I<br>
will just let them split down to half size as they do, and later when<br>
interior nodes get their own specialized methods, apply similar<br>
heuristics to the split as for leaves.<br>
<br>
There are cases where it really is best to split right in the middle,<br>
for example, HTree which keys on a hash of the filename that is<br>
designed to be as random as possible. Splitting exactly in the middle<br>
results in average 75% leaf fullness, not too bad. To improve on that<br>
we would need a more complex splitting algorithm along the lines of a<br>
B*-tree, which splits two leaves into three. Since the split method is<br>
customized per tree type, this special requirement fits nicely into the<br>
design framework.<br>
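A quick toy simulation of middle splitting under uniformly random keys (not the HTree or Tux3 code; names and the linear leaf scan are stand-ins) lands in the same ballpark as that estimate. Classic fringe analysis of random B-tree insertion puts the steady state near ln 2 &asymp; 69%:

```python
import random

def middle_split_fullness(n_keys, capacity, seed=42):
    """Insert random keys into leaves that always split in the middle,
    then report average leaf fullness.  A toy model: the linear scan
    below stands in for a btree probe."""
    random.seed(seed)
    leaves = [[]]                     # leaves kept in key order
    for _ in range(n_keys):
        key = random.random()
        i = 0
        while i + 1 < len(leaves) and leaves[i + 1][0] <= key:
            i += 1
        leaf = leaves[i]
        leaf.append(key)
        leaf.sort()
        if len(leaf) > capacity:      # overflow: split in the middle
            mid = len(leaf) // 2
            leaves[i:i + 1] = [leaf[:mid], leaf[mid:]]
    return sum(len(l) for l in leaves) / (capacity * len(leaves))

fullness = middle_split_fullness(5000, 8)
assert 0.55 < fullness < 0.9          # near the ~69% theoretical figure
```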
</blockquote><div><br>The thought process described here seems to be:<br><br>1. split 1 block into 2<br>2. enhancement: split 2 blocks into 3<br>3. enhancement: split 3 blocks into 4<br><br>If you iterate this process, you end up with the reiserfs4 way:<br>
<br>When you need to flush a whole lot of dirty blocks, which may or may not be 100% full in memory, compact all of them at the same time into the minimum number of blocks, then write that to disk.<br><br> </div><blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
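The repack-at-flush idea reduces to gathering the entries of a run of dirty leaves and refilling blocks to capacity (a hypothetical sketch of the idea, not the actual reiserfs4 code):

```python
def compact_on_flush(dirty_leaves, capacity):
    """Repack the entries of a run of dirty in-memory leaves into the
    minimum number of blocks before writing them out.  Hypothetical
    sketch: flatten, then refill blocks to capacity."""
    entries = [e for leaf in dirty_leaves for e in leaf]
    return [entries[i:i + capacity] for i in range(0, len(entries), capacity)]

# Four partly-full dirty leaves compact into two blocks
dirty = [[1, 2], [3], [4, 5, 6], [7]]
assert compact_on_flush(dirty, 4) == [[1, 2, 3, 4], [5, 6, 7]]
```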
<br>
Regards,<br>
<br>
Daniel<br>
<br>
_______________________________________________<br>
Tux3 mailing list<br>
<a href="mailto:Tux3@tux3.org">Tux3@tux3.org</a><br>
<a href="http://tux3.org/cgi-bin/mailman/listinfo/tux3" target="_blank">http://tux3.org/cgi-bin/mailman/listinfo/tux3</a><br>
</blockquote></div><br><br clear="all"><br>-- <br>Greetz, Antonio Vargas aka winden of rgba^ntw^bg<br><br><a href="http://winden.wordpress.com/">http://winden.wordpress.com/</a><br><a href="mailto:windenntw@gmail.com">windenntw@gmail.com</a><br>
<br>Every day, every year<br>you have to work<br>you have to study<br>you have to scene.<br>
</div>