PHTree performance analysis

Kent Overstreet koverstreet at google.com
Sun Jan 6 12:12:12 PST 2013


So, some thoughts.

I don't yet have my brain wrapped around htree/phtree well enough to do
any kind of real performance analysis of it, but maybe Daniel can fill
in some of the gaps.

I'm less concerned with lookup speed, because I think for our purposes
the on disk format isn't that important here. The fact that disk IO is
so much slower than CPUs these days means that even if your on disk
format is terrible w.r.t. lookup performance, you can afford to do a lot
of work when you read in a node/metadata block/whatever to fix that.

IOW, if you know an optimal in memory data structure you can use it
regardless of what's on disk. Handwaving details away, but this'll be
true for any data structure that looks vaguely like a btree.
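
To illustrate the "fix it up when you read it" point - this is just a
sketch, and the on-disk layout and names here are my own invention, not
anything from htree/phtree: even if a node is stored as unsorted
fixed-size records, one O(n) pass at read time buys fast lookups for as
long as the node stays cached.

```python
# Sketch only: a pretend lookup-hostile on-disk node format, fixed up
# into a dict at read time. All layout details here are illustrative.

def parse_node(raw: bytes, rec_size=16):
    """Pretend on-disk format: unsorted fixed-size key/value records."""
    recs = [raw[i:i + rec_size] for i in range(0, len(raw), rec_size)]
    # One O(n) pass at read time buys O(1) lookups afterwards.
    return {r[:8]: r[8:] for r in recs}

# Three 16-byte records with "keys" 3, 1, 2 - deliberately unsorted.
node = parse_node(b"".join(bytes([i]) * 16 for i in (3, 1, 2)))
print(node[bytes([1]) * 8])   # the value half of record 1
```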

Also, I don't know of any mathematical framework for analyzing the
performance of data structures with the parameters we care about, and
there probably should be one - that's the angle I'm approaching things
from. It's become clear to me that some notion of degrees of freedom is
really crucial here. (Also, the problems with on disk data structures
are not specific to disks - memory has the same fundamental
characteristics, and they're (probably) slowly becoming more important;
c.f. Judy arrays.)

Anyways.

Something obvious but maybe worth noting up front is that if it wasn't
for the telldir problem, indexing directories would be just like
indexing anything else and we'd probably be best off using the same data
structure for dirents as for extents/inodes/whatever else. We still
might be, I don't know.

What this means is an index specialized for dirents is going to have
some advantages (due to not needing a second index for telldir id ->
dirname), but it's going to have some disadvantages and the crucial
question is going to be whether they're worth it.

As I mentioned, I'm not too concerned about lookup performance, so all
that's really left is insertions (deletion is similar enough to
insertion, from a performance analysis pov, that I don't care).

With insertions, you've got to update the index at some point, so disk
IO is going to be the dominating factor.

For a first-cut performance analysis, we can just count the number of
IOs required per insert and ignore everything else.

And since dirents are so much smaller than your optimum write size, this
means the dominating factor on number of IOs required per insert is
going to be the amount of coalescing you can do.
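
To put rough numbers on that - the sizes below are illustrative round
numbers I picked for the arithmetic, not measurements of anything:

```python
# Illustrative arithmetic only: the dirent size and write size are
# assumed round numbers, not measurements from any real filesystem.

DIRENT_SIZE = 64          # assumed average on-disk dirent size, bytes
WRITE_SIZE = 512 * 1024   # assumed "optimum" write size, bytes

dirents_per_write = WRITE_SIZE // DIRENT_SIZE

# Worst case: no coalescing, every insert costs a full-block write.
ios_per_insert_worst = 1.0

# Best case: perfect coalescing, every full-size write carries
# dirents_per_write brand-new dirents.
ios_per_insert_best = 1.0 / dirents_per_write

print(dirents_per_write)    # 8192
print(ios_per_insert_best)  # ~0.00012 - four orders of magnitude better
```

So with these assumed sizes there's roughly a factor of 8000 between no
coalescing and perfect coalescing - which is why it dominates.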

Conventional data structures - btrees, hash tables - have the problem
that the place a key has to reside in the index (on disk) is very highly
determined. This means that doing random insertions/updates to your
index basically forces you to constantly rewrite the entire thing.
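
A toy simulation of why fixed placement hurts - the block and key
counts are made-up parameters, but the shape of the result isn't:
hash random keys into fixed block positions and almost every insert
dirties a different block.

```python
# Toy simulation: in a btree/hash table, each key's block is fixed by
# the key itself. NUM_BLOCKS and KEYS_PER_BLOCK are assumed parameters.

import random

random.seed(0)

NUM_BLOCKS = 1024      # assumed index size, in blocks
KEYS_PER_BLOCK = 256   # one block's worth of random insertions

# Insert one block's worth of random keys; count how many distinct
# blocks get dirtied, i.e. must eventually be rewritten.
dirty = {random.randrange(NUM_BLOCKS) for _ in range(KEYS_PER_BLOCK)}
print(len(dirty))  # close to KEYS_PER_BLOCK: nearly one IO per insert
```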

(This is what Daniel was getting at with cache issues, but IMO that's
the wrong way to think about this problem; you may have enough memory to
keep the entire index dirty in memory, but you still need to efficiently
persist changes unless it's for an application where performance
synchronizing the index on disk doesn't matter - in which case, why are
you using an on disk data structure at all?)

So, in recent years this has spawned at least a couple new schemes -
they're called compacting data structures. Basically, all
insertions/updates are grouped together, and they lazily percolate
up/out as your small mutable region fills up.
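
A toy sketch of one such scheme - loosely LSM-tree shaped; the class,
the sizes, and the merge policy here are my own illustration, not
anything from phtree or tux3:

```python
# Toy compacting index: inserts land in a small mutable region (the
# memtable), flushes are big sequential writes of sorted runs, and
# runs of similar size get lazily merged. All names/sizes are
# illustrative assumptions.

import bisect

class CompactingIndex:
    def __init__(self, memtable_max=4):
        self.memtable = {}        # small mutable region, in memory
        self.runs = []            # sorted immutable runs, newest first
        self.memtable_max = memtable_max
        self.blocks_written = 0   # stand-in for IOs issued

    def insert(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_max:
            self._flush()

    def _flush(self):
        # One big sequential write of many coalesced keys - this is
        # where the IO savings come from.
        self.runs.insert(0, sorted(self.memtable.items()))
        self.memtable = {}
        self.blocks_written += 1
        self._maybe_compact()

    def _maybe_compact(self):
        # Lazily percolate up/out: merge runs of similar size, so old
        # data only gets rewritten O(log n) times over its lifetime.
        while len(self.runs) >= 2 and len(self.runs[0]) >= len(self.runs[1]):
            newer = self.runs.pop(0)
            older = self.runs.pop(0)
            merged = dict(older)
            merged.update(newer)  # newer values win
            self.runs.insert(0, sorted(merged.items()))
            self.blocks_written += 1

    def lookup(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in self.runs:     # newest run first
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

idx = CompactingIndex()
for i in range(16):
    idx.insert(i, str(i))
print(idx.lookup(3))         # "3"
print(idx.blocks_written)    # 7 big sequential writes instead of 16
```

Even at this toy scale, 16 inserts cost 7 big sequential writes rather
than 16 scattered ones, and the gap widens as the memtable grows.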

So, I'm going to _completely handwave_ here and say the optimal
performance for insertions/updates w.r.t. disk IO is O(n*log(n)) total
IOs for n inserts - i.e. amortized O(log(n)) per insert - where each IO
is completely full, i.e. we're doing the maximal amount of compaction
possible.

This is also assuming we have a reasonable amount of room to coalesce,
obviously if we have a workload where it's

insert();
sync();

that's going to be a problem.

But assuming we don't have forced syncs, it is possible to do inserts
basically as fast as we can write out 512k blocks (or whatever) full of
keys to disk. And these databases have been around long enough to
demonstrate that that is more or less possible in practice.
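
Back-of-envelope on what "as fast as we can write blocks of keys"
means - bandwidth and dirent size are assumed round numbers, not
measurements:

```python
# Back-of-envelope only: the bandwidth and dirent size below are
# assumed round numbers, not measurements of any real disk or fs.

WRITE_BW = 100 * 1024 * 1024   # assumed sequential write bandwidth, B/s
DIRENT_SIZE = 64               # assumed on-disk dirent size, bytes

# If every big write is completely full of new dirents, the insert
# rate is just bandwidth divided by key size:
inserts_per_sec = WRITE_BW // DIRENT_SIZE
print(inserts_per_sec)   # 1638400 - on the order of a million/sec
```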

So, that's why coalescing is important. So what I care about for phtree
is how much coalescing we can do; and, are there workloads that force us
to scribble all over a multi gigabyte index?

It's obviously going to be at somewhat of a disadvantage compared to a
pure compacting db - it's got a tree structure, stuff's in the leaves
and it branches based on a hash value. That isn't necessarily a big deal
though if your leaf nodes are big enough (this is the approach bcache
takes).

This is where Daniel is going to have to chime in though :)
