PHTree performance analysis

Daniel Phillips daniel.raymond.phillips at gmail.com
Sun Jan 6 17:19:11 PST 2013


Thanks for the comments, Kent.

On Sun, Jan 6, 2013 at 12:12 PM, Kent Overstreet <koverstreet at google.com> wrote:
> I'm less concerned with lookup speed, because I think for our purposes
> the on disk format isn't that important here. The fact that disk IO is
> so much slower than CPUs these days means that even if your on disk
> format is terrible w.r.t. lookup performance, you can afford to do a lot
> of work when you read in a node/metadata block/whatever to fix that.

Well, in-cache performance - the common case - is entirely CPU bound,
so CPU efficiency needs to be good too, at least on the hot path.

> What this means is an index specialized for dirents is going to have
> some advantages (due to not needing a second index for telldir id ->
> dirname), but it's going to have some disadvantages and the crucial
> question is going to be whether they're worth it.

HTree gets away without a secondary telldir ordering by ordering on
the hash key, which is computed at higher precision when telldir is
involved. There is also special and elaborate handling for a telldir
cookie that happens to land on a hash collision - which may be quite
frequent, because the 31 bit tell cookie cannot hold the higher
precision hash. Anyway, it is scary but it works.
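
To illustrate the collision problem concretely, here is a toy sketch
(hypothetical names and bit layout, not the actual HTree code) of a
31 bit cookie derived from a wider hash:

#include <stdint.h>
#include <stdio.h>

/* Toy sketch, not actual HTree code: only 31 bits of a wider
 * directory hash fit in the telldir cookie, so names whose hashes
 * agree in those bits collide as far as telldir is concerned, even
 * when their full precision hashes differ. */

static uint32_t tell_cookie(uint64_t full_hash)
{
        /* keep the high 31 bits; the top bit stays clear so the
         * cookie remains positive when stored in a signed off_t */
        return (uint32_t)(full_hash >> 33);
}

int main(void)
{
        uint64_t a = 0x123456789abcdef0ULL;
        uint64_t b = a ^ 0x1ffffffffULL;  /* differs only in low 33 bits */

        /* both entries resume from the same cookie, so readdir must
         * walk the whole collision run to avoid skips or repeats */
        printf("cookie a: %08x\ncookie b: %08x\n",
               tell_cookie(a), tell_cookie(b));
        return 0;
}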

The big thrashing issue that comes up with HTree is that hash order
does not correspond to inode table order, causing thrashing in the
inode table. The inode table cache footprint is larger than the HTree
index footprint by a factor of roughly 1600, which is why HTree does
not need to worry about thrashing the index itself. And the only
reason anybody cares about thrashing the inode table is that HTree
made directories with millions of entries practical.
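
To sanity check that factor with back of envelope numbers (the sizes
below are my assumptions about a typical setup, not measurements):

#include <stdio.h>

int main(void)
{
        double inode_size = 256;       /* bytes per inode, ext3/4 style */
        double index_entry = 8;        /* HTree: 4 byte hash + 4 byte block */
        double dirents_per_block = 50; /* assumed average for medium names */

        /* one index entry covers a whole dirent block, so the index
         * footprint per directory entry is a fraction of a byte */
        double index_per_dirent = index_entry / dirents_per_block;

        printf("index bytes per dirent: %.2f\n", index_per_dirent);
        printf("inode table vs index: %.0fx\n",
               inode_size / index_per_dirent);   /* ~1600 */
        return 0;
}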

> So, as I mentioned I'm not too concerned about lookup performance,

I am. Lookup seems to be about as good as it can be with PHTree, if
we accept the extra level of index nodes vs HTree as a necessity. But
I would not want to see it get a lot worse.

> so all that's really left is insertions (deletion is similar enough to
> insertion from a performance analysis pov that I don't care).

Deletion is nasty too, and thrashes the index the same as insertion.
But deletion does have one advantage over insertion: no need to do an
extra lookup for an existence check.

> With insertions, you've got to update the index at some point, so disk
> IO is going to be the dominating factor.

Very true.

> To do a complete performance analysis, we can just about count the
> number of IOs required per insert and ignore everything else.

Actually, the biggest problem is write duplication, where any given
index block is written multiple times, up to its average branching
factor (approximately, because splitting and index growth complicate
the equation). A btree index that exceeds cache by a large multiple
hits the maximum: every index block is written as many times as it
has entries, because some dirty block must be flushed to disk each
time a random block is read in. How close a load comes to that worst
case depends on the ratio between cache size and index footprint.
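
A crude model of the effect (a sketch only; the linear interpolation
between the in-cache and out-of-cache extremes is my simplification):

#include <stdio.h>

int main(void)
{
        double branching = 400;  /* entries per index block */

        /* ratio of index footprint to cache size */
        double ratio[] = { 1, 2, 10, 100 };

        for (int i = 0; i < 4; i++) {
                /* fraction of random inserts that miss cache and so
                 * force a dirty block to be flushed */
                double miss = 1 - 1 / ratio[i];

                /* expected writes per index block over its lifetime,
                 * rising from 1 toward one write per entry */
                double writes = 1 + miss * (branching - 1);

                printf("footprint %4.0fx cache: ~%3.0f writes per block\n",
                       ratio[i], writes);
        }
        return 0;
}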

Even when the index fits in cache, we have another problem to worry
about: periodic cache writeout is also a write multiplier. In fact it
will tend to dominate IO, because shortly after any block is flushed
to disk it is likely to be dirtied again. This is a particularly bad
problem for Tux3, which wants to flush out all dirty cache every
delta. While that is fine for most loads, it is obviously bad for
this one, and we most probably need to do some thinking about other
loads that can behave similarly.
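
A toy calculation of that multiplier (numbers assumed purely for
illustration; I have not measured the actual Tux3 delta rate under
this load):

#include <stdio.h>

int main(void)
{
        double load_seconds = 600;   /* a ten minute insert storm */
        double deltas_per_sec = 4;   /* assumed delta flush rate */

        /* a hot index block redirtied right after each flush gets
         * written once per delta instead of once at the end */
        printf("writes per hot index block: %.0f\n",
               load_seconds * deltas_per_sec);
        return 0;
}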

> And since dirents are so much smaller than your optimum write size, this
> means the dominating factor on number of IOs required per insert is
> going to be the amount of coalescing you can do.

The branching factor is about 400 for PHTree, and can be raised to
over 800 with some simple compression tricks in the deepest tree
level.
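
The 400 falls out of simple arithmetic on assumed record sizes (my
numbers, not the exact PHTree layout):

#include <stdio.h>

int main(void)
{
        double block = 4096;  /* index block size */
        double entry = 10;    /* assumed: ~6 byte key + 4 byte pointer */
        double packed = 5;    /* assumed entry size after compression */

        printf("branching factor: %.0f\n", block / entry);  /* ~400 */
        printf("with compression: %.0f\n", block / packed); /* ~800 */
        return 0;
}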

> Conventional data structures - btrees, hash tables - have the problem
> that the place a key has to reside in the index (on disk) is very highly
> determined. This means that doing random insertions/updates to your
> index basically forces you to constantly rewrite the entire thing.

Right.

> (This is what Daniel was getting at with cache issues, but IMO that's
> the wrong way to think about this problem; you may have enough memory to
> keep the entire index dirty in memory, but you still need to efficiently
> persist changes unless it's for an application where performance
> synchronizing the index on disk doesn't matter - in which case, why are
> you using an on disk data structure at all?)

Two different but related issues, both serious.

> So, in recent years this has spawned at least a couple new schemes -
> they're called compacting data structures. Basically, all
> insertions/updates are grouped together, and they lazily percolate
> up/out as your small mutable region fills up.
>
> So, I'm going to _completely handwave_ here and say the optimal
> performance for insertions/updates w.r.t. disk IO is O(n*log(n)) IOs per
> insert, where each IO is completely full - i.e. we're doing the maximal
> amount of compaction possible.
>
> This is also assuming we have a reasonable amount of room to coalesce,
> obviously if we have a workload where it's
>
> insert();
> sync();
>
> that's going to be a problem.
>
> But assuming we don't have forced syncs, it is possible to do inserts
> basically as fast as we can write 512k blocks or whatever full of keys
> out to disk. And, these databases have been around a bit to demonstrate
> that that is more or less possible in practice.
>
> So, that's why coalescing is important. So what I care about for phtree
> is how much coalescing we can do; and, are there workloads that force us
> to scribble all over a multi gigabyte index?

This style of index (fractal tree is an example?) is getting more
interesting to me now that I have had my nose in the issue for a
while. I do have in mind an alternative solution which has the
potential to rescue the bad situation with btrees. (Note: this issue
applies to all btree directory schemes, not just PHTree, and not just
hash keyed trees.) The idea is to flush out the index as slowly as
possible, in hash order, while the index is being built, then fully
flush it when things quiet down. Directory blocks not yet indexed are
marked as such, recursively in a tree, so that the state bubbles up
and a crashed system can know efficiently which blocks need to be
re-indexed; a sketch follows below. This could still lead to longish
restart times before quick access to the directory is restored, but
it would not be nearly as bad as a fsck, for example. A further
problem is that we would need to teach Tux3 how to leave some blocks
dirty in cache while flushing others, which it has no idea how to do
at the moment. But I am sure this is far from the only case where we
want to flush some parts of cache and not others.
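
Here is how I picture that bookkeeping, as a sketch only (none of
this exists in Tux3 today): a small radix tree whose interior nodes
keep a bitmap of which children still contain unindexed directory
blocks, so crash recovery descends only into subtrees with a bit set
instead of scanning the whole directory.

#define FANOUT 64

struct pending_node {
        struct pending_node *child[FANOUT];
        unsigned long long pending;  /* bit set: child has unindexed blocks */
};

/* Called when the last unindexed block under a leaf has been indexed:
 * clear the child's bit in its parent, then keep clearing upward
 * while a node is left with no pending children at all. */
static void clear_pending(struct pending_node **path, int *slot, int depth)
{
        for (int level = depth - 1; level >= 0; level--) {
                path[level]->pending &= ~(1ULL << slot[level]);
                if (path[level]->pending)
                        break;  /* siblings still pending; stop bubbling */
        }
}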

> It's obviously going to be at somewhat of a disadvantage compared to a
> pure compacting db - it's got a tree structure, stuff's in the leaves
> and it branches based on a hash value. That isn't necessarily a big deal
> though if your leaf nodes are big enough (this is the approach bcache
> takes).
>
> This is where Daniel is going to have to chime in though :)

Chimed enough for you?

Daniel



