PHTree design refresh
phillips at phunq.net
Fri Jan 4 15:44:12 PST 2013
This updates the previously posted PHTree design documentation:
http://lwn.net/Articles/291354/ "Tux3, a Versioning Filesystem"
As before, I describe PHTree in terms of HTree, which is actually a component
of PHTree in a slightly altered form. As before, I do not describe HTree
itself in the detail it really needs. That will be a project for later.
This refresh makes some minor and at least one big change and goes into more
detail than the above. The following design details were changed:
* Full 32 bit hash key, compared to HTree's 31 bits.
* Collision bit is moved from the low bit of the hash (HTree) to an unused
high bit of the block pointer.
* All dirent blocks now referenced by a rightmost partition of the phtree,
overloading the maximum hash value (-1U) as a key, and distinguishing these
special entries in the pleaf layer by overloading the collision bit, which is
otherwise unused in the pleaf layer.
* The important concept of size tagging, or lazy tags is explained more
fully. This is a free record tracking accelerator.
Hash keyed directory indexing was possibly pioneered by HTree (certainly on
Linux) and has since been adopted by many or most modern filesystem designs.
Its major advantage over traditional text-keyed btree indexing is high tree
fanout on the order of hundreds of entries per tree block. A second advantage
is fixed key size well suited to fast binary searching.
The biggest drawback to hash keying is that entries, or directory blocks in
the case of HTree, are keyed in intentionally random order, so that the
typical cache load becomes the entire index under mass operations. Worse, for
mass operations that involve the inode table, the cache load may be all the
inode table blocks referenced by the directory. The inode cache is typically
restricted to a fairly small size, so mass random operations can drop from
cache speed to media access speed, orders of magnitude slower.
A further difficulty with btree indexes in general is that directory traversal
via the Posix getdents interface must be in key order, otherwise entries
involved in leaf splits may be missed or repeated in the traversal, which
Posix does not allow. Compounding the issue, Posix supports only a signed long
telldir result, effectively 31 bits, or 63 bits on a 64 bit system or with 64
bit file support. Hash keys may collide, so directly using them as telldir
stream positions would not be reliable. Various elaborate solutions to this
problem have been proposed and implemented. HTree includes several hundred
lines of code devoted to dealing with this issue.
The in-cache performance advantages of hash keying generally outweigh its
drawbacks, so the technique remains popular, and typically works well up
into the region of several million names per directory. Ext3, Ext4 and
Lustre all use HTree. The latest filesystem designs attempt to handle much more
extreme loads, into the billions of names, a scale at which cache thrashing
becomes a real problem.
PHTree ("Physically stable HTree") is a successor to the HTree design and
partly includes HTree. PHTree is intended to scale comfortably into the range
of billions of files per directory, while trading off only a small amount of
performance versus HTree for small, entirely cachable directory loads. PHTree
uses a hash keyed index like HTree, however an extra layer of index blocks is
added to index individual directory entries as opposed to the per-block HTree
strategy. This increases the number of index blocks by the average branching
factor, so a PHTree index is much larger than an HTree index, but still only
amounts to about a third the contents of a directory. (HTree is unusually
lightweight, with only a fraction of a percent of each directory being index
blocks.)
The lowest layer of a PHTree index consists of "pleaf" blocks. Each entry in a
pleaf indexes some dirent block containing a name matching a given hash value.
So for every directory entry there is a pleaf entry. The purpose of the pleaf
layer is to "swizzle" the random hash index order to match the physical dirent
storage order. With PHTree, dirent blocks are never split, pleaf blocks are
split instead. This makes the PHTree dirent blocks "physically stable", that
is, each entry will stay at the same offset in the same dirent block from
creation to deletion. The byte (or four byte) file offset of the entry may
therefore be used for directory positioning, solving the difficult Posix telldir
issue tidily by exactly capturing the semantics of the earliest Unix
filesystems, on which the current Posix semantics are based.
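The stable-offset property can be illustrated with a toy model. The following sketch is illustrative only (real dirent blocks reuse freed records and handle block-full conditions; these names are not Tux3 code): because blocks never split, an entry keeps its position from creation to deletion, and that position can serve directly as a telldir cookie.

```python
# Toy model of "physically stable" dirent blocks: blocks are never split, so
# an entry keeps its (block, offset) position from creation to deletion, and
# that position can serve directly as a telldir cookie.
# All names here are illustrative, not actual Tux3 structures.

BLOCK_SIZE = 4096

class DirentBlock:
    def __init__(self):
        self.entries = {}    # offset -> name
        self.next_free = 0   # bump allocator; a real block reuses free records

    def add(self, name):
        offset = self.next_free
        self.entries[offset] = name
        self.next_free += 8 + len(name)   # simplistic record size
        return offset

    def delete(self, offset):
        del self.entries[offset]          # other entries do not move

class Directory:
    def __init__(self):
        self.blocks = [DirentBlock()]

    def create(self, name):
        block_no = len(self.blocks) - 1   # block-full handling elided
        offset = self.blocks[block_no].add(name)
        return block_no * BLOCK_SIZE + offset   # stable telldir position

d = Directory()
pos_a = d.create("alpha")
pos_b = d.create("beta")
d.blocks[0].delete(pos_a % BLOCK_SIZE)    # deleting "alpha"...
pos_c = d.create("gamma")                 # ..."beta" keeps its offset
```

Deleting "alpha" leaves "beta" at the same offset, so a traversal positioned at "beta" by telldir resumes correctly.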
The upper layers of a PHTree are an HTree, slightly improved to support full
32 bit hash keys as opposed to HTree's 31 bit keys. The collision bit still
performs the same important function in the HTree part of PHTree, however it
moves to a slightly different location, and as a side effect, adds another few
lines to adjust the result of the node binsearch algorithm in the presence of
collisions. (Handled in a cute way in HTree, that does not justify losing one
bit of hash.)
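The binsearch adjustment can be sketched as follows. This is a schematic illustration only: the 48-bit pointer width, array layout, and the node_search helper are assumptions, not actual Tux3 structures.

```python
# Sketch of the node binsearch adjustment mentioned above, with the collision
# bit kept in an unused high bit of the block pointer as PHTree does.

COLLISION = 1 << 47   # assumed spare high bit of a logical block pointer

def node_search(keys, ptrs, target):
    """Position at the leftmost subtree that may contain hash `target`.

    keys[i] is the lowest hash reachable through ptrs[i]; a pointer with
    COLLISION set means its first hash value continues from the previous
    subtree, so the search must back up to include that subtree too.
    """
    lo, hi = 0, len(keys)
    while lo < hi:                  # rightmost entry with key <= target
        mid = (lo + hi) // 2
        if keys[mid] <= target:
            lo = mid + 1
        else:
            hi = mid
    pos = lo - 1
    # The extra few lines: step left across collision continuations.
    while pos > 0 and keys[pos] == target and ptrs[pos] & COLLISION:
        pos -= 1
    return pos

keys = [0x00000000, 0x40000000, 0x40000000, 0x90000000]
ptrs = [10, 11, 12 | COLLISION, 13]
```

Here a probe for 0x40000000 lands on the third entry by binsearch, sees the collision flag on its pointer, and backs up one slot so the split hash range is searched from its leftmost block.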
Because PHTree traverses directories in physical storage order, the inode
table is accessed in roughly the order that directory entries were created.
This should be roughly linear, giving a tiny cache footprint for mass
directory operations. We do need to consider aging effects as directory entries
are created and deleted. Ideally, available space for a new directory entry
can be found that corresponds roughly to the position of its inode number
within the range of inodes covered by the directory. The extent to which this
correspondence can be maintained governs the cache footprint of mass
operations.
PHTree can dispense with the elaborate workaround HTree uses to implement
reliable telldir/seekdir. But PHTree needs to take care of a new task: free
space management. HTree avoids this issue entirely because new space is always
obtained by splitting full blocks. In contrast, PHTree needs to keep track of
all free records in a directory and reuse them in preference to expanding the
directory, so that free space does not leak. PHTree also needs to prefer
free space near some particular goal offset determined by a new entry's inode
number, to maximize linear correspondence between the directory and the inode
table. PHTree's free space management must scale to multi-gigabyte directory
size. A linear free space search through the entire directory clearly will not
scale.
PHTree indexes free space using a new technique called "lazy tags", described in
(slightly) more detail below. To avoid introducing a separate index structure
just to index free space, PHTree stores the freespace index in the hash tree
itself, as a small partition at the right edge of the tree that overloads the
maximum hash value. Each dirent block has one entry in the pleaf level of the
freespace index. The higher level index blocks are not affected by this, and do
not need to know about it. Overloading the maximal hash value does not cause
problems for hash entries hashing to that value, because freespace entries are
strictly to the right of all hash entries at the pleaf level. Iterating a hash
probe across collisions stops when it reaches the first entry marked as a
freespace entry. As a nicety, HTree's hash collision bit, used to mark a range
of hash values where the first value of the range is split across two index
blocks, is repurposed to mark a freespace index entry, because entries in pleaf
blocks are exact hash values, not ranges.
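The pleaf-level encoding described above can be sketched like this. Field widths, helper names, and the placement of the flag bit are assumptions for illustration, not on-disk formats:

```python
# Sketch of pleaf entries with the freespace partition at the right edge of
# the tree: freespace entries overload the maximal hash value (-1U) and
# repurpose the otherwise unused pleaf collision bit as a "freespace" flag.

MAX_HASH = 0xffffffff
FREE_FLAG = 1 << 47   # assumed spare high bit of the block pointer

def make_hash_entry(hash_, block):
    return (hash_, block)

def make_free_entry(block):
    return (MAX_HASH, block | FREE_FLAG)   # keyed at -1U, flagged free

def is_free_entry(entry):
    return bool(entry[1] & FREE_FLAG)

def probe_collisions(pleaf, start, target):
    """Yield dirent blocks for entries matching `target`, stopping at the
    first freespace entry: freespace entries sort strictly to the right of
    all hash entries, so even names hashing to -1U terminate correctly."""
    for hash_, ptr in pleaf[start:]:
        if hash_ != target or ptr & FREE_FLAG:
            break
        yield ptr

pleaf = sorted([make_hash_entry(0x1234, 7), make_hash_entry(0x1234, 9),
                make_free_entry(7), make_free_entry(9)])
```

The sort order puts both freespace entries after every hash entry, so iterating a collision run never wanders into the freespace partition.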
Mapped into a file
Like HTree, PHTree blocks are part of a directory file and PHTree block
addresses are logical block offsets within that file, as opposed to physical
block offsets within a volume used by the inode table and data file trees.
Just like any file, a directory file is mapped into a page cache; the radix
tree for a file is shallower and denser than the volume cache (volmap) radix
tree. Against this, a cache miss in a file cache must probe a data tree in the
volmap to load the missing block. So we have a double probe on miss compared to
a fast probe on hit. This is usually a solid win for logical mapping.
For PHTree, there is an even more important advantage: the file btree
effectively is a second directory index that imposes a stable ordering on
directory blocks. (PHTree calls this "physical" ordering, really only physical
in comparison with the "logical" ordering of the hash tree.) PHTree's stable
block ordering implements directory traversal in a traditional way, which is
arguably also the simplest and fastest.
There are also administrative advantages to logical mapping: the entire
overhead of a directory with its index can be accounted to a single file. If
recovery from filesystem damage is needed, index reconstruction can focus on a
single file, helping to factor reconstruction into a physical and a logical
phase, simplifying both and allowing for more meaningful interaction with the
user.
Transition between unindexed and indexed directory
In terms of search complexity, the crossover point where indexed searching is
faster is at two blocks, just like HTree, though HTree clearly wins at two
blocks and PHTree only manages a tie, with a win at three blocks. To be
precise, with two dirent blocks in a directory, PHTree always does one binary
search in a pleaf and one linear search of half of one dirent block on
average, whereas a linear search would cover half of the dirent blocks, or in
other words, an average of one block. PHTree ekes out a small win over linear
search. However, linear search probes one block on average, while PHTree
and HTree always probe two. Since a block probe is just a highly optimized
radix tree lookup when a block is already in cache, this amounts to a near tie
between PHTree and linear search for the warm cache case. Linear search edges
out both HTree and PHTree in the cold cache case, but not by much. For both
PHTree
and HTree, it is therefore convenient to add the directory index only when the
first, linearly searched dirent block becomes full.
Unlike HTree, the root of the index is the second block of the directory
instead of the first, to preserve telldir offsets in case a directory traversal
was in progress when the index is added.
Initially, the root of a PHTree index is a pleaf block. When the pleaf block
becomes full, the pleaf is replaced by an htree root index block, which
remains in place for the rest of the life of the directory. To probe the
directory, if it is three or more blocks long, read the second block and check
the header to see whether it is a tree root or a pleaf. If a root, probe
through the number of tree levels to reach a pleaf. Binary search the pleaf
just like any higher level node (it has the same format with slightly different
interpretation) then load and search each referenced dirent block that has the
target hash code until a different hash code is found, or the right edge of the
btree, indicated by a maximal hash code with collision bit set (this indicates
the block index partition of the btree; no other pleaf entry has the collision
bit set). If this search reaches the end of a pleaf without a match, advance
the path to the next pleaf and continue the search.
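The probe described above can be condensed into a schematic sketch. Plain Python lists stand in for on-disk blocks, the root-or-pleaf header check and the collision-bit back-up are elided, and all names are illustrative:

```python
# Schematic PHTree probe: binsearch the root for the covering pleaf, binsearch
# and scan the pleaf, search each referenced dirent block with the target
# hash, and advance to the next pleaf if the matching run continues past the
# pleaf's right edge.

from bisect import bisect_right

def lookup(root_keys, pleafs, dirent_blocks, name, hash_):
    # Root level: rightmost subtree whose first key is <= hash_.
    i = max(bisect_right(root_keys, hash_) - 1, 0)
    while i < len(pleafs):
        matched = False
        for h, blk in pleafs[i]:
            if h == hash_:
                matched = True
                if name in dirent_blocks[blk]:
                    return blk          # found: dirent block number
            elif h > hash_:
                return None             # past the hash: not present
        if not matched:
            return None
        i += 1                          # run may continue in the next pleaf
    return None

root_keys = [0x0000, 0x8000]
pleafs = [[(0x1000, 0), (0x7fff, 1)], [(0x8000, 1), (0x9000, 2)]]
dirent_blocks = [{"alpha"}, {"beta"}, {"gamma"}]
```

Each pleaf entry points at one dirent block guaranteed to contain some name with that hash, so only blocks in the matching run are ever loaded.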
All directory index schemes have cache issues. The fundamental problem is that
the order imposed by the index may not correspond well to the objects
referenced by leaf entries. HTree has the problem that inode numbers in hash
order do not correspond well to inode table order. PHTree fixes this problem
but trades it for the new problem that pleaf entries are in random order with
respect to directory entries, compounded by the fact that the pleaf layer of
the index tree is roughly two orders of magnitude larger than all the HTree
levels combined. It seems that PHTree is an improvement over HTree
nonetheless, but analysis is needed to support that.
With PHTree we need to worry about whether the entire pleaf layer fits in
cache, because in heavy create, delete and lookup loads it is likely that the
entire layer will be in the working set. Create and delete will also tend to
dirty all the cached pleaf blocks continuously, causing continuous writeout,
and therefore, continuous block forking which may as much as double the cache
load.
This is particularly problematic given Linux's habit of limiting dirty cache
to substantially less than total cache. As an example, consider a billion
entry directory, which requires roughly 20 GB of directory entry blocks (20
bytes per dirent) plus 10 GB of pleaf blocks (8 bytes per entry and 25% slack
space on average) plus about 55 MB for the HTree part of the index. The cache
load is dominated by dirty pleaf blocks, because dirent blocks are accessed
sequentially giving a working set of one. Holding 10 GB of dirty blocks in
Linux's restricted cache is a lot even for a high spec machine. The probable
result is that mass operations drop from cache update speed to media update
speed, which may be orders of magnitude less.
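The billion-entry figures above can be checked back-of-envelope. The 4096-byte block size and 16-byte htree node entries are assumptions chosen here to reproduce the stated numbers; the per-dirent and per-pleaf-entry sizes are the document's own:

```python
# Back-of-envelope check of the billion-entry cache footprint figures.

entries = 10**9
dirent_bytes = entries * 20                # 20 bytes per dirent
pleaf_bytes = entries * 8 / 0.75           # 8 bytes per entry, 25% slack
pleaf_blocks = pleaf_bytes / 4096          # assumed 4 KB blocks
htree_bytes = pleaf_blocks * 16 / 0.75     # assumed 16-byte node entries

print(round(dirent_bytes / 1e9), "GB of dirent blocks")   # 20 GB
print(round(pleaf_bytes / 1e9), "GB of pleaf blocks")     # ~11 GB
print(round(htree_bytes / 1e6), "MB of htree nodes")      # ~56 MB
```

The htree part is three orders of magnitude smaller than the pleaf layer, which is why the pleaf layer dominates the dirty cache load.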
This is thrashing. A first order solution is to thrash as fast as possible, by
writing linearly and reading linearly. PHTree is much better than HTree
because the pleaf layer is an order of magnitude smaller than the inode table
block working set that HTree thrashes. Another clear advantage for PHTree is
that only mass create and delete will thrash. Usually, mass create at this
scale is only seen in benchmarks, and delete could easily be backgrounded.
Mass loads that change just a few directory entries and mainly access or
update the
file content or attributes will exhibit excellent cache performance.
We may be able to improve the thrashing situation considerably...
Free Record Search
All the blocks of the phtree are indexed by overloading the maximal hash
value, forming a partition of the btree with dirent indexes on the left
and (a very narrow sliver by comparison) block indices on the right. In the
upper level HTree nodes, the two varieties are indistinguishable, but in the
pleaf layer they are distinguished by overloading the "collision" bit of the
entry, which otherwise serves no purpose because each pleaf entry references a
single block in which a dirent matching the pleaf entry's hash field is
guaranteed to be found.
A new entry is added to the block "partition" of the phtree each time the
directory is expanded with a new dirent block.
The maximum free dirent record size of each subtree of the block partition of
the phtree is stored as "size tags" in otherwise unused bits of the block
pointers. (Note that upper level phtree index nodes shared by both the block
partition and index partition carry tags. For index nodes referencing only the
"left" partition of pleaf blocks, these bits are always zero.)
For size tags, the following invariant must be maintained across all index
updates: the size tag is never less than the actual largest free dirent in the
subtree. If this invariant is violated, then free records may leak.
Size tags are maintained lazily, hence they may also be called "lazy tags".
This means that the actual largest free record of a subtree may be less than
the tag indicates, therefore a search for space in the subtree may fail. Each
time a search in a subtree fails, the size tag of the subtree root pointer is
updated to the actual largest free record size. (Optionally, this could be
propagated up the access path, but it is not clear that any efficiency
would be gained.) It is expected that successful searches greatly outnumber
failed searches, with consequent tag updates. After a failed search, the
search continues somewhere else, usually in the next sibling tree to the
right, having ensured that the same subtree will not be searched again for the
same record size, by updating the tag.
Size tags are recursive, so a search for available space of a given size is
O(tree_depth), unless many failures are encountered. It is hoped that the
average case remains very near O(tree_depth), even considering failures and tag
updates, which are supposed to be rare. If this proves not to be the case,
then the tag update algorithm can be made less lazy.
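The lazy invariant and the repair-on-failure step can be sketched with a toy tree. The Leaf/Node structures and the find_space helper are hypothetical illustrations, not Tux3 code:

```python
# Sketch of a lazy ("size") tag search: each child pointer carries a tag that
# is never less than the largest free record in its subtree.  A failed probe
# repairs the stale tag so the subtree is not retried for the same size.

class Leaf:
    def __init__(self, free_sizes):
        self.free_sizes = free_sizes    # free records in one dirent block
    def largest(self):
        return max(self.free_sizes, default=0)

class Node:
    def __init__(self, children):
        self.children = children        # list of [tag, child] pairs
    def largest(self):
        return max(tag for tag, _ in self.children)

def find_space(node, size):
    """Return a leaf with a free record >= size, repairing stale tags."""
    if isinstance(node, Leaf):
        return node if node.largest() >= size else None
    for slot in node.children:
        if slot[0] >= size:             # tag says the subtree may fit
            found = find_space(slot[1], size)
            if found:
                return found
            slot[0] = slot[1].largest() # failed: fix the stale tag
    return None

a = Leaf([4])                     # actually has only a 4-unit record
b = Leaf([24])
root = Node([[32, a], [32, b]])   # a's tag is stale: says 32, really 4
hit = find_space(root, 16)
```

The failed probe into the first subtree repairs its tag to 4 before the search moves right, so a second search for the same size skips that subtree immediately.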
Size Tag Encoding
For this application, it is clear how tags should be encoded. Dirent
allocation granularity is four bytes, so the encoding is in those units. The
smallest value is zero and the size required for the largest possible name is
(255 >> 2). So 64 distinct values are required for the encoding, or six bits.
Besides encoding the maximum record size, we may want to record something
about the allocation density of each block. This does not need to be exact, a
logarithmic scale would be good enough for the kinds of space optimization
algorithms we envisage. One idea is to encode 0..127 exactly, and 128..max
are powers of two, e.g.:
space = code < 128 ? code : (1 << code - 121)
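The suggested code can be sketched both directions. The encode helper is an assumption (the document gives only the decode formula); it rounds up, preserving the invariant that a tag never understates the largest free record in a subtree:

```python
# Sketch of the size-tag code suggested above: codes 0..127 are exact sizes
# (in 4-byte dirent allocation units), larger codes are powers of two.

def decode(code):
    return code if code < 128 else 1 << (code - 121)

def encode(space):
    if space < 128:
        return space
    code = 128
    while decode(code) < space:   # round up to the next power of two
        code += 1
    return code
```

Note the continuity at the boundary: decode(128) is 1 << 7 = 128, so the exact and logarithmic scales meet without a gap.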
Lookup
Probe to the leftmost dirent block that matches the hash of the name. Search
the block for the name. If not found, pop the access path until not at the end
of an index table, then if the next entry does not reference the same hash,
fail the search. Otherwise, advance the path position by one entry. Probe back
to the dirent level. Repeat the search and advance steps until the entry is
found.
Create
Find a dirent block with space for a new dirent record of the required size.
Use the pointer tags to accelerate the search. Otherwise, probe recursively
through the levels following the first entry at each level that is tagged as
having a sufficiently large free record. If an index block has no tags
indicating sufficient space or the dirent level is reached and a scan shows
insufficient space, pop the access path, reset the tag to indicate actual
largest entry found, and advance through the index table to the first entry
tagged with sufficient space. If the end of the root table is reached, then
append a new, empty dirent block to the end of the directory.
Having found or created a dirent block with sufficient space, create the entry.
Then add a pleaf index entry. Probe to the pleaf block that contains the
(?)last entry exactly matching the hash. If the pleaf is full, split it and
recursively walk up the path splitting upper level blocks that reference the
original leaf. Stop when an index block does not need to be split. Add the new
entry to the appropriate pleaf block.
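The split step above can be sketched with a toy model in which the pleaf level is a flat list of lists; the descent and the upward propagation through htree nodes are collapsed into a single parent, and capacities are tiny for illustration:

```python
# Sketch of insert with pleaf splitting: dirent blocks never split, but a
# full pleaf splits in half and the parent gains a new entry.

from bisect import insort

CAPACITY = 4   # entries per pleaf; real blocks hold hundreds

def insert(pleafs, hash_, block):
    """Insert (hash_, block), splitting the target pleaf when full."""
    # Find the pleaf whose key range covers hash_: the last pleaf whose
    # first key is <= hash_, mirroring the htree descent.
    i = 0
    while i + 1 < len(pleafs) and pleafs[i + 1][0][0] <= hash_:
        i += 1
    insort(pleafs[i], (hash_, block))
    if len(pleafs[i]) > CAPACITY:          # split; parent gains a new key
        half = len(pleafs[i]) // 2
        pleafs.insert(i + 1, pleafs[i][half:])
        pleafs[i] = pleafs[i][:half]

pleafs = [[(1, 0), (2, 0), (3, 0), (4, 0)]]
insert(pleafs, 5, 0)                       # overflows: pleaf splits in two
```

Only the pleaf splits; the dirent block the new entry points at is untouched, which is exactly what keeps dirent positions physically stable.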
Delete
Probe for the name. Delete the name from the dirent block found. If the free
record thus obtained (possibly including records above and below the deleted
record) is larger than the free tag of the parent block in the path, reset the
free tag to the new size and continue doing this up the access path as far as
necessary. Using the access path, remove the index entry from the pleaf. If
the pleaf is now empty, continue up the access path removing index entries
until an index block is not empty.
As in HTree, compute the hash and descend through depth count levels as
recorded in root. At each level, binsearch through N-1 keys that separate N
block pointers to subtrees to position at the entry with hash key less than or
equal to the target hash. Perform the same search at the pleaf level but now
the key must match one entry exactly, or a group of equal entries, with
pointers to dirent blocks. Search the one or more dirent blocks for the target
dirent. If the entry found in the lowest level htree index block is followed
by a continuation hash, advance to the next pleaf block and continue searching
through dirent blocks until the entry is found.