Block Handles

Daniel Phillips phillips at phunq.net
Thu Aug 16 22:28:39 PDT 2012


This note describes a new mechanism we propose to try in Tux3. At the end of this work,
should we decide to undertake it, Tux3 should work the same as it does now, but slightly
more efficiently and with much clearer code paths, to provide a better base for some of
the more sophisticated improvements we have planned.

Block Handles

The goal is to replace all usage of struct buffer_head in Tux3 with mechanisms better
suited to the needs of a modern filesystem.

Summary of functionality provided by buffer_head:
  1. Act as a data handle for one block of disk data
  2. Lock a cached block for modification - lock_buffer/unlock_buffer
  3. Record buffer state - clean (uptodate), dirty, invalid (not uptodate and not dirty), others
  4. Block IO transfer to/from media - bread, flush
  5. Cache physical block address - b_blocknr
  6. Part of API between VFS and a particular filesystem (address_space_operations.bmap)

Some of this functionality no longer requires buffer_head in any essential role. In particular
IO is now driven by struct bio and the traditional buffer oriented API is now layered on top
of that via the page cache. Physical block addresses may be cached in any way a particular
filesystem desires, and because this caching typically sits behind the page cache, repeated
cache hits are relatively rare to the point it is not obvious that physical block addresses
are worth caching at all, versus looking up the physical address in the cached metadata
whenever required. This is particularly true in modern redirect-on-write filesystems where
the physical address of a block normally changes each time it is written to disk. Finally,
the API role of buffer_head is relegated to the kernel block IO library rather than core
kernel, so that buffer_heads are required for interfacing only to library code that was
directly or indirectly called from the filesystem.

This situation (are we sure it is a fact?) opens the door to removing buffer_head entirely
from a given filesystem implementation and replacing it with alternate, hopefully
superior mechanisms.

The "block handles" API is proposed for internal use in Tux3 to replace the first three
roles of buffer_head above in a more efficient, more flexible and hopefully more obviously
correct way. The functionality we need to capture in the new API is:

  1. Per block data handle
  2. Per block lock
  3. Per block state

Tux3 has multiple dirty states: the dirty state of a cache block is associated with the
"current" delta, the filesystem view provided via the VFS to the user, and possibly one
of several deltas in process of being transferred to disk. We would like to be able to
tell by looking only at the block state, in which delta, if any, the block is dirty. In
addition, not present (empty) and present (clean) states are represented.
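
As a rough illustration of how such a state could be read, here is a minimal sketch in C. The
numbering is purely an assumption made for this example (0 for empty, 1 for clean, 2 and up for
"dirty in delta slot n"), as is the modulo mapping from delta number to slot; all names are
hypothetical and none of this is taken from Tux3 code. With three bits and two non-dirty states,
this particular encoding leaves six dirty values.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical 3-bit scalar block state; the numbering is an assumption. */
enum block_state {
	BLOCK_EMPTY = 0,	/* not present in cache */
	BLOCK_CLEAN = 1,	/* present, not dirty */
	BLOCK_DIRTY = 2,	/* BLOCK_DIRTY + n means dirty in delta slot n */
};

enum {
	BLOCK_STATE_BITS = 3,
	BLOCK_STATE_MAX = (1 << BLOCK_STATE_BITS) - 1,		/* largest 3-bit value */
	BLOCK_DELTA_SLOTS = BLOCK_STATE_MAX - BLOCK_DIRTY + 1,	/* dirty values left over */
};

/* Map a delta sequence number onto one of the dirty state values (assumed scheme). */
static inline unsigned block_dirty_state(unsigned delta)
{
	return BLOCK_DIRTY + delta % BLOCK_DELTA_SLOTS;
}

static inline bool block_is_dirty(unsigned state)
{
	assert(state <= BLOCK_STATE_MAX);
	return state >= BLOCK_DIRTY;
}

/* Return the delta slot a block is dirty in, or -1 if it is not dirty. */
static inline int block_dirty_slot(unsigned state)
{
	return block_is_dirty(state) ? (int)(state - BLOCK_DIRTY) : -1;
}
```

The point of the sketch is only that "which delta, if any" can be answered from the state value
alone, without consulting any other structure.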

Where block size is smaller than page cache page size, the state of each block cached
in a given page is completely independent. We therefore need an array of state variables.
Four bits is sufficient for Tux3 per-block state: one lock bit and three bits of scalar
state. This is true of both physically indexed ("volmap") and logically indexed (page
cache) data. In fact, the current architecture of Linux's buffer cache maps the
"physical" blocks of a volume into a page cache just like an ordinary file. This
implementation detail is hidden from the user behind the buffer API. However, with Tux3
we propose to avoid the buffer API entirely and deal directly with the page cache that
maps our physical volume.

Because the intent of Tux3 block handles is to obsolete the traditional list of
buffers attached to a cached page (and every direct or indirect use of buffers) the
page.private field may be used for the new per-page block state array. Here we must
make a design decision: should the page.private field contain a pointer to a state
array, or should it directly contain a small state array mapped into the
pointer-sized page.private field? For now, we choose the latter strategy because it
suffices for current configurations we actually support, that is: 4K
page size with at most eight 512 byte blocks per page on a 32 bit architecture, or
sixteen 512 byte blocks on a 64 bit architecture with 8K page size, or sixteen 4K
blocks on a 64 bit architecture with 64K page size. Not every possible blocksize
configuration would be available on every Linux architecture, however we always
have 4K blocksize available on all architectures with this design, which may well
be sufficient for the purpose of architecture independence. Should we later decide
otherwise, the necessary changes to support all possible block size configurations
on all possible architectures are not difficult. The page.private field could
point at a state vector, or provide an offset within a state vector. This design
would add a level of indirection to a hot code path and introduce an additional
requirement for memory management, neither of which is odious, but more complex
and less efficient than the current, more restrictive proposal.
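
To make the packed variant concrete, the following is a minimal sketch of a state array stored
directly in a word the size of page.private (an unsigned long in the kernel). The helper names,
and the choice to put block 0 in the lowest four bits, are assumptions for illustration only. The
sketch operates on a plain value and leaves out atomicity entirely.

```c
#include <assert.h>
#include <limits.h>

enum { HANDLE_BITS = 4 };	/* one lock bit plus three state bits */

/* Handles that fit in one word: 8 with a 32 bit long, 16 with a 64 bit long. */
#define HANDLES_PER_WORD (sizeof(unsigned long) * CHAR_BIT / HANDLE_BITS)

/* Nonzero if every block of a page can have its handle in a single word. */
static inline int handles_fit(unsigned long page_size, unsigned long block_size)
{
	return page_size / block_size <= HANDLES_PER_WORD;
}

/* Extract the 4-bit handle for block index i within the page. */
static inline unsigned get_handle(unsigned long word, unsigned i)
{
	assert(i < HANDLES_PER_WORD);
	return (word >> (i * HANDLE_BITS)) & 0xf;
}

/* Return word with the handle for block i replaced by value. */
static inline unsigned long set_handle(unsigned long word, unsigned i, unsigned value)
{
	unsigned shift = i * HANDLE_BITS;

	assert(i < HANDLES_PER_WORD && value <= 0xf);
	return (word & ~(0xfUL << shift)) | ((unsigned long)value << shift);
}
```

A configuration check such as handles_fit(page_size, block_size) could be made once at mount
time, so that the hot path never needs to consider the pointer-to-vector case.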

Atomic State Change

For now, we assume that full atomicity is required for block handle state changes.
For example, a dirty to clean transition must guarantee not only that the correct transition
is made, but that it is made exactly once and that no intermediate states are visible
to any other cpu. For now, that means that the 3 bit scalar state is only modified
while the 1 bit (sleeping) lock is held:

 1. lock block
 2. update state
 3. unlock block
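
The three steps might look roughly like the following userspace sketch using C11 atomics. It is
not kernel code and differs from the proposal in one important way: it spins where the real lock
would sleep. All names are hypothetical, and the layout (lock in the high bit of each four bit
handle, block 0 in the lowest bits) is an assumption. Because handles for several blocks share
one word, the state update uses a compare-exchange loop so that concurrent changes to
neighbouring handles are not lost.

```c
#include <assert.h>
#include <stdatomic.h>

enum { BH_BITS = 4, BH_LOCK = 0x8, BH_STATE = 0x7 };

/* Step 1: set the lock bit of block i, waiting while another holder has it. */
static inline void bh_lock(atomic_ulong *word, unsigned i)
{
	unsigned long lock = (unsigned long)BH_LOCK << (i * BH_BITS);

	/* fetch_or returns the old word; retry until we are the one who set the bit */
	while (atomic_fetch_or(word, lock) & lock)
		;	/* a kernel version would sleep here instead of spinning */
}

/* Read the 3-bit state of block i. */
static inline unsigned bh_state(atomic_ulong *word, unsigned i)
{
	return (atomic_load(word) >> (i * BH_BITS)) & BH_STATE;
}

/* Step 2: replace the 3-bit state of block i; caller must hold its lock. */
static inline void bh_set_state(atomic_ulong *word, unsigned i, unsigned state)
{
	unsigned shift = i * BH_BITS;
	unsigned long old = atomic_load(word), next;

	assert(state <= BH_STATE && ((old >> shift) & BH_LOCK));
	do {	/* on failure, old is reloaded with the current word */
		next = (old & ~((unsigned long)BH_STATE << shift)) |
			((unsigned long)state << shift);
	} while (!atomic_compare_exchange_weak(word, &old, next));
}

/* Step 3: clear the lock bit of block i. */
static inline void bh_unlock(atomic_ulong *word, unsigned i)
{
	atomic_fetch_and(word, ~((unsigned long)BH_LOCK << (i * BH_BITS)));
}

/* Steps 1-3 together; returns 1 only for the caller that made the transition. */
static inline int bh_transition(atomic_ulong *word, unsigned i, unsigned from, unsigned to)
{
	int made = 0;

	bh_lock(word, i);
	if (bh_state(word, i) == from) {
		bh_set_state(word, i, to);
		made = 1;
	}
	bh_unlock(word, i);
	return made;
}
```

Checking the old state under the lock is what gives the "exactly once" property in this sketch:
a second caller attempting the same transition finds the state already changed and does nothing.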

This is far from the most efficient possible lock design, but it is only slightly
worse than the current buffer_head locking scheme and it will do for now. We can
revisit this design after we understand all the locking scenarios required for Tux3.

Notes

  * A given block of a Tux3 volume may never be cached in both the physically indexed
volume cache and a logically indexed file page cache.

  * When file versioning becomes available it will become possible for a given block to
be cached in more than one file page cache, each cache belonging to a mount of a
separate version of the filesystem, but the block is only allowed to be dirty in one of
those caches.





