Design note: ENOSPC (draft)

Tue Jul 30 00:18:58 PDT 2013

ENOSPC handling

Before any end user can realistically use Tux3, we need to have reliable,
noncorrupting and unsuprising behavior when a volume fills up. All attempts
to exceed free space must be detected in the frontend before allowing any
update transaction to modify cache.

Failing to predict allocation failure in the frontend and failing instead in
the backend low level alloc is a critical bug, because commit can no longer
guarantee an ACID update. If such a bug occurs then a kernel error
message will be logged and the filesystem will be put into read-only mode.

ENOSPC is a nasty issue in general, particularly for new generation
filesystems using more complex metadata structures and delayed allocation.
No filesystem can be judged ready for general use without provably robust
ENOSPC handling. On the bright side, Tux3 has some advantages:

   1) Relatively simple metadata structure: easier pessimal bound
   2) Logging means no recursive split to root: relatively tight bound
   3) Delta update model: space reservation begins new each delta
   4) Can adjust delta size as volume fills: tighten the bound

We combine these to implement a novel

The central idea of our ENOSPC strategy is to use smaller deltas as a
volume approaches full. When there is plenty of free space available we
can use large deltas and a loose, efficient computation to place a safe
upper bound on backend free space consumption, but when space is tight then
we can tighten the bound simply by permitting fewer transactions pre delta.

Additionally, as free space decreases, we could contemplate the option of
doing more work to compute tighter bounds, for example by examining
metadata to see whether certain space-consuming corner cases may actually
apply to a given transaction. For perfect accuracy, we could even fall
back to synchronously performing metadata updates - but not IO
submission - in cache to remove all doubt about whether backend allocation
will succeed or not, together with some means of discarding a failed
update. We expect that such elaboration will prove unnecessary in practice.
Simply shrinking the delta size as necessary and estimating allocation
bounds sensibly will most probably be sufficient to achieve good behavior
under near full conditions.

ENOSPC Algorithm

Check for ENOSPC in frontend before altering any cache by estimating the
transaction allocation footprint. We check efficiently when free space is
large by accumulating a "delta footprint" (a loose upper bound on allocation
requirements for the entire delta) per transaction. If the delta footprint
exceeds free space less some "soft" threshold then we trigger a delta
transition. If the delta footprint is less than free space then ENOSPC is
impossible, so proceed with the transaction. Otherwise, wait on delta
completion in case some pinned blocks may be freed, then if this transaction
by itself exceeds free space, abort with ENOSPC. Otherwise repeat. (Note
the possibility of infinite recursion, do something.)

Transaction bound estimation details

The intention is that for most transactions, a robust upper bound on actual
backend allocation can be computed with a small amount of arithmetic on
readily available quantities, with imperceptible overhead in the common case
that the transaction succeeds.

Unify Debt

As an additional complexity, we need to track the "unify debt", that is,
upper bound on blocks needed to retire the log. Try to do this in backend,
not frontend, then bake the bound into some simple per transaction
accounting.

Log footprint

  Reserving a full log block per potential log entry would be too loose by a
factor of hundreds. Instead, accumlate estimated log entries up to size of
one log block, then increment footprint blocks and clear the log entry
accumulator.

Update transaction types:

Attribute update

  Assume "stock" inode size. Assume itree will split: +1 block, +1 log
entry.

Data operations

Each data update implies attribute update.

write

  Assume leaf split, so +1 dleaf, then +n dleaf blocks assuming one data
block per extent, the pessimal fragmentation case. +n log entries.

truncate

  +1 dleaf modified at the truncation point. +n log entries assuming one
entry per file block past the truncation point. We should accumulate the
actual block count in the inode for this and for fstat.

xattr write

  +1 itree block for attr already covers attr body. +3 atable data blocks,
+3 atable dtree blocks, +6 log entries

xattr delete

tighten bound by remembering which atable block is updated

Namespace operations

Update one more more directory files and one or more inodes.

Shard index

  We probably want to use a log/unify strategy for index updates to remove
  latency from frontend and reduce the pessimal footprint (becomes unify
  unify debt instead).

Name create

  Estimating a reshard per dirent add would be too crude. Instead, accumulate
creates and estimate one reshard, plus additional reshards by dividing
total creates.

Operations

link

  Changes a dirent block or adds a new one, updates directory attributes,
  updates or adds a shard tail block, potentially triggering a reshard,
  updates a free tag, updates target inode link count.

  +inode dir, +inode old, +1 dirent block, +1 shard tail, +reshard.
  <to be continued>

Unlink

  Changes a dirent block, updates directory attributes,
  updates or adds a shard tail block, potentially triggering a reshard,
  updates a free tag, updates target inode link count.

  +inode dir, +inode old, +1 dirent block, +1 shard tail, +reshard.
  <to be continued>

create

  Creates a new inode and adds a directory entry

  +link, +inode new, +1 dirent block, +1 shard tail, +reshard.
  <to be continued>

delete

  Deletes a directory entry and potentially deletes an inode

  +1 dirent block, +1 shard tail.
  <to be continued>

Rename
  Same footprint as unlink followed by link

  +unlink, +link
  <to be continued>