Design note: ENOSPC again
Daniel Phillips
daniel at phunq.net
Sat May 10 00:18:19 PDT 2014
Out of space handling is a topic of consuming interest to a handful of
filesystem developers in the world, and otherwise not very interesting
except when it does not work properly. Unlike the database world, this
topic is not well researched, so we filesystem developers are left pretty
much on our own to come up with solutions that are robust, efficient and
relatively simple.
Linus stated what is required quite clearly:
https://lkml.org/lkml/2009/3/25/632
"The undeniable FACT that people don't tend to check errors from
close() should, for example, mean that delayed allocation must still track
disk full conditions"
This means that ENOSPC must be returned immediately by write, link, etc,
which in turn means that our asynchronous frontend is required to predict
ENOSPC in the backend. This is the de facto Linux standard, more stringent
than Posix. An exact prediction is most probably equivalent to the halting
problem, so we aim for a conservative prediction where ENOSPC may be
returned as a false positive. Our challenge is to impose a bound on false
positives such that each ENOSPC return really does correspond to nearly
complete space exhaustion. This has been problematic for Linux filesystems
to say the least, and is an area we must treat particularly carefully in
Tux3. Besides being accurate, we want nearly zero CPU overhead, and we want
reasonable latency: if a syscall returns ENOSPC, it should do so promptly.
The algorithm I describe here is lightweight, adding just a single atomic
compare per filesystem update. It relies on plugging in an ad hoc "safety
factor", which is not rigorous, but neither is kernel stack overflow or
emergency memory reserve accounting, which nevertheless seem to work pretty
well. Our immediate goal is for our out of space prediction to fail
approximately never, as opposed to exactly never. In case our prediction
does fail, we will detect and report that failure right away with a cross
check so that the algorithm can be improved.
Algorithm
The central idea of this algorithm is to reduce our delta size as we run
out of volume space, so that our safety factor ends up being applied to
just one single filesystem update transaction instead of accumulating over all
transactions in a delta. In other words, this algorithm removes the
typically large "updates per delta" multiplier from the out of space
prediction.
Tux3 update transactions run in potentially many frontend tasks,
asynchronous to the backend where all block allocations take place. Each
update transaction is enclosed by "change_begin" and "change_end".
Change_begin obtains a handle for the current delta, then change_end
releases it and schedules a commit if not already pending. Change_begin
runs directly in the call chain from the syscall through the vfs, so if it
detects and returns ENOSPC, userspace sees that immediately. A normal
return is a guarantee by Tux3 that the update will be successfully
committed to media, short of unexpected interruption.
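To make the shape of this concrete, here is a minimal userspace sketch of
the bracket, with invented types and signatures (the real Tux3 code
differs); the cost parameter is the one introduced just below:

/* Illustrative only: real Tux3 types and signatures differ. */
#include <errno.h>

struct sb;                                  /* stand-in for the superblock */

int change_begin(struct sb *sb, long cost); /* may return -ENOSPC */
void change_end(struct sb *sb);             /* schedules a commit if needed */

/* A frontend update as seen from a syscall path, link(2) for example: */
static int link_update(struct sb *sb, long cost)
{
        int err = change_begin(sb, cost);
        if (err)
                return err;     /* ENOSPC goes straight back to userspace */

        /* ... dirty the inode table block and directory block in cache ... */

        change_end(sb);         /* the update is now guaranteed to commit */
        return 0;
}
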
Change_begin is already implemented in the necessary places, that is,
guaranteeing update atomicity and predicting ENOSPC go together. Now,
change_begin can return ENOSPC, whereas before it always succeeded. A
"cost" parameter is added, which is an overestimate of additional blocks
required to commit the delta. This includes the safety factor. For now, we
just plug in an ad hoc cost for each type of update. After the actual
allocations are done for each delta we check whether our overestimate was
too low. If so, we log a warning and carry on. At that point we risk
hitting ENOSPC in the backend with unpleasant results, so our measure of
success is that we never see those warnings and therefore never fail a
backend allocation.
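The cross check itself can be as simple as comparing the summed frontend
estimates for a delta against what the backend actually allocated. A
hypothetical sketch with invented names (the kernel version would warn
through its own logging rather than stderr):

/* Hypothetical cross check after delta staging; names are invented. */
#include <stdio.h>

static void check_delta_estimate(long estimated_blocks, long allocated_blocks)
{
        if (allocated_blocks > estimated_blocks)
                fprintf(stderr, "tux3: delta cost estimate too low (%ld > %ld)\n",
                        allocated_blocks, estimated_blocks);
        /* otherwise the overestimate held, as it always should */
}
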
The crux of this is the algorithm in change_begin. We have a per delta
"budget" which is initially equal to total free blocks less a global
reserve that I will detail below. Each change_begin subtracts its cost from
the budget, and if this does not go negative then the transaction may
proceed. If it does go negative, change_begin does not immediately return
ENOSPC because this could be due to our safety factor accumulating over a
lot of transactions. Instead, change_begin backs out its cost, schedules a
delta commit and waits for "delta staging" to complete, where all the
allocations are done. After staging, the backend knows how many blocks were
actually allocated, so it resets the budget accordingly and wakes up any
waiters. When change_begin wakes up, it tests its cost against the budget
again, and returns ENOSPC only if the cost exceeds the entire budget.
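Below is a userspace model of that logic, with C11 atomics standing in for
kernel atomics and a pthread condition standing in for the wait queue. All
names are invented and initialization is omitted; the real change_begin
would hang these fields off the Tux3 superblock:

#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

struct sb {
        atomic_long budget;     /* free blocks less the global reserve */
        pthread_mutex_t lock;
        pthread_cond_t staged;  /* signalled after delta staging */
        unsigned staging_count; /* bumped by the backend per staging */
};

void schedule_delta(struct sb *sb);     /* ask the backend to stage a delta */

int change_begin(struct sb *sb, long cost)
{
        /* Common case: a single atomic subtract, no locking. */
        if (atomic_fetch_sub(&sb->budget, cost) >= cost)
                return 0;

        /* Budget went negative, possibly just from accumulated safety
         * factors: back out our cost and wait for a delta to be staged. */
        atomic_fetch_add(&sb->budget, cost);

        pthread_mutex_lock(&sb->lock);
        unsigned seen = sb->staging_count;
        schedule_delta(sb);
        while (sb->staging_count == seen)
                pthread_cond_wait(&sb->staged, &sb->lock);
        pthread_mutex_unlock(&sb->lock);

        /* The backend has reset the budget from actual allocations.
         * Try once more; fail only if our cost exceeds what is left. */
        if (atomic_fetch_sub(&sb->budget, cost) >= cost)
                return 0;
        atomic_fetch_add(&sb->budget, cost);
        return -ENOSPC;
}

/* Backend side, after staging: reset the budget and wake any waiters. */
void delta_staged(struct sb *sb, long new_budget)
{
        atomic_store(&sb->budget, new_budget);
        pthread_mutex_lock(&sb->lock);
        sb->staging_count++;
        pthread_cond_broadcast(&sb->staged);
        pthread_mutex_unlock(&sb->lock);
}
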
This algorithm has some attractive properties. In the common case, the CPU
cost is just a single atomic subtract. There is no new locking. It is well
localized, therefore easy to audit. It reports ENOSPC conservatively,
depending only on the cost estimate for a single transaction. Provided that
the cost estimate is truly an overestimate, it never allows a transaction
to proceed with insufficient space. On the negative side, this algorithm is
not fair: while some tasks are busy waking up, a new task could come in and
grab the new budget, causing starvation. This probably doesn't matter, but
if it turns out to be an issue, a fair queuing strategy would have only
slightly more overhead. Finally, it is not rigorous. It should work well,
but a long term goal is to improve it to the point where we can prove that
it always works.
Unify debt
The "unify debt" reserve I referred to above is due to our delta/unify
design, which avoids per delta updates to bitmap blocks and some other
metadata by logging change records instead. A periodic unify flushes this
dirty metadata to media. The "debt" is an overestimate of free blocks
needed by the next unify. After delta staging, we know exactly how many new
metadata blocks are held dirty in cache (this is related to the number of
log entries emitted). We reuse the same overestimate/reset pattern as for
change_begin: we overestimate the unify cost by multiplying the dirty
metadata increase by a safety factor and increase the debt by that much.
After the next unify, the debt is reset according to the actual number of
metadata blocks held dirty in cache as a result of the unify.
Any potential increase in unify debt must be included in the change_begin
cost estimate. We do not estimate that separately, but simply include it in
the cost safety factor. The debt estimate increases with each delta and is
reset to a value based on actual measurement at unify. Like delta cost, it is
easy to include a cross check that the debt estimate is never less than the
actual unify cost.
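A sketch of the overestimate/reset pattern for the debt, with invented
names and a single combined safety factor for simplicity:

/* Illustrative only: how the debt and budget interact in the backend. */
struct space {
        long free_blocks;       /* current free block count */
        long unify_debt;        /* overestimate of blocks the next unify needs */
        long budget;            /* what frontend deltas may consume */
};

/* After delta staging: pad the measured growth in dirty metadata. */
static void account_delta(struct space *s, long new_dirty_meta, long safety)
{
        s->unify_debt += new_dirty_meta * safety;
        s->budget = s->free_blocks - s->unify_debt;
}

/* After unify: reset the debt from what is actually still dirty in cache. */
static void account_unify(struct space *s, long dirty_meta_now, long safety)
{
        s->unify_debt = dirty_meta_now * safety;
        s->budget = s->free_blocks - s->unify_debt;
}
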
Cost estimates
Each kind of update transaction has its own cost estimate, which for now is
the number of per-delta blocks potentially changed by the update, times the
safety factor. For write, each page is a separate transaction, so the
calculation is blocks_per_page * safety_factor, except for the first
update, which also accounts for updating the inode. For link, an inode table
block and a directory block are changed. A move updates up to two directory
blocks and three inodes, so the estimate is 5 * safety_factor.
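In code form, the initial ad hoc estimates might look something like this;
the names and the placeholder safety factor value are examples only:

/* Example cost estimates in blocks; SAFETY_FACTOR is really set at mount. */
enum { SAFETY_FACTOR = 8 };     /* placeholder value */

static long write_cost(long blocks_per_page, int first_update)
{
        /* each page is its own transaction; the first also dirties the inode */
        return (blocks_per_page + (first_update ? 1 : 0)) * SAFETY_FACTOR;
}

static long link_cost(void)
{
        return 2 * SAFETY_FACTOR;       /* inode table block + directory block */
}

static long move_cost(void)
{
        return 5 * SAFETY_FACTOR;       /* two directory blocks + three inodes */
}
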
Unify debt is baked into the safety factor and accounts for most of it. For
now, we assume that every changed block will be a new allocation, that
every allocation will update a different bitmap block, and that every
updated block or inode requires a btree path to be updated from leaf to
root, with potential splits. Maximum btree depth depends on volume size, so
the safety factor should be a variable set at mount. For very small volumes
the number of bitmap blocks affected is a constant 1.
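One possible mount time derivation, purely illustrative; the real bound
would come from Tux3's actual btree fanout and how splits and bitmap
updates are accounted:

#include <stdint.h>

/* Smallest depth d such that fanout**d >= volume_blocks (fanout >= 2). */
static unsigned btree_depth_bound(uint64_t volume_blocks, unsigned fanout)
{
        unsigned depth = 1;
        while (volume_blocks > fanout) {
                volume_blocks = (volume_blocks + fanout - 1) / fanout;
                depth++;
        }
        return depth;
}

static long safety_factor(uint64_t volume_blocks, unsigned fanout)
{
        /* assume each changed block dirties one bitmap block plus a full
         * leaf-to-root btree path, doubled to leave room for splits */
        return 1 + 2 * (long)btree_depth_bound(volume_blocks, fanout);
}
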
Over time we will improve our methods of determining the safety factor and
ideally replace it entirely with provable bounds implemented by various
means. For now, if this ad hoc technique proves to meet our robustness
requirements in the real world then this task is under control. The
backbone of the algorithm that adaptively shrinks delta size to a single
transaction will most probably stand the test of time.
Some updates must always succeed
File truncation is a special case: returning ENOSPC on truncate could foil
attempts to recover space on a full volume. Truncation may require updating
an unbounded number of bitmap blocks, so something must be done about it.
Fortunately, truncation can be backgrounded as a task that runs over
multiple deltas. We have not done this yet; it is another subproject. Such
a background task framework will be useful for other reasons and maps well
to our design.
Likewise, we must be sure that unlink always succeeds. It is sufficient to
provide a global reserve large enough for a single unlink, provided that we
serialize unlinks under ENOSPC conditions. If a change_begin for an unlink
reports ENOSPC, the caller can serialize on a wait queue so that one unlink
may complete per delta. This is another subproject.
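A rough sketch of that fallback, using a mutex to stand in for the proposed
wait queue; change_begin_reserved is hypothetical and would draw on the
global reserve, completing one reserved unlink per delta:

#include <errno.h>
#include <pthread.h>

struct sb;
int change_begin(struct sb *sb, long cost);
int change_begin_reserved(struct sb *sb, long cost);    /* hypothetical */
void change_end(struct sb *sb);

static pthread_mutex_t unlink_serialize = PTHREAD_MUTEX_INITIALIZER;

static int unlink_update(struct sb *sb, long cost)
{
        int reserved = 0;
        int err = change_begin(sb, cost);

        if (err == -ENOSPC) {
                /* Fall back to the global reserve, one unlink at a time. */
                pthread_mutex_lock(&unlink_serialize);
                reserved = 1;
                err = change_begin_reserved(sb, cost);
        }
        if (err)
                goto out;

        /* ... remove the dirent, update the inode ... */
        change_end(sb);
out:
        if (reserved)
                pthread_mutex_unlock(&unlink_serialize);
        return err;
}
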
Incremental improvements
As far as I can see, this is sufficient for a solid out of space strategy
that will fail transactions reliably and not too early. However,
there are opportunities for improvement by replacing magic safety factors
with provable bounds and tightening up the cost estimates so that more of
the last few volume blocks can be usefully allocated. For example, it would
be nice if the frontend could be sure about how many bitmap blocks will be
needed in advance, before the backend goes to work. Sometimes this is
obvious, as with a volume of less than 2**15 blocks, where only one bitmap
block can possibly be needed. With a huge volume, the frontend does not
know where the allocations will be, so must conservatively assume that each
block will be allocated in a different bitmap, a pessimistic estimate that
must be baked into the safety factor. However, the backend allocator can be
taught to obey a bound on bitmap blocks to dirty, by retaining log entries
from one unify to the next if necessary. Whether this is worth the trouble
is another question.
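For reference, one bitmap block holds blocksize * 8 bits, so with 4K blocks
a single bitmap block covers exactly 2**15 blocks. A trivial helper lets
the frontend tell the two cases apart:

#include <stdbool.h>
#include <stdint.h>

/* True if the whole volume's allocation state fits in one bitmap block. */
static bool single_bitmap_block(uint64_t volume_blocks, unsigned blocksize)
{
        return volume_blocks <= (uint64_t)blocksize * 8;
}
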
We can tighten our out of space analysis by implementing certain fallbacks.
For example, we can implement an incremental unify strategy capable of
retiring a single log record per unify. This is clearly possible because we
can roll over all log transactions but one to the next unify if we wish.
This gets easier when the number of log blocks drops to one, which it should
if the number of updates per delta drops to one. A drawback is that code
complexity increases on paths that may be rarely exercised. Before we get
into that we should try to do as much as we can with simple changes to our
common code paths.
Minimum volume size
Our smallest supported volume size is determined by our out of space
strategy. We must set a reasonable limit for unusable volume space due to
conservative out of space accounting, say, 1/16 of a small volume. The out
of space reserve does not increase much for large volumes, so it rapidly drops
to an insignificant fraction. We therefore need to consider what our
smallest volume size will be with our initial, crude safety factors, and
whether it is worth working to lower that number. I have in mind a smallest
volume size of one or two megabytes with 4K blocksize, with about 1/16th
reserved for out of space. That would be just 16 blocks, pretty tight. Some
nice reductions come into play: there is only a single bitmap block and a
maximum of 256 inodes and 256 blocks per file, which limits btree depth.
This reduces the unify debt, which allows a smaller safety factor. So we
should adjust the safety factor according to volume size, which is mainly
important for small volumes.
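For concreteness, the arithmetic for the one megabyte case as a tiny
standalone check:

#include <stdio.h>

int main(void)
{
        unsigned long long volume_bytes = 1ULL << 20;           /* one megabyte */
        unsigned long long blocksize = 4096;
        unsigned long long blocks = volume_bytes / blocksize;   /* 256 blocks */
        unsigned long long reserve = blocks / 16;               /* 16 blocks */

        printf("%llu blocks total, %llu reserved for out of space\n",
               blocks, reserve);
        return 0;
}
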
Supporting a tiny minimum volume size might be more than a nicety; it could
be useful for embedded devices. It is certainly nice for testing. Without
the out of space accounting, minimum volume size is less than 32K with 256
byte blocks. In case somebody has a use for such a tiny volume, we can
easily support it by providing an option to disable frontend out of space
prediction, and let the backend detect out of space as it does already.
Hitting out of space will turn the volume read-only and require a remount
to become writable again, so this is not suitable for general storage
applications but could be useful for tiny, read-only volumes.
Regards,
Daniel