Design note: ENOSPC again
Daniel Phillips
daniel at phunq.net
Sat May 10 00:18:19 PDT 2014
Out of space handling is a topic of consuming interest to a handful of
filesystem developers in the world, and otherwise not very interesting
except when it does not work properly. Unlike the database world, this
topic is not well researched, so we filesystem developers are left pretty
much on our own to come up with solutions that are robust, efficient and
relatively simple.
Linus stated what is required quite clearly:
https://lkml.org/lkml/2009/3/25/632
"The undeniable FACT that people don't tend to check errors from
close() should, for example, mean that delayed allocation must still track
disk full conditions"
This means that ENOSPC must be returned immediately by write, link, etc,
which in turn means that our asynchronous frontend is required to predict
ENOSPC in the backend. This is the de facto Linux standard, more stringent
than Posix. An exact prediction is most probably equivalent to the halting
problem, so we aim for a conservative prediction where ENOSPC may be
returned as a false positive. Our challenge is to impose a bound on false
positives such that each ENOSPC return really does correspond to nearly
complete space exhaustion. This has been problematic for Linux filesystems
to say the least, and is an area we must treat particularly carefully in
Tux3. Besides being accurate, we want nearly zero CPU overhead, and we want
reasonable latency: if a syscall returns ENOSPC, it should do so promptly.
The algorithm I describe here is lightweight, adding just a single atomic
compare per filesystem update. It relies on plugging in an ad hoc "safety
factor", which is not rigorous, but neither is kernel stack overflow or
emergency memory reserve accounting, which nevertheless seem to work pretty
well. Our immediate goal is for our out of space prediction to fail
approximately never, as opposed to exactly never. In case our prediction
does fail, we will detect and report that failure right away with a cross
check so that the algorithm can be improved.
Algorithm
The central idea of this algorithm is to reduce our delta size as we run
out of volume space, so that our safety factor ends up being applied to
just one single filesystem update transaction instead of accumulating over all
transactions in a delta. In other words, this algorithm removes the
typically large "updates per delta" multiplier from the out of space
prediction.
Tux3 update transactions run in potentially many frontend tasks,
asynchronous to the backend where all block allocations take place. Each
update transaction is enclosed by "change_begin" and "change_end".
Change_begin obtains a handle for the current delta, then change_end
releases it and schedules a commit if not already pending. Change_begin
runs directly in the call chain from the syscall through the vfs, so if it
detects and returns ENOSPC, userspace sees that immediately. A normal
return is a guarantee by Tux3 that the update will be successfully
committed to media, short of unexpected interruption.
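To make the shape of this concrete, here is a minimal userspace sketch of
the bracket, with invented types and signatures (the real Tux3 code
differs); the cost parameter is the one introduced just below:

/* Illustrative only: real Tux3 types and signatures differ. */
#include <errno.h>

struct sb;                                  /* stand-in for the superblock */

int change_begin(struct sb *sb, long cost); /* may return -ENOSPC */
void change_end(struct sb *sb);             /* schedules a commit if needed */

/* A frontend update as seen from a syscall path, link(2) for example: */
static int link_update(struct sb *sb, long cost)
{
        int err = change_begin(sb, cost);
        if (err)
                return err;     /* ENOSPC goes straight back to userspace */

        /* ... dirty the inode table block and directory block in cache ... */

        change_end(sb);         /* the update is now guaranteed to commit */
        return 0;
}
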
Change_begin is already implemented in the necessary places, that is,
guaranteeing update atomicity and predicting ENOSPC go together. Now,
change_begin can return ENOSPC, whereas before it always succeeded. A
"cost" parameter is added, which is an overestimate of additional blocks
required to commit the delta. This includes the safety factor. For now, we
just plug in an ad hoc cost for each type of update. After the actual
allocations are done for each delta we check whether our overestimate was
too low. If so, we log a warning and carry on. At that point we risk
hitting ENOSPC in the backend with unpleasant results, so our measure of
success is that we never see those warnings and therefore never fail a
backend allocation.
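The cross check itself can be as simple as comparing the summed frontend
estimates for a delta against what the backend actually allocated. A
hypothetical sketch with invented names (the kernel version would warn
through its own logging rather than stderr):

/* Hypothetical cross check after delta staging; names are invented. */
#include <stdio.h>

static void check_delta_estimate(long estimated_blocks, long allocated_blocks)
{
        if (allocated_blocks > estimated_blocks)
                fprintf(stderr, "tux3: delta cost estimate too low (%ld > %ld)\n",
                        allocated_blocks, estimated_blocks);
        /* otherwise the overestimate held, as it always should */
}
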
The crux of this is the algorithm in change_begin. We have a per delta
"budget" which is initially equal to total free blocks less a global
reserve that I will detail below. Each change_begin subtracts its cost from
the budget, and if this does not go negative then the transaction may
proceed. If it does go negative, change_begin does not immediately return
ENOSPC because this could be due to our safety factor accumulating over a
lot of transactions. Instead, change_begin backs out its cost, schedules a
delta commit and waits for "delta staging" to complete, where all the
allocations are done. After staging, the backend knows how many blocks were
actually allocated, so it resets the budget accordingly and wakes up any
waiters. When change_begin wakes up, it tests its cost against the budget
again, and returns ENOSPC only if the cost exceeds the entire budget.
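Below is a userspace model of that logic, with C11 atomics standing in for
kernel atomics and a pthread condition standing in for the wait queue. All
names are invented and initialization is omitted; the real change_begin
would hang these fields off the Tux3 superblock:

#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>

struct sb {
        atomic_long budget;     /* free blocks less the global reserve */
        pthread_mutex_t lock;
        pthread_cond_t staged;  /* signalled after delta staging */
        unsigned staging_count; /* bumped by the backend per staging */
};

void schedule_delta(struct sb *sb);     /* ask the backend to stage a delta */

int change_begin(struct sb *sb, long cost)
{
        /* Common case: a single atomic subtract, no locking. */
        if (atomic_fetch_sub(&sb->budget, cost) >= cost)
                return 0;

        /* Budget went negative, possibly just from accumulated safety
         * factors: back out our cost and wait for a delta to be staged. */
        atomic_fetch_add(&sb->budget, cost);

        pthread_mutex_lock(&sb->lock);
        unsigned seen = sb->staging_count;
        schedule_delta(sb);
        while (sb->staging_count == seen)
                pthread_cond_wait(&sb->staged, &sb->lock);
        pthread_mutex_unlock(&sb->lock);

        /* The backend has reset the budget from actual allocations.
         * Try once more; fail only if our cost exceeds what is left. */
        if (atomic_fetch_sub(&sb->budget, cost) >= cost)
                return 0;
        atomic_fetch_add(&sb->budget, cost);
        return -ENOSPC;
}

/* Backend side, after staging: reset the budget and wake any waiters. */
void delta_staged(struct sb *sb, long new_budget)
{
        atomic_store(&sb->budget, new_budget);
        pthread_mutex_lock(&sb->lock);
        sb->staging_count++;
        pthread_cond_broadcast(&sb->staged);
        pthread_mutex_unlock(&sb->lock);
}
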
This algorithm has some attractive properties. In the common case, the CPU
cost is just a single atomic subtract. There is no new locking. It is well
localized, therefore easy to audit. It reports ENOSPC conservatively,
depending only on the cost estimate for a single transaction. Provided that
the cost estimate is truly an overestimate, it never allows a transaction
to proceed with insufficient space. On the negative side, this algorithm is
not fair: while some tasks are busy waking up, a new task could come in and
grab the new budget, causing starvation. This probably doesn't matter, but
if it turns out to be an issue, a fair queuing strategy would have only
slightly more overhead. Finally, it is not rigorous. It should work well,
but a long term goal is to improve it to the point where we can prove that
it always works.
Unify debt
The "unify debt" reserve I referred to above is due to our delta/unify
design, which avoids per delta updates to bitmap blocks and some other
metadata by logging change records instead. A periodic unify flushes this
dirty metadata to media. The "debt" is an overestimate of free blocks
needed by the next unify. After delta staging, we know exactly how many new
metadata blocks are held dirty in cache (this is related to the number of
log entries emitted). We reuse the same overestimate/reset pattern as for
change_begin: we overestimate the unify cost by multiplying the dirty
metadata increase by a safety factor and increase the debt by that much.
After the next unify, the debt is reset according to the actual number of
metadata blocks held dirty in cache as a result of the unify.
Any potential increase in unify debt must be included in the change_begin
cost estimate. We do not estimate that separately, but simply include it in
the cost safety factor. The debt estimate increases with each delta and is
reset to a value based on actual measurement at unify. Like delta cost, it is
easy to include a cross check that the debt estimate is never less than the
actual unify cost.
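A sketch of the overestimate/reset pattern for the debt, with invented
names and a single combined safety factor for simplicity:

/* Illustrative only: how the debt and budget interact in the backend. */
struct space {
        long free_blocks;       /* current free block count */
        long unify_debt;        /* overestimate of blocks the next unify needs */
        long budget;            /* what frontend deltas may consume */
};

/* After delta staging: pad the measured growth in dirty metadata. */
static void account_delta(struct space *s, long new_dirty_meta, long safety)
{
        s->unify_debt += new_dirty_meta * safety;
        s->budget = s->free_blocks - s->unify_debt;
}

/* After unify: reset the debt from what is actually still dirty in cache. */
static void account_unify(struct space *s, long dirty_meta_now, long safety)
{
        s->unify_debt = dirty_meta_now * safety;
        s->budget = s->free_blocks - s->unify_debt;
}
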
Cost estimates
Each kind of update transaction has its own cost estimate, which for now is
the number of per-delta blocks potentially changed by the update, times the
safety factor. For write, each page is a separate transaction, so the
calculation is blocks_per_page * safety_factor, except for the first
update, which also accounts for updating the inode. For link, an inode table
block and a directory block are changed. A move updates up to two directory
blocks and three inodes, so the estimate is 5 * safety_factor.
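In code form, the initial ad hoc estimates might look something like this;
the names and the placeholder safety factor value are examples only:

/* Example cost estimates in blocks; SAFETY_FACTOR is really set at mount. */
enum { SAFETY_FACTOR = 8 };     /* placeholder value */

static long write_cost(long blocks_per_page, int first_update)
{
        /* each page is its own transaction; the first also dirties the inode */
        return (blocks_per_page + (first_update ? 1 : 0)) * SAFETY_FACTOR;
}

static long link_cost(void)
{
        return 2 * SAFETY_FACTOR;       /* inode table block + directory block */
}

static long move_cost(void)
{
        return 5 * SAFETY_FACTOR;       /* two directory blocks + three inodes */
}
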
Unify debt is baked into the safety factor and accounts for most of it. For
now, we assume that every changed block will be a new allocation, that
every allocation will update a different bitmap block, and that every
updated block or inode requires a btree path to be updated from leaf to
root, with potential splits. Maximum btree depth depends on volume size, so
the safety factor should be a variable set at mount. For very small volumes
the number of bitmap blocks affected is a constant 1.
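One possible mount time derivation, purely illustrative; the real bound
would come from Tux3's actual btree fanout and how splits and bitmap
updates are accounted:

#include <stdint.h>

/* Smallest depth d such that fanout**d >= volume_blocks (fanout >= 2). */
static unsigned btree_depth_bound(uint64_t volume_blocks, unsigned fanout)
{
        unsigned depth = 1;
        while (volume_blocks > fanout) {
                volume_blocks = (volume_blocks + fanout - 1) / fanout;
                depth++;
        }
        return depth;
}

static long safety_factor(uint64_t volume_blocks, unsigned fanout)
{
        /* assume each changed block dirties one bitmap block plus a full
         * leaf-to-root btree path, doubled to leave room for splits */
        return 1 + 2 * (long)btree_depth_bound(volume_blocks, fanout);
}
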
Over time we will improve our methods of determining the safety factor and
ideally replace it entirely with provable bounds implemented by various
means. For now, if this ad hoc technique proves to meet our robustness
requirements in the real world then this task is under control. The
backbone of the algorithm that adaptively shrinks delta size to a single
transaction will most probably stand the test of time.
Some updates must always succeed
File truncation is a special case: returning ENOSPC on truncate could foil
attempts to recover space on a full volume. Truncation may require updating
an unbounded number of bitmap blocks, so something must be done about it.
Fortunately, truncation can be backgrounded as a task that runs over
multiple deltas. We have not done this yet; it is another subproject. Such
a background task framework will be useful for other reasons and maps well
to our design.
Likewise, we must be sure that unlink always succeeds. It is sufficient to
provide a global reserve large enough for a single unlink, provided that we
serialize unlinks under ENOSPC conditions. If a change_begin for an unlink
reports ENOSPC, the caller can serialize on a wait queue so that one unlink
may complete per delta. This is another subproject.
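A rough sketch of that fallback, using a mutex to stand in for the proposed
wait queue; change_begin_reserved is hypothetical and would draw on the
global reserve, completing one reserved unlink per delta:

#include <errno.h>
#include <pthread.h>

struct sb;
int change_begin(struct sb *sb, long cost);
int change_begin_reserved(struct sb *sb, long cost);    /* hypothetical */
void change_end(struct sb *sb);

static pthread_mutex_t unlink_serialize = PTHREAD_MUTEX_INITIALIZER;

static int unlink_update(struct sb *sb, long cost)
{
        int reserved = 0;
        int err = change_begin(sb, cost);

        if (err == -ENOSPC) {
                /* Fall back to the global reserve, one unlink at a time. */
                pthread_mutex_lock(&unlink_serialize);
                reserved = 1;
                err = change_begin_reserved(sb, cost);
        }
        if (err)
                goto out;

        /* ... remove the dirent, update the inode ... */
        change_end(sb);
out:
        if (reserved)
                pthread_mutex_unlock(&unlink_serialize);
        return err;
}
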
Incremental improvements
As far as I can see, this is sufficient for a solid out of space strategy
that will fail transactions reliably and not too early. However,
there are opportunities for improvement by replacing magic safety factors
with provable bounds and tightening up the cost estimates so that more of
the last few volume blocks can be usefully allocated. For example, it would
be nice if the frontend could be sure about how many bitmap blocks will be
needed in advance, before the backend goes to work. Sometimes this is
obvious, as with a volume of less than 2**15 blocks, where only one bitmap
block can possibly be needed. With a huge volume, the frontend does not
know where the allocations will be, so must conservatively assume that each
block will be allocated in a different bitmap, a pessimistic estimate that
must be baked into the safety factor. However, the backend allocator can be
taught to obey a bound on bitmap blocks to dirty, by retaining log entries
from one unify to the next if necessary. Whether this is worth the trouble
is another question.
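For reference, one bitmap block holds blocksize * 8 bits, so with 4K blocks
a single bitmap block covers exactly 2**15 blocks. A trivial helper lets
the frontend tell the two cases apart:

#include <stdbool.h>
#include <stdint.h>

/* True if the whole volume's allocation state fits in one bitmap block. */
static bool single_bitmap_block(uint64_t volume_blocks, unsigned blocksize)
{
        return volume_blocks <= (uint64_t)blocksize * 8;
}
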
We can tighten our out of space analysis by implementing certain fallbacks.
For example, we can implement an incremental unify strategy capable of
retiring a single log record per unify. This is clearly possible because we
can roll over all log transactions but one to the next unify if we wish.
This gets easier when the number of log blocks drops to one, which it should
if the number of updates per delta drops to one. A drawback is that code
complexity increases on paths that may be rarely exercised. Before we get
into that we should try to do as much as we can with simple changes to our
common code paths.
Minimum volume size
Our smallest supported volume size is determined by our out of space
strategy. We must set a reasonable limit for unusable volume space due to
conservative out of space accounting, say, 1/16 of a small volume. The out
of space reserve does not increase much for large volumes, so it rapidly drops
to an insignificant fraction. We therefore need to consider what our
smallest volume size will be with our initial, crude safety factors, and
whether it is worth working to lower that number. I have in mind a smallest
volume size of one or two megabytes with 4K blocksize, with about 1/16th
reserved for out of space. That would be just 16 blocks, pretty tight. Some
nice reductions come into play: there is only a single bitmap block and a
maximum of 256 inodes and 256 blocks per file, which limits btree depth.
This reduces the unify debt, which allows a smaller safety factor. So we
should adjust the safety factor according to volume size, which is mainly
important for small volumes.
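For concreteness, the arithmetic for the one megabyte case as a tiny
standalone check:

#include <stdio.h>

int main(void)
{
        unsigned long long volume_bytes = 1ULL << 20;           /* one megabyte */
        unsigned long long blocksize = 4096;
        unsigned long long blocks = volume_bytes / blocksize;   /* 256 blocks */
        unsigned long long reserve = blocks / 16;               /* 16 blocks */

        printf("%llu blocks total, %llu reserved for out of space\n",
               blocks, reserve);
        return 0;
}
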
Supporting a tiny minimum volume size might be more than a nicety; it could
be useful for embedded devices. It is certainly nice for testing. Without
the out of space accounting, minimum volume size is less than 32K with 256
byte blocks. In case somebody has a use for such a tiny volume, we can
easily support it by providing an option to disable frontend out of space
prediction, and let the backend detect out of space as it does already.
Hitting out of space will turn the volume read-only and require a remount
to become writable again, so this is not suitable for general storage
applications but could be useful for tiny, read-only volumes.
Regards,
Daniel