[Tux3] Design note: Delta staging
Daniel Phillips
phillips at phunq.net
Thu Dec 4 21:42:10 PST 2008
In a previous design note, I described Tux3's frontend/backend cache
model:
http://mailman.tux3.org/pipermail/tux3/2008-November/000303.html
"Deferred Namespace Operations"
The subject line is a little inaccurate - that post is really more about
Tux3's cache layering, out of which came the conclusion that the
concept of deferred namespace operations is necessary in order to
implement Tux3's layered model optimally. (The above needs a cleanup
and repost as a design note specifically about the cache model.)
This post is about an important piece of the cache layering model:
the "delta staging" operation (called "delta setup" in the post above)
which takes place at each delta transition. Delta staging takes the
changes that the user has made to front end cache and formats them into
disk blocks, ready to be transferred to disk. Since these disk blocks
are all buffered in cache, this is a cache-to-cache operation.
Delta staging does the following:
- Flushes deferred name creates and deletes to directory blocks (*)
- Flushes deferred inode creates and deletes to inode table blocks (*)
- Allocates and assigns physical disk addresses for dirty data blocks
and dirty directory entry blocks
- Updates inode table blocks to point to new locations of the above
blocks
- Assigns new locations for any modified inode table blocks
- Assigns new locations for split index blocks (note: but not for
index blocks that just have a new pointer added; those are handled
by "promises" and rollup described earlier.)
- Allocates and formats one or more log commit blocks to record the
physical locations of all blocks now contained by the delta,
including dirty file data, changed directory entry blocks and split
btree nodes
- Chooses one of the log commit blocks to be the delta commit block
and adds delta commit information to it.
Steps marked (*) above are done only if we have deferred namespace
operations available. Otherwise, the front end will directly modify
directory entry and inode table blocks, using a lock against delta
staging to avoid the situation where the front end wants to modify an
inode table block that delta staging has already modified but not yet
written.
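To make the sequence of steps above a little more concrete, here is a
minimal sketch of the address assignment and log block formatting parts
of delta staging. It is a toy model in Python, not Tux3 code: the Delta
class, the allocate() helper, the counter-based allocator and the
records_per_log_block parameter are all hypothetical names invented for
illustration, and the deferred namespace flush and inode table update
steps are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Delta:
    """Hypothetical in-memory model of one delta being staged."""
    dirty_blocks: list                      # names of dirty data/dirent blocks
    next_free: int = 100                    # toy allocator: next free address
    locations: dict = field(default_factory=dict)
    log_blocks: list = field(default_factory=list)
    commit_block: dict = None


def allocate(delta):
    """Hand out the next free physical address (assumed trivial allocator)."""
    addr = delta.next_free
    delta.next_free += 1
    return addr


def stage_delta(delta, records_per_log_block=2):
    # Assign a physical disk address to every dirty block.
    for name in delta.dirty_blocks:
        delta.locations[name] = allocate(delta)

    # Format log commit blocks recording where each block now lives.
    records = list(delta.locations.items())
    for i in range(0, len(records), records_per_log_block):
        delta.log_blocks.append({
            "addr": allocate(delta),
            "records": records[i:i + records_per_log_block],
            "delta_commit": False,
        })

    # An empty delta still needs one block to carry the commit information.
    if not delta.log_blocks:
        delta.log_blocks.append(
            {"addr": allocate(delta), "records": [], "delta_commit": False})

    # Choose one log commit block to be the delta commit block.
    delta.commit_block = delta.log_blocks[-1]
    delta.commit_block["delta_commit"] = True
    return delta


staged = stage_delta(Delta(["data0", "data1", "dirent0"]))
```

Picking the last log commit block as the delta commit block is an
arbitrary choice in this sketch; the note only says that one of them is
chosen.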
Note that a delta does not include changed bitmaps or changed index
blocks, other than split index blocks. This information can be derived
from the delta commit blocks at replay time (next mount, whether
explicit or after unexpected interruption) so there is no need to write
this to disk.
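As a rough illustration of why the bitmaps need not be written, the
following sketch rebuilds a toy allocation bitmap purely from the block
locations recorded in commit blocks. The record layout (a list of
(name, address) pairs per block) and the replay_bitmap() function are
assumptions made for this example only, and it handles allocations but
not frees, which a real replay would also have to account for.

```python
def replay_bitmap(commit_blocks, volume_blocks):
    """Rebuild a toy allocation bitmap from logged block locations.

    commit_blocks: list of dicts with a "records" list of (name, addr)
    pairs (hypothetical layout). volume_blocks: size of the toy volume.
    """
    bitmap = [False] * volume_blocks
    for block in commit_blocks:
        for _name, addr in block["records"]:
            bitmap[addr] = True     # this address is in use after the delta
    return bitmap


logged = [
    {"records": [("data0", 1), ("data1", 3)]},
    {"records": [("dirent0", 6)]},
]
bitmap = replay_bitmap(logged, volume_blocks=8)
```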
Delta staging will normally be a quick, cache-to-cache operation with
running time measured in microseconds. However, if it needs to read
metadata blocks from disk then its running time will sometimes run into
milliseconds. With deferred namespace operations, this will cause no
visible interruption to the front end, provided the disk is not
backlogged. Without deferred namespace operations, these latency
spikes will be visible to the user, which sounds bad, but is actually
normal behavior for the current generation of Linux filesystems. With
deferred namespace operations, we will push the envelope a little in
Tux3.
When delta staging completes, we have a set of block images ready to
transfer to disk. As soon as the previous delta has completed (that
is, its delta commit block write has completed) all the blocks of the
new delta are submitted for writeout, except for the delta commit
block. When the other blocks have completed writeout, the delta commit
block is written and transfer of the next delta can begin.
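The write ordering just described can be sketched as a sequence of
submit and wait events. This is only a model of the ordering, assuming a
hypothetical delta represented as a dict with a "blocks" list and a
"commit" entry; no real I/O is done and write_deltas() is not a Tux3
function.

```python
def write_deltas(deltas):
    """Return the ordered submit/wait events for a sequence of deltas."""
    events = []
    for n, delta in enumerate(deltas):
        # Submit every block of the delta except the delta commit block.
        for block in delta["blocks"]:
            if block != delta["commit"]:
                events.append(("submit", n, block))
        # Synchronous wait 1: all other blocks must reach disk first.
        events.append(("wait", n, "body"))
        # Only then is the delta commit block written.
        events.append(("submit", n, delta["commit"]))
        # Synchronous wait 2: the next delta starts after the commit lands.
        events.append(("wait", n, "commit"))
    return events


events = write_deltas([
    {"blocks": ["d0", "d1", "c0"], "commit": "c0"},
    {"blocks": ["d2", "c1"], "commit": "c1"},
])
```

The two "wait" events per delta in this model are the two synchronous
waits per delta in the simple strategy.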
This is a pretty simple strategy. In time we will elaborate it
with measures to eliminate the two synchronous waits at each delta: 1)
the commit block waits for the other block writeouts to complete; 2)
the next delta waits for the commit block writeout to complete. These
waits can be eliminated with some simple tricks. Even if we are lazy
and do nothing about these waits, performance will be respectable,
because under load the deltas will be quite large. Ten milliseconds
spent waiting for a 500 millisecond delta writeout will be scarcely
noticeable. But a synchronous write load, like we have with NFS, will
benefit visibly from a smarter commit pipeline strategy.
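The arithmetic behind that claim is simple enough to check. The sketch
below computes the fraction of total time spent in synchronous waits,
using the 10 and 500 millisecond figures from above. The 5 millisecond
writeout used for the synchronous case is an assumed number, chosen only
to show how the ratio changes when deltas are small.

```python
def wait_overhead(wait_ms, writeout_ms):
    """Fraction of total delta time spent in synchronous waits."""
    return wait_ms / (wait_ms + writeout_ms)


# Large delta under load: figures taken from the text above.
loaded = wait_overhead(10, 500)

# Small delta under a synchronous write load: 5 ms is an assumed figure.
synchronous = wait_overhead(10, 5)

print(f"under load:  {loaded:.1%} of the time spent waiting")
print(f"synchronous: {synchronous:.1%} of the time spent waiting")
```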
More on that later.
Regards,
Daniel