[Tux3] Thinking about Syncing
Daniel Phillips
phillips at phunq.net
Sun Oct 12 04:49:10 PDT 2008
However much we might want to wish them away, spinning disks are far
from dead as of this writing and are likely to continue to annoy us for
another decade or more before eventually receding into a largely
archival role. But even in such a happy world, we cannot just forget
about efficient access to our archives, and even though it will tend to
occur over a longer time scale, we still will need to worry about
controlling fragmentation as we slowly churn the data in our near-line
storage archives. In other words, spinning media with all its issues
will never actually die, it will just slowly fade away.
This is all by way of saying that Tux3 must work well with rotating
media if it is to establish a foothold as a general purpose filesystem.
And also by way of saying that for the next few weeks I expect to spend
pretty much all my Tux3 design energy wallowing in this one obsession.
On the bright side, the problems we need to grapple with can get quite
involving and subtle for those whose interest runs in that direction.
My immediate concern is how to get a massive number of buffers dirtied
by filesystem activity onto disk quickly and safely. By "safely" I
mean ACID:
A)tomic
- Either all the disk updates for a given filesystem operation
become part of the stable disk image or none of them do.
C)onsistent
- There is no instant in time when an inconsistent filesystem
image is recorded on disk.
I)solated
- Even though we did not write data to disk in exactly the order
the filesystem operations occurred, it seems like we did.
D)urable
- After fsync of a file, any data written to the file is safe
from power loss.
By "quickly" I mean something resembling media speed, even with lots of
small files. The ideal write pattern will move the disk head linearly,
progressively depositing a mixture of blocks that have been recently
dirtied in cache by filesystem operations. Untarring a 2.6.26.5 kernel
tree to a Tux3 filesystem will yield the following mix:
- File data blocks (56%)
- File index btree root and leaf (43%)
- Inode table blocks (.3%)
- Dirent blocks (.2%)
- Commit blocks (.1%)
The commit blocks make up our "forward log" and logically state how each
of the other blocks is to be inserted into the filesystem tree, without
actually performing any insertions. To prevent log chains from
becoming so long that a significant amount of time is required to
reconstruct the cache state on filesystem startup or resume from crash,
we will perform periodic rollups, where our logical log "promises" are
crystallized as actual index block updates, performed atomically of
course.
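Just to make the promise idea concrete, here is a toy sketch in C. It
is not Tux3 code and every name in it is made up for illustration; it
only shows the shape of the scheme: the normal writeout path records
where a pointer should eventually go, and a later rollup performs all
of the recorded insertions in one small batch of index block updates.

#include <stdint.h>
#include <stdio.h>

/* One logged "promise": insert a pointer to block <child> at slot
 * <offset> of index leaf <leaf>, some time in the future. */
struct promise {
	uint64_t child;
	uint64_t leaf;
	unsigned offset;
};

#define MAX_PROMISES 256

static struct promise logged[MAX_PROMISES];
static unsigned count;

/* Normal writeout path: remember the insertion, touch no index block. */
static void log_promise(uint64_t child, uint64_t leaf, unsigned offset)
{
	if (count < MAX_PROMISES)
		logged[count++] = (struct promise){ child, leaf, offset };
}

/* Periodic rollup: crystallize every outstanding promise as a real
 * index block update, dirtying only the few leaves involved. */
static void rollup(void)
{
	unsigned i;

	for (i = 0; i < count; i++)
		printf("insert block %llu at slot %u of leaf %llu\n",
		       (unsigned long long)logged[i].child, logged[i].offset,
		       (unsigned long long)logged[i].leaf);
	count = 0;	/* the log chain covering these promises can be retired */
}

int main(void)
{
	log_promise(1000, 42, 0);	/* say, a new dirent block */
	log_promise(1001, 42, 1);	/* another one, same leaf */
	rollup();			/* both land in a single leaf update */
	return 0;
}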
How often we must perform such rollups is a good question. I will
arbitrarily assume we do it once a second in the case of a heavy untar
load, at which time we need to insert about 50 pointers to inode table
blocks, 40 pointers to dirent blocks and sometimes a pointer to a new
bitmap block. These are all nicely localized, most probably affecting
one inode table index block, one directory file index block and one
bitmap file index block. Together with the one bitmap block we have
dirtied, that is just four blocks to update. To update these blocks
atomically, we write them to a location of our choosing, which could
possibly be in line with our stream of data blocks, or perhaps more
optimally for read performance, a few tens of tracks back towards the
beginning of the disk. Supposing we choose the latter, we can expect
to spend 10 ms or so seeking to our chosen location, writing the
handful of index blocks and a commit block to keep track of them, and
seeking back to where we will deposit the next second's worth of
untarred files. This 10 ms out of line write once per second (about
1%) will add scarcely any overhead to our writes.
The commit block format can be very simple. For each data block in the
transaction, including dirent blocks, we log the triple: {data block
address, btree leaf block address, index offset}. For each inode table
block we log the triple: {inode table block address, inode table index
leaf address, index offset}. These log entries fit nicely in 16 bytes
each, allowing nearly 256 per commit block, which defines the maximum
size of a transaction at this level. By implication, all the object
blocks owned by the transaction are newly allocated, so this fact does
not have to be explicitly recorded.
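Purely as a sketch, the 16-byte total being the only real constraint
here, such an entry could be laid out like this; the field widths and
names are my assumptions, not a committed on-disk format:

#include <stdint.h>

/* One forward log entry: a promise to insert a pointer to <child> at
 * <offset> within index leaf <parent>.  The 48-bit leaf address and
 * 16-bit offset are assumptions chosen to fit the 16-byte budget. */
struct log_entry {
	uint64_t child;		/* newly written data, dirent or itable block */
	uint32_t parent_lo;	/* index leaf address, low 32 bits */
	uint16_t parent_hi;	/* index leaf address, high 16 bits */
	uint16_t offset;	/* insertion offset within that leaf */
} __attribute__((packed));	/* 16 bytes, so roughly 256 per 4K block */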
The commit block also records the forward log chain link, which will
cause no measurable overhead unless we implement a paranoid mode that
does not rely on checksumming to know when a commit has successfully
reached disk, in which case we will need some periodic out of line
seeks to record log chain commits at predetermined locations. Under
the heavy untar load being considered here, there is no reason to do
this more than about once a second, and the 12 ms or so of overhead
will not affect performance much.
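Again just as an illustration, with made up field names and a made up
24 byte header, a commit block might look something like the structure
below: a small header carrying the chain link and a checksum, with the
rest of the block packed with the 16-byte entries described above.
Replay follows the chain and uses the checksum to recognize whether a
commit actually reached disk intact.

#include <stdint.h>

/* Sketch of a 4K commit block.  Everything here is illustrative. */
struct commit_block {
	uint32_t magic;		/* marks this as a Tux3 commit block */
	uint32_t checksum;	/* lets replay detect a torn or missing write */
	uint64_t chain;		/* forward log chain link to the previous commit */
	uint32_t sequence;	/* position in the chain, for replay ordering */
	uint32_t count;		/* number of valid entries that follow */
	uint8_t entries[4096 - 24];	/* (4096 - 24) / 16 = 254 entries max */
};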
Let's take a step back now and consider the marvelous way that we really
do not have to do much of anything in order to achieve nearly perfect
linear write behavior for the massive untar case, complete with atomic
commit. It may all seem obvious in retrospect, but it took years to
arrive at this level of simplicity. The key insights are:
* We can write data and index blocks anywhere we want.
* We do not need to write our transaction log at any predetermined
position.
* We do not have to write the parent pointer to a block at the time
we write the block, we just need to log a promise to write the
parent pointer at some time in the future.
In the analysis above, we wastefully write out a btree index root and
leaf for each file, whereas the Tux3 design allows for direct pointers
to data blocks from the inode table block, for small files. This
optimization will be postponed until after the initial kernel port,
since it is not very interesting from the algorithm point of view. But
when we do get around to it, our untar load will take a dramatic turn
for the better:
- File data (99%)
- Inode table blocks (.6%)
- Dirent blocks (.4%)
- Commit blocks (.1%)
In other words, we expect to approach media speed closely for this
particular load, which is likely going to be the first test users will
try when they get their hands on Tux3. We will do even better when
immediate file data support is added, where file data is stored
directly in the inode table block if possible, which saves space and
transfer time by packing small files together, thus reducing external
fragmentation.
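For what it is worth, the immediate data idea can be pictured roughly
as below; this is only an illustration of the concept with made up
field sizes, not the Tux3 inode table format:

#include <stdint.h>

/* Illustration only: if a file's data fits in the space its block
 * pointers would otherwise occupy, store the bytes directly in the
 * inode table entry and allocate no separate data blocks at all. */
struct tiny_inode {
	uint32_t mode;
	uint32_t links;
	uint64_t size;
	union {
		uint64_t block[4];	/* normal case: pointers to data blocks */
		uint8_t immediate[32];	/* small file: data stored inline */
	} body;
};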
In this analysis I have glossed over the possibility of losses due to
bio setup overhead, driver overhead, cpu latency in the filesystem
itself and locking. We do expect such losses, probably in the low
single digit percentages. With this in mind I think that we can
realistically aim at 90%+ of media speed for the untar case, which if
achieved will be considerably better than is possible for a journalling
filesystem.
There are far more challenging write loads we also need to deal with,
such as rewriting a previously untarred tree. Ext3, whose write
performance is far from shabby, takes about 2.3 times longer to rewrite
a kernel tree than to untar it in the first place. Tux3 has the
additional challenge that if a snapshot has been set in the interim, it
has to find an entirely new place to store the rewritten tree, which
hopefully will not cause horrible fragmentation if we repeat the
process over and over again. Maybe if we are clever in our choice of
target location for the rewrite, we can get about the same performance
as the virgin untar, which would be wonderful.
Then there is the random write case, slow growing log case, slow growing
directory case, completely random writes, and a whole lot of other
nasty write performance optimization puzzles that will rear their heads
as we progress. Sigh. But now we have a pretty good idea how the
thundering herd of small files case will work, and that is looking very
good. Solutions for the other, more subtle cases will be developed
from the basic model described here.
Regards,
Daniel