Full volume sync performance
Daniel Phillips
phillips at phunq.net
Sat Dec 29 18:52:48 PST 2012
For your entertainment, I will indulge in a little speculation about our full
volume sync performance. For a small file update followed by volume sync,
Tux3's minimal commit is four blocks:
1) File data block
2) File data tree leaf (dleaf)
3) Inode table leaf (ileaf)
4) Commit block (metablock)
To reduce that to three blocks we may log the inode data attr instead of
writing out the dirty itable block, and to save another block we may log the
updated dleaf entry instead of writing out the dirty dleaf block.
Alternatively, we may introduce a data extent attribute to replace the dleaf
entirely for a small file, and log that. Finally, we may introduce an immediate
data attribute to store a small file entirely in the inode table, and log that
instead of writing out an itable block. So even though a volume commit of
only four blocks is already excellent, we expect considerable improvement
over time.
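To make the last of those ideas concrete, here is a sketch of what an
immediate data attribute might look like. The struct and field names are
hypothetical illustrations, not the actual Tux3 on-disk format:

#include <stdint.h>

/*
 * Hypothetical immediate data attribute: store a small file's content
 * directly in its inode table entry, so a delta commit can log this one
 * attribute instead of writing separate dleaf and data blocks.
 * (Illustrative layout only, not real Tux3 format.)
 */
struct immediate_attr {
        uint16_t kind;          /* attribute type tag */
        uint16_t size;          /* bytes of inline file data */
        uint8_t data[];         /* the file content itself */
};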
To summarize, the theoretical minimal commit for a small file update is one
block, and Tux3 actually writes four blocks as of today. For Tux3, any commit
(we say "delta commit") is equivalent to a full volume sync. It is fair to say
that this compares favorably to existing filesystems, especially since Tux3
writes nondestructively and provides unusually strong consistency guarantees.
Now let's see how many blocks Tux3 needs to commit for a new file write and
volume sync. This additionally updates a directory. The directory is a file, so
this load is like updating two files. Let us assume the directory is unindexed,
so only a single block is updated for the new dirent, and that the redirected
directory block is logged into a log block shared by the whole delta, so the
number of log blocks does not increase. This sync requires seven blocks:
1) File data block
2) File data tree leaf (dleaf)
3) Inode table leaf (ileaf)
4) Directory data block
5) Directory data tree leaf (dleaf)
6) Inode table leaf (ileaf)
7) Commit block (metablock)
So, three blocks for the file and three for the directory. However, with good
inum allocation, the directory and file inodes may well lie in the same itable
block, reducing the delta size to six blocks. And as in the single file update
case, this load can theoretically be reduced to a single commit block in the
future.
As in the file update case, the main overhead for spinning disk is seeking, not
transfer time. And as before, we are likely to end up with all blocks except
the commit block on the same track. So we expect similar latency in the new
file write case, let us say about 10 ms, or 100 volume syncs per second for
this realistic scenario.
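A quick sanity check on that figure, assuming one seek to the data track and
one to the distant commit block at the 4.2 ms average seek time of the example
disk discussed below, plus a little rotational and transfer slack:

\[ t_{\mathrm{sync}} \approx 2 \times 4.2\,\mathrm{ms} + \mathrm{slack}
   \approx 10\,\mathrm{ms} \;\Rightarrow\; \approx 100\ \mathrm{syncs/second} \]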
Effect of Rollup
Tux3 does periodic rollups to retire log blocks and prevent unbounded log
growth. Therefore, logging does not actually eliminate block updates; it
merely defers them in the hope of accumulating multiple changes per block.
(For spinning media, all the blocks in a rollup are written in media order,
saving seeks.)
As an additional benefit of the log+rollup strategy, average commit latency is
reduced. A rollup is just an extra step in the "marshal" process that prepares
front end deltas for atomic transfer to media. Therefore, a volume sync might
include a rollup, adding a large number of blocks to the delta. Arguably, we
should average the rollup blocks across all the deltas between rollups to
obtain an accurate average latency estimate. I will not do that here, except
to note that rollups typically run to a modest number of blocks, so the
additional media transfer overhead is not onerous. I will quantify that later.
As a nicety, we might add a heuristic to delay an imminent rollup until after
the delta that a volume sync is waiting on. We have a lot of flexibility in
deciding when to do rollups. If we delay rollups, we pin more log blocks and
some disk free space, and increase log replay time, but those are the only
negative effects.
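A sketch of how such a heuristic might look; the function and field names are
hypothetical, not actual Tux3 code:

#include <stdbool.h>

struct delta_state {
        unsigned log_blocks;    /* log blocks pinned since the last rollup */
        unsigned log_limit;     /* threshold at which a rollup becomes due */
        bool sync_waiting;      /* is a volume sync waiting on this delta? */
};

/* Hypothetical policy: hold off a due rollup while a volume sync is
 * waiting, so the waiter does not pay for the extra rollup blocks. */
static bool should_rollup(struct delta_state *delta)
{
        if (delta->sync_waiting)
                return false;
        return delta->log_blocks >= delta->log_limit;
}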
Volume Sync Benchmark
To test our volume sync performance we can write a very simple benchmark
script: just rewrite a small file and sync the volume many times. Now let me
try to predict the results for spinning disk. I will be optimistic. If real
measurements are even close to my prediction, we are doing well.
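For concreteness, a minimal version of that benchmark in C might look like
this, assuming Linux's syncfs(2) is available for the volume sync; the file
name, write size and iteration count are arbitrary choices:

#define _GNU_SOURCE             /* for syncfs(2) on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char data[1024] = { 0 };
        struct timespec t0, t1;
        int i, n = 1000;
        int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++) {
                pwrite(fd, data, sizeof data, 0); /* rewrite a small file */
                syncfs(fd);                       /* full volume sync */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d syncs in %.2f seconds = %.1f syncs/second\n", n, secs, n / secs);
        return 0;
}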
A typical consumer-grade hard disk available for purchase online has 4.2 ms
average seek latency and a 138 million bytes/second transfer rate, or 131 MB/s
expressed more fairly as binary megabytes. If we presume a full average seek
for each of the four blocks written, that comes out to about 17 ms for a volume
sync. However,
our minimal sync is probably faster than that, because it is likely that some
of those blocks will lie on the same track. In fact, if our allocator is good,
we should often have three of the four blocks on the same track, with only the
commit block (metablock) somewhere far away. Transfer time is negligible
compared to seek time at this scale, so our sync should be just a little more
than two average seek times, or less than 9 ms. This would give us 111 volume
syncs per second on a spinning disk, more than respectable. The rate would be
much higher on flash, because nearly all the sync overhead on spinning disk is
due to seeks.
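The arithmetic behind those two estimates, using the 4.2 ms average seek:

\[ 4\ \mathrm{blocks} \times 4.2\,\mathrm{ms} \approx 17\,\mathrm{ms}
   \quad\text{(every block pays a full seek)} \]
\[ 2 \times 4.2\,\mathrm{ms} = 8.4\,\mathrm{ms} < 9\,\mathrm{ms}
   \;\Rightarrow\; 1 / 9\,\mathrm{ms} \approx 111\ \mathrm{syncs/second} \]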
How big is a disk track?
Perhaps surprisingly, we can figure out how many bytes are stored on each disk
track even without help from the manufacturer. This is just the media transfer
rate divided by the rotation rate, which for a 7200 RPM drive comes out to:
138 MB/sec / 120 revs/sec = 1.1 MB/track.
What is the probability of being able to allocate all the small file blocks and
all the directory blocks for one small file on the same track? We don't know
where the track boundaries are, so we must additionally consider the
possibility of the delta spanning two tracks. For example, if we are able to
allocate all the blocks of a delta in the same 100 KB range, the probability
that the update will be split across two tracks is about one in ten. However,
spanning two tracks does not cost a full additional seek because the upper
part of the split may be written to media on the same rotation as the lower.
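The one-in-ten figure follows if we assume the track boundary falls uniformly
at random relative to our allocation, so the split probability is roughly the
extent size over the track size:

\[ P(\mathrm{split}) \approx \frac{100\,\mathrm{KB}}{1.1\,\mathrm{MB}} \approx 0.09 \]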
From the above, it seems likely that our allocation policy will be able to
allocate the six blocks in question within the same track most of the time, if
we take care to leave some slack space distributed across the volume so that
redirected blocks need not be moved far away from their "home" target. Ten
percent slack space seems reasonable, and could be reduced as a volume becomes
nearly full.
Future
In the future, Tux3 will log metadata more aggressively. In particular, we may
log namespace changes instead of committing a full directory block per delta,
deferring the actual directory block writes to a rollup. This would be a
substantial win where there are multiple changes to the same directory, a
common load.
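As a rough illustration, a logged namespace change might be a compact record
along these lines; the layout and names are purely hypothetical:

#include <stdint.h>

/*
 * Hypothetical namespace log record: log just the dirent change per
 * delta and apply it to the directory block at rollup time, instead of
 * committing a full directory block. (Illustrative only.)
 */
struct log_dirent {
        uint8_t op;             /* hypothetical opcode: add or delete dirent */
        uint8_t namelen;        /* length of the name field */
        uint16_t mode;          /* file type bits for the dirent */
        uint32_t dir_inum;      /* inode number of the directory */
        uint32_t inum;          /* inode number the dirent points to */
        char name[];            /* the directory entry name */
};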
Regards,
Daniel