Full volume sync performance
Daniel Phillips
phillips at phunq.net
Sat Dec 29 18:52:48 PST 2012
For your entertainment, I will indulge in a little speculation about our full
volume sync performance. For a small file update followed by volume sync,
Tux3's minimal commit is four blocks:
1) File data block
2) File data tree leaf (dleaf)
3) Inode table leaf (ileaf)
4) Commit block (metablock)
To reduce that to three blocks we may log the inode data attr instead of
writing out the dirty itable block, and to save another block we may log the
updated dleaf entry instead of writing out the dirty dleaf block.
Alternatively, we may introduce a data extent attribute to replace the dleaf
entirely for a small file, and log that. Finally, we may introduce an immediate
data attribute to store a small file entirely in the inode table, and log that
instead of writing out an itable block. So even though a volume commit of
only four blocks is already excellent, we expect considerable improvement
over time.
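To make the last of those ideas concrete, here is a sketch of what an
immediate data attribute might look like. The struct and field names are
hypothetical illustrations, not the actual Tux3 on-disk format:

#include <stdint.h>

/*
 * Hypothetical immediate data attribute: store a small file's content
 * directly in its inode table entry, so a delta commit can log this one
 * attribute instead of writing separate dleaf and data blocks.
 * (Illustrative layout only, not real Tux3 format.)
 */
struct immediate_attr {
        uint16_t kind;          /* attribute type tag */
        uint16_t size;          /* bytes of inline file data */
        uint8_t data[];         /* the file content itself */
};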
To summarize, the theoretical minimal commit for a small file update is one
block, and Tux3 actually writes four blocks as of today. For Tux3, any commit
(we say "delta commit") is equivalent to a full volume sync. It is fair to say
that this compares favorably to existing filesystems, especially since Tux3
writes nondestructively and provides unusually strong consistency guarantees.
Now let's see how many blocks Tux3 needs to commit for a new file write and
volume sync. This additionally updates a directory. The directory is a file, so
this load is like updating two files. Let us assume the directory is unindexed,
so only a single block is updated for the new dirent, and that the redirected
directory block is logged into a log block shared by the whole delta, so the
number of log blocks does not increase. This sync requires seven blocks:
1) File data block
2) File data tree leaf (dleaf)
3) Inode table leaf (ileaf)
4) Directory data block
5) Directory data tree leaf (dleaf)
6) Inode table leaf (ileaf)
7) Commit block (metablock)
So, three blocks for the file and three for the directory. However, with good
inum allocation, the directory and file inodes may well lie in the same itable
block, reducing the delta size to six blocks. And as in the single file update
case, this load can theoretically be reduced to a single commit block in the
future.
As in the file update case, the main overhead for spinning disk is seeking, not
transfer time. And as before, we are likely to end up with all blocks except
the commit block on the same track. So we expect similar latency in the new
file write case, let us say about 10 ms, or 100 volume syncs per second for
this realistic scenario.
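A quick sanity check on that figure, assuming one seek to the data track and
one to the distant commit block at the 4.2 ms average seek time of the example
disk discussed below, plus a little rotational and transfer slack:

\[ t_{\mathrm{sync}} \approx 2 \times 4.2\,\mathrm{ms} + \mathrm{slack}
   \approx 10\,\mathrm{ms} \;\Rightarrow\; \approx 100\ \mathrm{syncs/second} \]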
Effect of Rollup
Tux3 does periodic rollups to retire log blocks and prevent unbounded log
growth. Therefore, logging does not actually eliminate block updates; it
merely defers them in the hope of accumulating multiple changes per block.
(For spinning media, all the blocks in a rollup are written in media order,
saving seeks.)
As an additional benefit of the log+rollup strategy, average commit latency is
reduced. A rollup is just an extra step in the "marshal" process that prepares
front end deltas for atomic transfer to media. Therefore, a volume sync might
include a rollup, adding a large number of blocks to the delta. Arguably, we
should average the rollup blocks across all the deltas between rollups to
obtain an accurate average latency estimate. I will not do that here, except
to note that rollups typically run to a modest number of blocks, so the
additional media transfer overhead is not onerous. I will quantify that later.
As a nicety, we might add a heuristic to delay an imminent rollup until after
the delta that a volume sync is waiting on. We have a lot of flexibility in
deciding when to do rollups. If we delay rollups, we pin more log blocks and
some disk free space, and increase log replay time, but those are the only
negative effects.
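A sketch of how such a heuristic might look; the function and field names are
hypothetical, not actual Tux3 code:

#include <stdbool.h>

struct delta_state {
        unsigned log_blocks;    /* log blocks pinned since the last rollup */
        unsigned log_limit;     /* threshold at which a rollup becomes due */
        bool sync_waiting;      /* is a volume sync waiting on this delta? */
};

/* Hypothetical policy: hold off a due rollup while a volume sync is
 * waiting, so the waiter does not pay for the extra rollup blocks. */
static bool should_rollup(struct delta_state *delta)
{
        if (delta->sync_waiting)
                return false;
        return delta->log_blocks >= delta->log_limit;
}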
Volume Sync Benchmark
To test our volume sync performance we can write a very simple benchmark
script: just rewrite a small file and sync the volume many times. Now let me
try to predict the results for spinning disk. I will be optimistic. If real
measurements are even close to my prediction, we are doing well.
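For concreteness, a minimal version of that benchmark in C might look like
this, assuming Linux's syncfs(2) is available for the volume sync; the file
name, write size and iteration count are arbitrary choices:

#define _GNU_SOURCE             /* for syncfs(2) on Linux */
#include <fcntl.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        char data[1024] = { 0 };
        struct timespec t0, t1;
        int i, n = 1000;
        int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

        if (fd < 0)
                return 1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (i = 0; i < n; i++) {
                pwrite(fd, data, sizeof data, 0); /* rewrite a small file */
                syncfs(fd);                       /* full volume sync */
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("%d syncs in %.2f seconds = %.1f syncs/second\n", n, secs, n / secs);
        return 0;
}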
A typical consumer-grade hard disk available for purchase online has 4.2 ms
average seek latency and a 138 million bytes/second transfer rate, or 131 MB/s
expressed more fairly as binary megabytes. If we presume a full average seek
for each of the four blocks written, that comes out to about 17 ms for a volume
sync. However,
our minimal sync is probably faster than that, because it is likely that some
of those blocks will lie on the same track. In fact, if our allocator is good,
we should often have three of the four blocks on the same track, with only the
commit block (metablock) somewhere far away. Transfer time is negligible
compared to seek time at this scale, so our sync should be just a little more
than two average seek times, or less than 9 ms. This would give us 111 volume
syncs per second on a spinning disk, more than respectable. The rate would be
much higher on flash, because nearly all the sync overhead on spinning disk is
due to seeks.
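The arithmetic behind those two estimates, using the 4.2 ms average seek:

\[ 4\ \mathrm{blocks} \times 4.2\,\mathrm{ms} \approx 17\,\mathrm{ms}
   \quad\text{(every block pays a full seek)} \]
\[ 2 \times 4.2\,\mathrm{ms} = 8.4\,\mathrm{ms} < 9\,\mathrm{ms}
   \;\Rightarrow\; 1 / 9\,\mathrm{ms} \approx 111\ \mathrm{syncs/second} \]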
How big is a disk track?
Perhaps surprisingly, we can figure out how many bytes are stored on each disk
track even without help from the manufacturer. This is just the media transfer
rate divided by the rotation rate, which for a 7200 RPM drive comes out to:
138 MB/sec / 120 revs/sec = 1.1 MB/track.
What is the probability of being able to allocate all the small file blocks and
all the directory blocks for one small file on the same track? We don't know
where the track boundaries are, so we must additionally consider the
possibility of the delta spanning two tracks. For example, if we are able to
allocate all the blocks of a delta in the same 100 KB range, the probability
that the update will be split across two tracks is about one in ten. However,
spanning two tracks does not cost a full additional seek because the upper
part of the split may be written to media on the same rotation as the lower.
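The one-in-ten figure follows if we assume the track boundary falls uniformly
at random relative to our allocation, so the split probability is roughly the
extent size over the track size:

\[ P(\mathrm{split}) \approx \frac{100\,\mathrm{KB}}{1.1\,\mathrm{MB}} \approx 0.09 \]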
From the above, it seems likely that our allocation policy will be able to
allocate the six blocks in question within the same track most of the time, if
we take care to leave some slack space distributed across the volume so that
redirected blocks need not be moved far away from their "home" target. Ten
percent slack space seems reasonable, and could be reduced as a volume becomes
nearly full.
Future
In the future, Tux3 will log metadata more aggressively. In particular, we may
log namespace changes instead of committing a full directory block per delta,
deferring the actual directory block writes to a rollup. This would be a
substantial win where there are multiple changes to the same directory, a
common load.
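As a rough illustration, a logged namespace change might be a compact record
along these lines; the layout and names are purely hypothetical:

#include <stdint.h>

/*
 * Hypothetical namespace log record: log just the dirent change per
 * delta and apply it to the directory block at rollup time, instead of
 * committing a full directory block. (Illustrative only.)
 */
struct log_dirent {
        uint8_t op;             /* hypothetical opcode: add or delete dirent */
        uint8_t namelen;        /* length of the name field */
        uint16_t mode;          /* file type bits for the dirent */
        uint32_t dir_inum;      /* inode number of the directory */
        uint32_t inum;          /* inode number the dirent points to */
        char name[];            /* the directory entry name */
};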
Regards,
Daniel