Tux3 Report: Fsync Redux

Mon Jun 10 02:02:42 PDT 2013

One month ago we reported that Tux3 is now able to beat tmpfs in at 
least one benchmark load, a claim that generated considerable lively 
commentary, some of it collegial:

     https://lkml.org/lkml/2013/5/7/742

Ted raised an interesting question about whether Tux3 is forever 
destined to suffer from slow fsync as an inherent design tradeoff. To be 
clear about it: that was never part of our plan.

 From early on, we intended to rely on full filesystem sync in place of 
fsync in the interest of correctness, then implement a properly 
optimized fsync later, probably after mainline merge. Elaborating our 
simple, rugged full-filesystem atomic commit strategy to accomodate 
efficient fsync always seemed to be a relatively easy design problem 
compared to the daunting task of adding strong consistency properties to 
a filesystem primarily oriented towards writing out single inodes. In 
the event, our intuition proved correct.

First, let me characterize the case where sync performs measurably worse 
than a purpose-built fsync. If Tux3 has just one dirty file and does a 
full filesystem sync, it performs well because only a few blocks need to 
be written. However, if there are dirty inodes in cache that fsync could 
bypass, sync may exhibit much higher latency than a well targetted fsync.

The arguably artificial dbench benchmark load rewards such dirty inode 
bypass richly by deleting most files while they are still retained in 
cache, thus avoiding a large amount of disk traffic. The question of 
whether this models any real world situation accurately is beside the 
point: if Tux3 wants to win at pure, unadulterated dbench, it better not 
let fsync push other dirty inodes to disk.

We set the following goals for our new, improved fsync design:

   * Meet or beat Ext4 performance
   * Keep our strong consistency semantics (similar to Ext3 data=journal)
   * No performance regression for non-fsync updates
   * Simple and provably correct

As always, the "simple" requirement is the biggest challenge. Various 
complex and finicky solutions suggest themselves. For example, we could 
analyze block level dependencies and write out just those blocks 
involved with a given dirty file. I can imagine such an approach 
succeeding, but cannot imagine it being simple.

After a few days of worrying away at the issues, I hit on a simple, 
effective idea. This was really just a matter of understanding our 
existing design more deeply rather than any fundamental breakthrough. 
(Note: we still do not intend to implement this right now, only design 
it with a view to putting to rest doubts that were expressed about the 
innate ability of Tux3 to handle fsync efficiently.)

Background

Tux3 normally divides episodes of filesystem update activity into chunks 
called "deltas". Each delta includes all data, namespace and attribute 
changes caused by some arbitrary set of syscalls. Each delta cleans all 
dirty cache. This is perfectly efficient for many common filesystem 
loads, but not all. Fsync is a good example of a case where we would 
like to leave most dirty data sitting in cache while committing just the 
dirty blocks of a single file. In other words, dirty data for an fsynced 
file needs to "jump the queue" of delta commits in the interest of 
efficiency.

Ext4 gets this kind of behavior more or less for free, because it is 
designed to write out whichever inode that core kernel tells it to, 
whenever it is told to. In contrast, Tux3 ignores core kernel's plan for 
which inodes should be written (taking the view that core kernel just 
does not understand the problem very well) and always writes out all of 
them. Our immediate project is to relax that behavior to support writing 
out just one of them.

The first big step forward came from noticing that it is easy to move 
dirty file data forward from the "current" delta (in process of being 
modified by syscalls) to the pervious delta (scheduled for commit to 
disk). For each dirty page, Tux3 conceptually maintains two dirty bits, 
one for the current and one for the previous delta. The "forking" 
mechanism protects any cache page that is dirty in the previous delta 
from being modified by subsequent syscalls. It is easy to walk the list 
of dirty pages of a given inode and change the state of all pages dirty 
in the current delta to be dirty in the previous delta instead. I call 
this the "promote" operation, which effectively takes a point in time 
snapshot of inode data. In our parlance, fsync promotes dirty data from 
the current to the previous delta. The remainder of the effort to turn 
this promote tool into an efficient fsync concerns feeding such promoted 
pages out to disk efficiently.

Subdelta concept rejected

My first stab at the fsync writeout problem revolved around a "subdelta"
concept. A subdelta would be an intermediate commit including just some 
of the dirty inodes belonging to a given delta. With this, fsync could 
interrupt the backend as it marshals some big delta for writeout, commit 
the fsynced inode, then resume committing the remainder of the delta.

This idea was vetoed by Hirofumi because it breaks our strong ordering
guarantee that we have grown rather fond of:

     If transaction A completes before transaction B begins, then
     transaction B will never be durably recorded unless transaction
     A is too.

I agree: stronger consistency semantics are user-friendly; we have 
become accustomed to them; we will hold that sacred as far as we can. So 
back to the drawing board.

Parallel log streams

The winning idea occurred practically simultaneously to both of us. We 
will elaborate our log design slightly to support parallel log streams. 
When the backend is in the middle of committing a big delta to disk, it 
may interrupt its work to start a new log chain that includes only a 
single inode with its modified attributes. When this completes, the 
backend resumes committing the remainder of the big delta.

For this to work properly there must not be any dependencies between the 
two log streams. In the case of regular files the log only needs to 
track block allocations and record new positions of any redirected 
blocks. To avoid colliding with changes belonging to the in-flight delta 
we will not modify the inode table immediately but rely on the fsync log 
to hold the updates until the in-flight delta completes. Then the 
backend will update the inode table to reflect the synced state in the 
following delta.

Various details need to be addressed to make this work. If an fsynced 
inode has already been marshalled and is on its way to disk in a 
previous delta then we need to wait for that delta to complete before 
committing the fsync. Doing otherwise would require significant design 
elaboration, and such a situation should be rare compared to the case 
where an fsynced inode is dirty in the previous delta but not yet 
marshalled (e.g., updated just before a delta transition and fsynced 
just after the transition). We must take care that an fsynced inode is 
not evicted before the inode table is updated to point to the new file 
data. Log replay becomes slightly more complex. Specifically, updated 
data attributes from fsync logs need to be entered into the inode table. 
This is a small amount of additional work to be done only on unexpected 
restart and has no efficiency impact. If an inode is newly created we 
must ensure that it is linked from some directory so that it will not 
leak on unexpected restart. For now we will fall back to sync for that 
situation, which guarantees that directories are consistent with the 
inode table.

That seems to be about it. No doubt we will discover a few unanticipated
details during implementation, however there is nothing more complex 
than or dissimilar from work we have already completed successfully. 
When we do get around to it, we should end up performing just fine in 
pure, unadulterated dbench, for what that is worth.

Directory fsync

We will not attempt to optimize directory fsync for the time being, not 
because of any inherent design limitation, but because we need to get 
back to work on the remaining issues that actually impact base 
functionality and currently stand in the way of Tux3 being practically 
usable. I will just mention that directory fsync involves consistency 
between directory data and the inode table, a topologically more complex 
problem than consistency between the inode table and regular file data. 
We are content to let this interesting problem simmer quietly on the 
back burner for the time being.

Shoutout to Samsung

I would like to thank Samsung warmly for providing me with a working 
situation where I can concentrate fully on Tux3 development as a member 
of the new Samsung Open Source Group.

Regards,

Daniel