[Tux3] Design note: Data flush and rename ordering
Daniel Phillips
phillips at phunq.net
Tue Mar 24 16:37:48 PDT 2009

I will touch briefly on the recent Ext4 issue, where files written by
the write-to-temporary/rename-as-target method would frequently show
up as zero length after a crash instead of having the original or
updated data.

The problem is, Ext4 holds recently written file data in cache even
across an atomic (journalled) update of directory metadata. While a
strict reading of Posix permits this, application writers do not expect
it and I think we want to define stronger semantics for Tux3. That is,
we should guarantee that a rename will never be committed to disk
before the source file of the rename is flushed.

Our initial implementation of atomic commit will always flush every
dirty inode to disk at each delta transition, which provides the above
guarantee by default. That is, a rename will always be committed in or
after the delta that flushes its source inode.

Later, we will move to a model where only some inodes are flushed at
each delta, allowing more dirty file data to be retained in memory
under heavy write loads, with a corresponding improvement in transfer
efficiency. At that point we need to do something special to satisfy
the proposed flush-before-rename rule. Supposing we keep a list of
inodes scheduled to be flushed in the current delta, each rename just
needs to move the source inode to that list, if the source inode has
dirty pages in page cache.
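
The rule above can be sketched as a toy model in C. This is only an
illustration, not Tux3 code: the names toy_inode, toy_delta,
toy_schedule_flush, toy_rename and toy_commit_delta are all
hypothetical, and setting a flag stands in for real data writeout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy structures, not the real Tux3 inode or delta. */
struct toy_inode {
    unsigned long ino;
    bool has_dirty_pages;          /* dirty data still in page cache */
    bool on_flush_list;            /* already scheduled in this delta */
    struct toy_inode *next_flush;  /* link in the delta's flush list */
};

struct toy_delta {
    struct toy_inode *flush_list;  /* inodes to flush at delta commit */
    size_t flush_count;
};

/* Schedule an inode for flush in the current delta (idempotent). */
void toy_schedule_flush(struct toy_delta *delta, struct toy_inode *inode)
{
    if (inode->on_flush_list)
        return;
    inode->next_flush = delta->flush_list;
    delta->flush_list = inode;
    inode->on_flush_list = true;
    delta->flush_count++;
}

/* Rename: a source inode with dirty pages joins the current delta's
 * flush list, so its data goes out no later than the rename itself. */
void toy_rename(struct toy_delta *delta, struct toy_inode *source)
{
    assert(delta != NULL && source != NULL);
    if (source->has_dirty_pages)
        toy_schedule_flush(delta, source);
    /* The directory update would be recorded in this same delta. */
}

/* Delta transition: flush every scheduled inode, then the metadata
 * commit would follow. Returns the number of inodes flushed. */
size_t toy_commit_delta(struct toy_delta *delta)
{
    size_t flushed = 0;

    while (delta->flush_list != NULL) {
        struct toy_inode *inode = delta->flush_list;

        delta->flush_list = inode->next_flush;
        inode->next_flush = NULL;
        inode->on_flush_list = false;
        inode->has_dirty_pages = false;  /* stands in for writeout */
        flushed++;
    }
    delta->flush_count = 0;
    return flushed;
}
```

In this sketch a rename of a clean inode costs nothing extra, and
renaming the same dirty inode twice within one delta schedules it
only once.
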
This additional requirement for rename is unlikely to reduce write
performance perceptibly, because write/rename loads are relatively
rare. This strategy is typically used for update of application
config files, or for mail delivery, where the application writer
relies on rename as the only Posix means of atomically transitioning
from one file state to another. Mass file writes such as cp -a or
untar do not typically use the write/rename method.

It has also been proposed that application writers should not expect
a write/rename operation to guarantee any ordering between data flush
and rename, and that applications should be recoded to place an fsync
before the rename where such ordering is important. I side with the
application writers on this: expectations of write/rename semantics
are well established, and it is now incumbent on filesystem developers
to fulfill these expectations, even though a strict reading of Posix
does not require it. Imposing a new requirement for fsync before
rename would slow down file operations significantly under many common
loads, much more so than introducing a barrier between data flush and
rename commit. Most importantly, large numbers of applications expect
this implicit ordering, because in the past, filesystems have always
seemed to provide it.
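
For reference, the application side of the write/rename method might
look roughly like the following C sketch. The function replace_file
and its sync_first flag are hypothetical names for illustration; with
sync_first set, it shows the proposed fsync-before-rename variant,
and with it clear, the traditional pattern that relies on the
filesystem to order data flush before rename commit.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

/* Replace `target` with `len` bytes of `data` by writing a temporary
 * file and renaming it over the target. `tmp_path` must be on the
 * same filesystem as `target`. Returns 0 on success, -1 on error. */
int replace_file(const char *target, const char *tmp_path,
                 const char *data, size_t len, bool sync_first)
{
    assert(target != NULL && tmp_path != NULL && data != NULL);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may be short or interrupted, so loop until done. */
    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            close(fd);
            unlink(tmp_path);
            return -1;
        }
        data += n;
        len -= (size_t)n;
    }

    /* Optional explicit barrier: force file data to disk first. */
    if (sync_first && fsync(fd) < 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) < 0) {
        unlink(tmp_path);
        return -1;
    }

    /* Atomically switch the name from the old contents to the new. */
    if (rename(tmp_path, target) < 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

A call such as replace_file("app.conf", "app.conf.tmp", buf, n, false)
is the form that existing applications commonly use, and it is the
form the flush-before-rename rule is meant to keep safe.
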
Regards,
Daniel