Tux3 report: New news for the new year

Wed Jan 2 03:03:13 PST 2013

On Tuesday, January 01, 2013 10:58:35 PM Shentino wrote:
> From what I can tell on the design, tux3 is "fsync satiating" with a
> single disk write.  It writes the data to the final location, updates
> the log, and at that point the data is considered committed and it can
> let userspace go on its merry way and take care of rolling up the
> changes later.

Yes, correct. I think we currently sync a small file create+write with seven
blocks and a file rewrite with four blocks, including the commit block, and
with only one long seek. We haven't benchmarked that yet, but it sounds fast.
There are two synchronous waits in the backend, but the frontend waits only on
commit block completion, and only in the task doing the sync, while other
concurrent filesystem operations just keep going.
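
As a rough illustration of that frontend/backend split, here is a toy model
in Python. It is not tux3 code: the class, the method names, and the block
labels are all hypothetical, and it does not model the two backend waits. It
only shows the shape of the idea, where a backend thread writes the data and
log blocks, then the commit block, and only the task that asked for the sync
waits on the commit.

```python
import queue
import threading


class ToyCommitPipeline:
    """Toy model only. All names here are made up for illustration."""

    def __init__(self):
        self.disk = []                # blocks in the order they reach "disk"
        self.pending = queue.Queue()  # work handed from frontend to backend
        worker = threading.Thread(target=self._backend, daemon=True)
        worker.start()

    def _backend(self):
        while True:
            blocks, committed = self.pending.get()
            self.disk.extend(blocks)    # data and log blocks go out first
            self.disk.append("commit")  # the commit block goes out last
            committed.set()             # wake only the task that is syncing

    def submit(self, blocks):
        """Hand blocks to the backend without waiting for them."""
        committed = threading.Event()
        self.pending.put((list(blocks), committed))
        return committed

    def sync(self, blocks, timeout=5.0):
        """Block the calling task until the commit block is on disk."""
        return self.submit(blocks).wait(timeout)
```

A task that calls submit() keeps running straight away, and only a task that
calls sync(), or waits on the returned event, blocks until the commit block
has been written.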

> If I understand btrfs correctly though it has to block
> until the cow logic percolates all the way up to the superblock.

A careful reading of the Btrfs design doc left me confused about that. Perhaps 
Btrfs devs could clarify?

> One other thing that interests me is this "page forking" that allows
> userspace to write to a page that's already busy being written to
> disk.  From what I heard it bypasses a stall caused by userspace I/O
> hitting a locked page.

Page forking is an amazing thing and should really head into core, after being 
thoroughly proved out of course.
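
For readers who have not met the idea, a minimal sketch of page forking
follows. It is a toy model in Python, not the tux3 implementation: Page,
ToyPageCache, and the under_writeback flag are hypothetical names. When a
write hits a page that is busy being written to disk, the cache swaps in a
copy and the writer modifies the copy, so the in-flight page stays stable and
the writer does not stall.

```python
class Page:
    def __init__(self, data):
        self.data = bytearray(data)   # private copy of the contents
        self.under_writeback = False  # True while "disk I/O" is in flight


class ToyPageCache:
    """Toy model only. All names here are made up for illustration."""

    def __init__(self):
        self.pages = {}  # page index -> Page

    def start_writeback(self, index):
        """Mark a page as in flight and hand it to the I/O path."""
        page = self.pages[index]
        page.under_writeback = True
        return page

    def write(self, index, offset, payload):
        """Write into a cached page, forking it if it is in flight."""
        page = self.pages[index]
        if page.under_writeback:
            # Fork: the old page stays untouched for the pending I/O,
            # and the cache now points at a fresh, writable copy.
            page = Page(page.data)
            self.pages[index] = page
        page.data[offset:offset + len(payload)] = payload
        return page
```

The alternative without forking would be for write() to wait until the
in-flight page is unlocked, which is the stall described in the quoted text.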

> Finally, atime handling.  I personally dislike the forced default of
> "relatime" for mount options and anything that can let atime updates
> happen without being a bottleneck is a plus for me.

Atime is an odious invention indeed from a developer's perspective, but
apparently well loved by some users, and it has real applications, knowing
which videos you watched recently apparently being one of them. We have a
pretty good plan for it that is actually just a small development item. Its
main feature is avoiding polluting the inode table btree, which would cause a
lot of churn, aggravate allocate-on-write issues that are already difficult,
and be horribly unfriendly to flash. Instead, we churn a dedicated btree array
(actually a regular file) where the write-on-reads are densely concentrated.
It somehow feels good to quarantine this craziness at least.
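
To make the density argument concrete, here is a toy sketch in Python. It is
not the tux3 design: the 8-byte slot, the 4096-byte block size, the indexing
by inode number, and every name in it are assumptions chosen for
illustration. It only shows why keeping atimes in their own dense file
concentrates write-on-read traffic into a few blocks instead of dirtying the
inode table.

```python
import struct

SLOT = struct.Struct("<q")  # assumed: one 64-bit timestamp per inode
BLOCK_SIZE = 4096           # assumed block size


class ToyAtimeFile:
    """Toy model only. Layout and names are made up for illustration."""

    def __init__(self):
        self.data = bytearray()    # stands in for the regular file
        self.dirty_blocks = set()  # blocks that would need writing out

    def touch(self, inum, when):
        """Record an access time without touching any inode record."""
        offset = inum * SLOT.size
        end = offset + SLOT.size
        if len(self.data) < end:
            self.data.extend(bytes(end - len(self.data)))  # zero fill
        SLOT.pack_into(self.data, offset, when)
        self.dirty_blocks.add(offset // BLOCK_SIZE)

    def atime(self, inum):
        """Return the stored access time, or 0 if never recorded."""
        offset = inum * SLOT.size
        if offset + SLOT.size > len(self.data):
            return 0
        return SLOT.unpack_from(self.data, offset)[0]
```

Under these assumed sizes, reads of 512 different files with consecutive
inode numbers dirty a single block of the atime file and leave the inode
table clean.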

Regards,

Daniel