[PATCH] Optimize wait_sb_inodes()

Thu Jun 27 02:40:42 PDT 2013

On Thu, Jun 27, 2013 at 04:22:04PM +0900, OGAWA Hirofumi wrote:
> Dave Chinner <david at fromorbit.com> writes:
> 
> >> Otherwise, vfs can't know the data is whether after sync point or before
> >> sync point, and have to wait or not. FS is using the behavior like
> >> data=journal has tracking of those already, and can reuse it.
> >
> > The VFS writeback code already differentiates between data written
> > during a sync operation and that dirtied after a sync operation.
> > Perhaps you should look at the tagging for WB_SYNC_ALL writeback
> > that write_cache_pages does....
> >
> > But, anyway, we don't have to do that on the waiting side of things.
> > All we need to do is add the inode to a "under IO" list on the bdi
> > when the mapping is initially tagged with pages under writeback,
> > and remove it from that list during IO completion when the mapping
> > is no longer tagged as having pages under writeback.
> >
> > wait_sb_inodes() just needs to walk that list and wait on each inode
> > to complete IO. It doesn't require *any awareness* of the underlying
> > filesystem implementation or how the IO is actually issued - if
> > there's IO in progress at the time wait_sb_inodes() is called, then
> > it waits for it.
> >
> >> > Fix the root cause of the problem - the sub-optimal VFS code.
> >> > Hacking around it specifically for out-of-tree code is not the way
> >> > things get done around here...
> >> 
> >> I'm thinking the root cause is vfs can't have knowledge of FS internal,
> >> e.g. FS is handling data transactional way, or not.
> >
> > If the filesystem has transactional data/metadata that the VFS is
> > not tracking, then that is what the ->sync_fs call is for. i.e. so
> > the filesystem can then do what ever extra writeback/waiting it
> > needs to do that the VFS is unaware of.
> >
> > We already cater for what Tux3 needs in the VFS - all you've done is
> > found an inefficient algorithm that needs fixing.
> 
> write_cache_pages() is library function to be called from per-FS. So, it
> is not under vfs control can be assume already. And it doesn't do right
> things via write_cache_pages() for data=journal, because it handles for
> each inodes, not at once. So, new dirty data can be inserted while
> marking.

Sure it can. But that newly dirtied data has occurred after the data
integrity writeback call was begun, so it's not part of what the
writeback code call needs to write back. We are quite entitled to
ignore it for the purposes of a data integrity sync because it as
dirtied *after* write_cache_pages() was asked to sync the range of
the inode.

IOWs, the VFS draws a line in the sand at a point in time when each
inode is written for a data integrity sync. You have to do that
somewhere, and there's little point in making that a global barrier
when it is not necessary to do so.

tux3 draws a different line in the sand, as does ext3/4
data=journal.  In effect, tux3 and ext3/4 data=journal define a
global point in time that everything is "in sync", and that's way
above what is necessary for a sync(2) operation.  The VFS already
has equivalent functionality - it's the state we force filesystems
into when they are frozen. i.e. freezing a filesystem forces it down
into a state where it is transactionally consistent on disk w.r.t
both data and metadata.  sync(2) does not require these
"transactionally consistent" semantics, so the VFS does not try to
provide them.

Anyway, this is a moot discussion. I've already got prototype code
that fixes the wait_sb_inodes() problem as somebody is having
problems with many concurrent executions of wait_sb_inodes() causing
severe lock contention...

Cheers,

Dave.
-- 
Dave Chinner
david at fromorbit.com