Draft to optimize wait_sb_indoes()

Daniel Phillips daniel.raymond.phillips at gmail.com
Mon Jun 24 16:06:43 PDT 2013


A beautifully clear explanation and big efficiency gain from a tiny core
patch.
On Jun 24, 2013 3:03 AM, "OGAWA Hirofumi" <hirofumi at mail.parknet.co.jp>
wrote:

> Hi,
>
> On the following stress, sync(2) became top of cpu load.
>
>         fsstress -s 1 -n 50000 -p 3 -d `pwd`
>
> After profile by perf, cpu load was lock contention of inode_sb_list_lock.
>
> sync(2) is data integrity operation, so it has to make sure all dirty
> data was written before sync(2) point. The bdi flusher flushes current
> dirty data and wait those. But, if there is in-flight I/O before
> sync_inodes_sb(), sync_inodes_sb() can be done before prior in-flight I/O.
>
> So, wait_sb_inodes() is walking the all inodes on sb to make sure
> in-flight I/O was done too. When it is walking inodes,
> inode_sb_list_lock is contending with other operations like
> create(2). This is the cause of cpu load.
>
> On another view, wait_sb_inodes() would (arguably) be necessary for
> legacy FSes. But, for example, if data=journal on ext*, wait_sb_inodes()
> would be more than useless, because ext* can be done it by own
> transaction list (and more efficient way).
>
> Likewise, on tux3, the state is same with data=journal.
>
> Also, even if data=ordered, ext* might be able to check in-flight I/O by
> ordered data list (with some new additional check, I'm not sure).
>
>
> So, this patch adds the sb->s_op->wait_inodes() handler to replace
> wait_sb_inodes(). With this, FSes can optimize it by using own
> internal knowledge.
>
> [Alternative idea to optimize this, push down wait_sb_inodes() into
> ->sync_fs() handler on all FSes.]
>
>
> The following is profile of result on tux3 (->wait_inodes() handler is
> noop function).
>
> -  13.11%  fsstress  [kernel.kallsyms]  [k] sync_inodes_sb
>    - sync_inodes_sb
>       - 99.97% sync_inodes_one_sb
>            iterate_supers
>            sys_sync
>            system_call
>            sync
> -   9.39%  fsstress  [kernel.kallsyms]  [k] _raw_spin_lock
>    - _raw_spin_lock
>       + 62.72% sync_inodes_sb
>       + 7.46% _atomic_dec_and_lock
>       + 6.15% inode_add_lru
>       + 4.43% map_region
>       + 3.07% __find_get_buffer
>       + 2.71% sync_inodes_one_sb
>       + 1.77% tux3_set_buffer_dirty_list
>       + 1.43% find_inode
>       + 1.02% iput
>       + 0.69% dput
>       + 0.57% __tux3_get_block
>       + 0.53% shrink_dcache_parent
>       + 0.53% __d_instantiate
>
> Before patch:
>
> real 2m40.994s
> user 0m1.344s
> sys  0m15.832s
>
> After patch:
>
> real 2m34.748
> user 0m1.360s
> sys  0m5.356s
>
> Signed-off-by: OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
> ---
>
>  fs/fs-writeback.c  |    5 ++++-
>  include/linux/fs.h |    1 +
>  2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
> --- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes   2013-06-24
> 18:36:45.000000000 +0900
> +++ tux3fs-hirofumi/include/linux/fs.h  2013-06-24 18:36:45.000000000 +0900
> @@ -1595,6 +1595,7 @@ struct super_operations {
>         int (*write_inode) (struct inode *, struct writeback_control *wbc);
>         int (*drop_inode) (struct inode *);
>         void (*evict_inode) (struct inode *);
> +       void (*wait_inodes)(struct super_block *);
>         void (*put_super) (struct super_block *);
>         int (*sync_fs)(struct super_block *sb, int wait);
>         int (*freeze_fs) (struct super_block *);
> diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
> --- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes    2013-06-24
> 18:36:45.000000000 +0900
> +++ tux3fs-hirofumi/fs/fs-writeback.c   2013-06-24 18:36:45.000000000 +0900
> @@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
>         bdi_queue_work(sb->s_bdi, &work);
>         wait_for_completion(&done);
>
> -       wait_sb_inodes(sb);
> +       if (sb->s_op->wait_inodes)
> +               sb->s_op->wait_inodes(sb);
> +       else
> +               wait_sb_inodes(sb);
>  }
>  EXPORT_SYMBOL(sync_inodes_sb);
>
> _
>
> --
> OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
>
> _______________________________________________
> Tux3 mailing list
> Tux3 at phunq.net
> http://phunq.net/mailman/listinfo/tux3
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://phunq.net/pipermail/tux3/attachments/20130624/3d6fa65c/attachment.html>


More information about the Tux3 mailing list