Draft to optimize wait_sb_indoes()

Raymond Jennings shentino at gmail.com
Mon Jun 24 17:27:28 PDT 2013


Also I think it's spelled "inodes", not "indoes"

If you have this incorrect spelling anywhere in a script or document I
suggest fixing it.


On Mon, Jun 24, 2013 at 4:06 PM, Daniel Phillips <
daniel.raymond.phillips at gmail.com> wrote:

> A beautifully clear explanation and big efficiency gain from a tiny core
> patch.
> On Jun 24, 2013 3:03 AM, "OGAWA Hirofumi" <hirofumi at mail.parknet.co.jp>
> wrote:
>
>> Hi,
>>
>> On the following stress, sync(2) became top of cpu load.
>>
>>         fsstress -s 1 -n 50000 -p 3 -d `pwd`
>>
>> After profile by perf, cpu load was lock contention of inode_sb_list_lock.
>>
>> sync(2) is data integrity operation, so it has to make sure all dirty
>> data was written before sync(2) point. The bdi flusher flushes current
>> dirty data and wait those. But, if there is in-flight I/O before
>> sync_inodes_sb(), sync_inodes_sb() can be done before prior in-flight I/O.
>>
>> So, wait_sb_inodes() is walking the all inodes on sb to make sure
>> in-flight I/O was done too. When it is walking inodes,
>> inode_sb_list_lock is contending with other operations like
>> create(2). This is the cause of cpu load.
>>
>> On another view, wait_sb_inodes() would (arguably) be necessary for
>> legacy FSes. But, for example, if data=journal on ext*, wait_sb_inodes()
>> would be more than useless, because ext* can be done it by own
>> transaction list (and more efficient way).
>>
>> Likewise, on tux3, the state is same with data=journal.
>>
>> Also, even if data=ordered, ext* might be able to check in-flight I/O by
>> ordered data list (with some new additional check, I'm not sure).
>>
>>
>> So, this patch adds the sb->s_op->wait_inodes() handler to replace
>> wait_sb_inodes(). With this, FSes can optimize it by using own
>> internal knowledge.
>>
>> [Alternative idea to optimize this, push down wait_sb_inodes() into
>> ->sync_fs() handler on all FSes.]
>>
>>
>> The following is profile of result on tux3 (->wait_inodes() handler is
>> noop function).
>>
>> -  13.11%  fsstress  [kernel.kallsyms]  [k] sync_inodes_sb
>>    - sync_inodes_sb
>>       - 99.97% sync_inodes_one_sb
>>            iterate_supers
>>            sys_sync
>>            system_call
>>            sync
>> -   9.39%  fsstress  [kernel.kallsyms]  [k] _raw_spin_lock
>>    - _raw_spin_lock
>>       + 62.72% sync_inodes_sb
>>       + 7.46% _atomic_dec_and_lock
>>       + 6.15% inode_add_lru
>>       + 4.43% map_region
>>       + 3.07% __find_get_buffer
>>       + 2.71% sync_inodes_one_sb
>>       + 1.77% tux3_set_buffer_dirty_list
>>       + 1.43% find_inode
>>       + 1.02% iput
>>       + 0.69% dput
>>       + 0.57% __tux3_get_block
>>       + 0.53% shrink_dcache_parent
>>       + 0.53% __d_instantiate
>>
>> Before patch:
>>
>> real 2m40.994s
>> user 0m1.344s
>> sys  0m15.832s
>>
>> After patch:
>>
>> real 2m34.748
>> user 0m1.360s
>> sys  0m5.356s
>>
>> Signed-off-by: OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
>> ---
>>
>>  fs/fs-writeback.c  |    5 ++++-
>>  include/linux/fs.h |    1 +
>>  2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff -puN include/linux/fs.h~tux3-fix-wait_sb_inodes include/linux/fs.h
>> --- tux3fs/include/linux/fs.h~tux3-fix-wait_sb_inodes   2013-06-24
>> 18:36:45.000000000 +0900
>> +++ tux3fs-hirofumi/include/linux/fs.h  2013-06-24 18:36:45.000000000
>> +0900
>> @@ -1595,6 +1595,7 @@ struct super_operations {
>>         int (*write_inode) (struct inode *, struct writeback_control
>> *wbc);
>>         int (*drop_inode) (struct inode *);
>>         void (*evict_inode) (struct inode *);
>> +       void (*wait_inodes)(struct super_block *);
>>         void (*put_super) (struct super_block *);
>>         int (*sync_fs)(struct super_block *sb, int wait);
>>         int (*freeze_fs) (struct super_block *);
>> diff -puN fs/fs-writeback.c~tux3-fix-wait_sb_inodes fs/fs-writeback.c
>> --- tux3fs/fs/fs-writeback.c~tux3-fix-wait_sb_inodes    2013-06-24
>> 18:36:45.000000000 +0900
>> +++ tux3fs-hirofumi/fs/fs-writeback.c   2013-06-24 18:36:45.000000000
>> +0900
>> @@ -1401,7 +1401,10 @@ void sync_inodes_sb(struct super_block *
>>         bdi_queue_work(sb->s_bdi, &work);
>>         wait_for_completion(&done);
>>
>> -       wait_sb_inodes(sb);
>> +       if (sb->s_op->wait_inodes)
>> +               sb->s_op->wait_inodes(sb);
>> +       else
>> +               wait_sb_inodes(sb);
>>  }
>>  EXPORT_SYMBOL(sync_inodes_sb);
>>
>> _
>>
>> --
>> OGAWA Hirofumi <hirofumi at mail.parknet.co.jp>
>>
>> _______________________________________________
>> Tux3 mailing list
>> Tux3 at phunq.net
>> http://phunq.net/mailman/listinfo/tux3
>>
>
> _______________________________________________
> Tux3 mailing list
> Tux3 at phunq.net
> http://phunq.net/mailman/listinfo/tux3
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://phunq.net/pipermail/tux3/attachments/20130624/f3996fae/attachment.html>


More information about the Tux3 mailing list