page forking

Wed Mar 18 13:11:49 PDT 2015

Hi Raymond,

Sorry for the long lag on the reply. I have been fully occupied with fsync
optimization work for the last little while.

On Thursday, February 26, 2015 4:27:26 PM PDT, Raymond Jennings wrote:
> On Mon, 2015-02-16 at 19:20 +0900, OGAWA Hirofumi wrote:
>> Raymond Jennings <shentino at gmail.com> writes:
>> 
>>> I'm a bit curious about page forking.
>>> 
>>> What exactly is it, how useful is it outside of tux3, and how easy
>>> will it be to merge into the mainline?
>> 
>> It is possible to try though. I'm still busy to implement other tux3
>> codes. (Working on better fsync/sync)
>
> I would honestly suggest working on pageforking first.  Once it's merged
> you can play with the other stuff.

From the point of view of other filesystem developers, page forking is still
black magic. They know it is something we talk about a lot, they know
we make impressive benchmark claims from time to time (which are true), but
until they see it in mainline, it does not seem real.

My assessment is that if we drop everything and spend all our time trying
to merge page forking in order to benefit a customer base that does not
exist, we will never get merged. In other words, it would be project
suicide, which is why we are not going that route.

Instead, we will concentrate on working with core devs, particularly Jan
Kara, to generalize the bdi writeback algorithm to add a per-superblock
writeback capability. We already had a go-around on that last summer,
and we are now about ready to pick up where that left off.

>>> If it's useful for more than just tux3 I think it should be merged as
>>> a separate patch.
>> 
>> For now, it is low priority for me. Because FS needs to some rewrite to
>> use page fork model, and it will take long time discussion/flames to
>> merge. (There were some discussions on lkml in past.)
>
> If page forking is beneficial to more than just tux3, I suggest merging
> that first and then working on getting tux3 merged.  Since tux3 depends
> on page forking.
>
> I'd have to agree with them that it should be merged separately.
>
> Either that or find a way for tux3 to not depend on page forking.

Never going to happen. Take away page forking and you might as well scrap
the entire design. I am not sure if this is clear to you, but page forking
is implemented inside Tux3, with only minor changes to core to support it.
Moving it into core would not benefit Tux3 at all, and nobody else wants
it badly enough to do the significant amount of work required.

Basically, the only practical way that page forking will go into core is
if somebody wants to get a master's thesis out of it, for example by
modifying Ext4 to use it.

So thank you for the suggestion, but we have a pretty clear idea what
our merge path is. We will work through the per-superblock bdi flush
design issues with core devs. Our current core patch for bdi flush is
essentially a workaround that does not entirely get rid of the old,
broken idea of walking dirty inode lists to do writeback scheduling.
It is a pragmatic approach that works pretty well, but falls well
short of elegant.

To improve it, we have in mind a modified bdi flush algorithm that
does not need to examine the dirty inode lists to do writeback
scheduling, but just keeps track of inode dirty times instead.
Filesystems using the old, per-inode writeback scheme (currently all
filesystems except Tux3) will still get the per-inode flush calls
in a pattern closely approximating the current behavior. Tux3 will
receive its per-sb flush calls in approximately the same pattern
as now, but bdi flush will do it without needing to access dirty
inode lists at all, essentially turning the inode dirty link field
into fs-private data.

This approach solves the tricky question: how can per-inode
and per-superblock filesystems play together nicely on the same
device? Our answer: separate the inode dirty time tracking from
the inode flushing, and otherwise use a very similar algorithm.
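
To make the idea concrete, here is a minimal toy model in Python. It is a
user-space sketch under stated assumptions, not the Tux3 core patch and not
kernel code: the names Superblock, DirtyTimeTracker, mark_dirty and
flush_expired, and the fixed expiry interval, are all hypothetical. It shows
a flusher that only records inode dirty times and then issues either
per-inode or per-superblock flush calls, without ever walking a filesystem's
own dirty inode list.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass
class Superblock:
    """Hypothetical stand-in for a mounted filesystem."""
    name: str
    per_sb_flush: bool                      # True for a Tux3-style filesystem
    calls: list = field(default_factory=list)

    def flush_inode(self, ino):
        # Old-style scheme: the flusher asks for one inode at a time.
        self.calls.append(("inode", ino))

    def flush_sb(self):
        # Per-superblock scheme: the filesystem decides what to write.
        self.calls.append(("sb", None))


class DirtyTimeTracker:
    """Toy flusher that keeps only (dirty time, inode) records.

    It never looks at any list owned by the filesystem, so in this
    model the dirty list is effectively fs-private data.
    """

    def __init__(self, expire):
        self.expire = expire                # assumed fixed expiry interval
        self.heap = []                      # (dirty_time, seq, sb, ino)
        self.pending = {}                   # (sb.name, ino) -> dirty_time
        self.seq = itertools.count()        # tie-breaker for equal times

    def mark_dirty(self, sb, ino, now):
        key = (sb.name, ino)
        if key not in self.pending:         # keep the first dirty time
            self.pending[key] = now
            heapq.heappush(self.heap, (now, next(self.seq), sb, ino))

    def flush_expired(self, now):
        while self.heap and self.heap[0][0] + self.expire <= now:
            dirtied, _, sb, ino = heapq.heappop(self.heap)
            if self.pending.get((sb.name, ino)) != dirtied:
                continue                    # stale record, already flushed
            if sb.per_sb_flush:
                sb.flush_sb()
                # One per-sb flush covers every tracked inode of that sb.
                for key in [k for k in self.pending if k[0] == sb.name]:
                    del self.pending[key]
            else:
                sb.flush_inode(ino)
                del self.pending[(sb.name, ino)]
```

In this model both kinds of filesystem share one tracker and one expiry
rule. Only the final dispatch differs, which is the sense in which dirty
time tracking is separated from flushing. A real implementation would also
have to deal with locking, memory pressure and sync requests, none of which
is modeled here.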

This work is going to start pretty soon, after the current fsync
work and the upcoming ENOSPC work. We will need to work with core
devs on it, and we will need to supply a patch to prove that the
approach is valid and does not break existing filesystems. The
thing is, there is already general agreement that bdi flush needs
to be modernized, so we will not be pushing uphill on that.

Regards,

Daniel