From daniel at phunq.net Wed Mar 18 13:11:49 2015
From: daniel at phunq.net (Daniel Phillips)
Date: Wed, 18 Mar 2015 13:11:49 -0700
Subject: page forking
In-Reply-To: <1424996846.8393.1.camel@avalar>
References: <87lhjyp37k.fsf@mail.parknet.co.jp> <1424996846.8393.1.camel@avalar>
Message-ID: <4cde1903-9ca8-4ddd-bff3-5e2050f68ccc@phunq.net>

Hi Raymond,

Sorry for the long lag on the reply, I have been fully occupied with
fsync optimization work for the last little while.

On Thursday, February 26, 2015 4:27:26 PM PDT, Raymond Jennings wrote:
> On Mon, 2015-02-16 at 19:20 +0900, OGAWA Hirofumi wrote:
>> Raymond Jennings writes:
>>
>>> I'm a bit curious about page forking.
>>>
>>> What exactly is it, how useful is it outside of tux3, and how easy
>>> will it be to merge into the mainline?
>>
>> It is possible to try though. I'm still busy to implement other tux3
>> codes. (Working on better fsync/sync)
>
> I would honestly suggest working on pageforking first. Once it's merged
> you can play with the other stuff.

From the point of view of other filesystem developers, page forking is
still black magic. They know it is something we talk about a lot, and they
know we make impressive benchmark claims from time to time (which are
true), but until they see it in mainline, it does not seem real.

My assessment is, if we drop everything and spend all our time trying to
merge page forking in order to benefit a customer base that does not
exist, we will never ever get merged. In other words, project suicide,
which is why we are not going to go that route. Instead, we will
concentrate on working with core devs, particularly Jan Kara, to
generalize the bdi writeback algorithm to add a per-superblock writeback
capability. We already had a go-around on that last summer, and we are now
about ready to pick up where that left off.

>>> If it's useful for more than just tux3 I think it should be merged as
>>> a separate patch.
>>
>> For now, it is low priority for me. Because FS needs to some rewrite to
>> use page fork model, and it will take long time discussion/flames to
>> merge. (There were some discussions on lkml in past.)
>
> If page forking is beneficial to more than just tux3, I suggest merging
> that first and then working on getting tux3 merged. Since tux3 depends
> on page forking.
>
> I'd have to agree with them that it should be merged separately.
>
> Either that or find a way for tux3 to not depend on page forking.

Never going to happen. Take away page forking and you might as well scrap
the entire design. I am not sure whether this is clear to you, but page
forking is implemented inside Tux3, with only minor changes to core to
support it. Moving it into core would not benefit Tux3 at all, and nobody
else wants it badly enough to do the significant amount of work required.
Basically, the only practical way page forking will go into core is if
somebody wants to get a master's thesis out of it, for example by
modifying Ext4 to use it.

So thank you for the suggestion, but we have a pretty clear idea what our
merge path is. We will work through the per-superblock bdi flush design
issues with core devs. Our current core patch for bdi flush is essentially
a workaround that does not entirely get rid of the old, broken idea of
walking dirty inode lists to do writeback scheduling. It is a pragmatic
approach that works pretty well, but falls well short of elegant.
To improve it, we have in mind a modified bdi flush algorithm that does
not need to examine the dirty inode lists to do writeback scheduling, but
just keeps track of inode dirty times instead. Filesystems using the old,
per-inode writeback scheme (currently all filesystems except Tux3) will
still get their per-inode flush calls in a pattern closely approximating
the current behavior. Tux3 will receive its per-sb flush calls in
approximately the same pattern as now, but bdi flush will do it without
needing to access dirty inode lists at all, essentially turning the inode
dirty link field into fs-private data.

This approach solves the tricky question: how can per-inode and
per-superblock filesystems play together nicely on the same device? Our
answer: separate the inode dirty time tracking from the inode flushing,
and otherwise use a very similar algorithm.

This work is going to start pretty soon, after the current fsync work and
the upcoming ENOSPC work. We will need to work with core devs on it, and
we will need to supply a patch to prove that the approach is valid and
does not break existing filesystems. The thing is, there is already
general agreement that bdi flush needs to be modernized, so we will not
be pushing uphill on that.
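To make the split concrete, here is a toy userspace sketch of the
scheduling idea, nothing like the actual patch: the flusher keeps only
dirty times, and at flush time either calls the classic per-inode
writeback path or hands the whole superblock to an fs-provided hook. All
the names here (struct super, flush_sb, dirty_when and so on) are invented
for illustration and are not current kernel interfaces.

/*
 * Toy model of splitting dirty *tracking* from dirty *flushing*.
 * Names and structures are invented; this is only the scheduling idea,
 * not the kernel patch.
 */
#include <stdio.h>

struct super {
    const char *name;
    int per_sb;                          /* 1: fs flushes itself (Tux3-style) */
    void (*flush_sb)(struct super *sb);  /* per-sb hook, used when per_sb */
};

struct inode {
    struct super *sb;
    int ino;
    long dirty_when;                     /* dirty time, all the flusher keeps */
    int dirty;
};

/* Classic path: core writes back one inode at a time. */
static void flush_inode(struct inode *inode)
{
    printf("  core: writeback inode %d of %s (dirty since %ld)\n",
           inode->ino, inode->sb->name, inode->dirty_when);
    inode->dirty = 0;
}

/* Tux3-style path: core only says "your turn", fs does its own ordering. */
static void tux3_flush_sb(struct super *sb)
{
    printf("  %s: fs-internal flush, core never touched its inode lists\n",
           sb->name);
}

/*
 * The flusher walks dirty records ordered by dirty time.  Per-inode
 * filesystems get flush_inode() as before; a per-sb filesystem gets one
 * superblock call per pass and handles its own inodes internally.
 */
static void bdi_flush(struct inode **dirty, int n)
{
    for (int i = 0; i < n; i++) {
        struct inode *inode = dirty[i];
        struct super *sb = inode->sb;

        if (!inode->dirty)
            continue;
        if (sb->per_sb) {
            sb->flush_sb(sb);
            /* one call covers every dirty inode of this sb in this pass */
            for (int j = i; j < n; j++)
                if (dirty[j]->sb == sb)
                    dirty[j]->dirty = 0;
        } else {
            flush_inode(inode);
        }
    }
}

int main(void)
{
    struct super ext4 = { "ext4", 0, NULL };
    struct super tux3 = { "tux3", 1, tux3_flush_sb };
    struct inode a = { &ext4, 1, 100, 1 };
    struct inode b = { &tux3, 2, 101, 1 };
    struct inode c = { &ext4, 3, 102, 1 };
    struct inode *list[] = { &a, &b, &c };   /* already ordered by dirty_when */

    bdi_flush(list, 3);
    return 0;
}

The only point of the sketch is that the flusher's state shrinks to dirty
times, so a per-sb filesystem can be handed one call per pass while
per-inode filesystems keep their existing calling pattern.

Regards,

Daniel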