[FYI] tux3: Core changes
jack at suse.cz
Wed May 27 01:41:38 PDT 2015
On Tue 26-05-15 13:22:38, Daniel Phillips wrote:
> On 05/26/2015 02:00 AM, Jan Kara wrote:
> > On Tue 26-05-15 01:08:56, Daniel Phillips wrote:
> >> On Tuesday, May 26, 2015 12:09:10 AM PDT, Jan Kara wrote:
> >>> E.g. video drivers (or infiniband or direct IO for that matter) which
> >>> have buffers in user memory (may be mmapped file), grab references to pages
> >>> and hand out PFNs of those pages to the hardware to store data in them...
> >>> If you fork a page after the driver has handed PFNs to the hardware, you've
> >>> just lost all the writes hardware will do.
> >> Hi Jan,
> >> The page forked because somebody wrote to it with write(2) or mmap write at
> >> the same time as a video driver (or infiniband or direct IO) was
> >> doing io to it. Isn't the application trying hard to lose data in
> >> that case? It would not need page fork to lose data that way.
> > So I can think of two valid uses:
> > 1) You setup IO to part of a page and modify from userspace a different
> > part of a page.
> Suppose the use case is reading textures from video memory into a mmapped
> file, and at the same time, the application is allowed to update the
> textures in the file via mmap or write(2). Fork happens at mkwrite time.
> If the page is already dirty, we do not fork it. The video API must have
> made the page writable and dirty, so I do not see an issue.
So there are a few things to keep in mind:
1) There is no such thing as a "writeable" page. A page is always writeable
(at least on the x86 architecture). When a page is mapped into one or more
virtual address spaces, each *mapping* can be either writeable or read-only.
page_mkwrite() changes a mapping from read-only to writeable, but the kernel /
hardware are free to write to the page regardless of any mapping.
2) When the kernel / hardware writes to a page, it first modifies the page
and only then marks it dirty.
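To make the two points above concrete, here is a small userspace model (not kernel code; all structure and function names are made up for illustration) of the distinction: write permission lives in the *mapping*, while a kernel or hardware write bypasses the mapping and dirties the page only after the data has landed.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical model of a page and of one mapping of it. */
struct page_model {
    char data[8];
    bool dirty;              /* PG_dirty analogue */
};

struct mapping_model {
    struct page_model *page;
    bool writable;           /* pte write-permission analogue */
};

/* Userspace store through a mapping: blocked while the mapping is
 * read-only (this is the point where page_mkwrite() would be called
 * to upgrade it to writeable). */
static bool user_write(struct mapping_model *m, const char *buf)
{
    if (!m->writable)
        return false;        /* fault; page_mkwrite() would run here */
    strncpy(m->page->data, buf, sizeof m->page->data);
    m->page->dirty = true;
    return true;
}

/* Kernel/hardware write: ignores mapping permissions entirely and,
 * as described above, modifies the page first and marks it dirty
 * only afterwards. */
static void hw_write(struct page_model *p, const char *buf)
{
    strncpy(p->data, buf, sizeof p->data);
    p->dirty = true;         /* dirtied only after the data landed */
}
```

So a read-only mapping stops the application, but never the hardware.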
So what can happen in this scenario is:
1) You hand kernel a part of a page as a buffer. page_mkwrite() happens,
page is dirtied, kernel notes a PFN of the page somewhere internally.
2) Writeback comes and starts writeback for the page.
3) Kernel ships the PFN to the hardware.
4) Userspace comes and wants to write to the page (different part than the
HW is instructed to use). page_mkwrite is called, page is forked.
Userspace writes to the forked page.
5) HW stores its data in the original page.
Userspace never sees the data from the HW! That is data corruption where,
without page forking, everything would have worked just fine.
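The five steps above can be replayed single-threaded in a toy userspace model (names are illustrative, not Tux3's or the kernel's actual structures): the "hardware" keeps a pointer to the original page, the fork gives userspace a copy, and the hardware's store never becomes visible to userspace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical page: just a data buffer. */
struct pg {
    char data[16];
};

/* Step 4: page_mkwrite() forks the page; userspace sees the copy
 * from now on. (No allocation-failure handling in this sketch.) */
static struct pg *fork_page(struct pg *orig)
{
    struct pg *copy = malloc(sizeof *copy);
    memcpy(copy->data, orig->data, sizeof copy->data);
    return copy;
}

/* Replay of the race; returns true iff userspace can see the
 * hardware's write afterwards. */
static bool replay_race(void)
{
    struct pg *orig = calloc(1, sizeof *orig);
    struct pg *hw_target = orig;            /* steps 1+3: PFN handed to HW   */
    struct pg *user_view = fork_page(orig); /* step 4: fork on page_mkwrite  */
    strcpy(user_view->data, "user");        /* step 4: userspace write       */
    strcpy(hw_target->data + 8, "HW");      /* step 5: HW stores to original */
    bool visible = strcmp(user_view->data + 8, "HW") == 0;
    free(orig);
    free(user_view);
    return visible;
}
```

The hardware's bytes land in the original page while userspace is looking at the fork, so the write is silently lost.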
Another possible scenario:
1) Userspace app tells kernel to setup a HW buffer in a page.
2) Userspace app fills the page with data -> page_mkwrite is called, page is
dirtied.
3) Userspace app tells kernel to ship the buffer to video HW.
4) Writeback comes and starts writeback for the page.
5) Video HW is done with the page. Userspace app fills a new set of data into
the page -> page_mkwrite is called, page is forked.
6) Userspace app tells kernel to ship the buffer to video HW. But the HW gets
the old data from the original page.
Again, a data corruption issue where previously things worked fine.
> > 2) At least for video drivers there is one ioctl() which creates object
> > with buffers in memory and another ioctl() to actually ship it to hardware
> > (may be called repeatedly). So in theory app could validly dirty the pages
> > before it ships them to hardware. If this happens repeatedly and interacts
> > badly with background writeback, you will end up with a forked page in a
> > buffer and from that point on things are broken.
> Writeback does not fork pages. An app may dirty a page that is in process
> of being shipped to hardware (must be a distinct part of the page, or it is
> a race) and the data being sent to hardware will not be disturbed. If there
> is an issue here, I do not see it.
> > So my opinion is: Don't fork the page if page_count is elevated. You can
> > just wait for the IO if you need stable pages in that case. It's slow but
> > it's safe and it should be pretty rare. Is there any problem with that?
> That would be our fallback if anybody discovers a specific case where page
> fork breaks something, which so far has not been demonstrated.
> With a known fallback, it is hard to see why we should delay merging over
> that. Perfection has never been a requirement for merging filesystems. On
> the contrary, imperfection is a reason for merging, so that the many
> eyeballs effect may prove its value.
Sorry, but you've got several people telling you they are concerned about
the safety of your approach. That is the many eyeballs effect. And data
corruption issues aren't problems you can just wave away with "let's wait
and see whether it really happens".
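For reference, the conservative fallback suggested earlier in this thread (don't fork when the page's reference count is elevated; wait for the in-flight I/O instead, as with stable pages) can be sketched as a tiny decision function. This is a hypothetical userspace model, not actual Tux3 or VFS code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page's reference state. */
struct pg_ref {
    int refcount;            /* page_count() analogue           */
    bool under_writeback;    /* PG_writeback analogue           */
};

enum mkwrite_action { FORK_PAGE, WAIT_FOR_IO };

/* One reference is held by the page cache itself; anything above
 * that means a driver, direct IO, or RDMA user may be holding the
 * PFN, so forking could lose their writes. Slow but safe: wait. */
static enum mkwrite_action mkwrite_policy(const struct pg_ref *p)
{
    if (p->refcount > 1 || p->under_writeback)
        return WAIT_FOR_IO;  /* stable-page behaviour */
    return FORK_PAGE;        /* no external users: forking is safe */
}
```

Whether the extra wait matters in practice would need measurement, but it removes the scenarios described above by construction.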
Jan Kara <jack at suse.cz>
SUSE Labs, CR