[FYI] tux3: Core changes

Fri May 15 04:00:41 PDT 2015

On Fri, May 15, 2015 at 02:54:48AM -0700, Daniel Phillips wrote:
> 
> 
> On 05/15/2015 01:09 AM, Mel Gorman wrote:
> > On Thu, May 14, 2015 at 11:06:22PM -0400, Rik van Riel wrote:
> >> On 05/14/2015 08:06 PM, Daniel Phillips wrote:
> >>>> The issue is that things like ptrace, AIO, infiniband
> >>>> RDMA, and other direct memory access subsystems can take
> >>>> a reference to page A, which Tux3 clones into a new page B
> >>>> when the process writes it.
> >>>>
> >>>> However, while the process now points at page B, ptrace,
> >>>> AIO, infiniband, etc will still be pointing at page A.
> >>>>
> >>>> This causes the process and the other subsystem to each
> >>>> look at a different page, instead of at shared state,
> >>>> causing ptrace to do nothing, AIO and RDMA data to be
> >>>> invisible (or corrupted), etc...
> >>>
> >>> Is this a bit like page migration?
> >>
> >> Yes. Page migration will fail if there is an "extra"
> >> reference to the page that is not accounted for by
> >> the migration code.
> > 
> > When I said it's not like page migration, I was referring to the fact
> > that a COW on a pinned page for RDMA is a different problem to page
> > migration. The COW of a pinned page can lead to lost writes or
> > corruption depending on the ordering of events.
> 
> I see the lost writes case, but not the corruption case,

Data corruption can occur depending on the ordering of events and the
application's expectations. If a process starts IO, RDMA pins the page
for read, and forks are combined with writes from another thread, then
when the IO completes the reads may not be visible. The application may
take improper action at that point.

Users of RDMA are typically expected to use MADV_DONTFORK to avoid this
class of problem.
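
As a rough illustration of that advice, the sketch below marks a buffer
with MADV_DONTFORK before it would be handed to an RDMA stack. The helper
name alloc_dma_buffer() is hypothetical and the registration step itself
is omitted; only mmap(2) and madvise(2) are real interfaces here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical helper: allocate an anonymous, page-aligned buffer and mark
 * it MADV_DONTFORK so that a child created by fork() does not inherit the
 * mapping. The parent's pages are then not made copy-on-write by the fork,
 * so a later write by the parent cannot move it off a page that a device
 * still has pinned. Returns NULL on failure.
 */
void *alloc_dma_buffer(size_t len)
{
    assert(len > 0);

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return NULL;

    /* Exclude the range from any future fork() of this process. */
    if (madvise(buf, len, MADV_DONTFORK) != 0) {
        munmap(buf, len);
        return NULL;
    }
    return buf;
}
```

The child simply has no mapping at that address, so it must not touch the
buffer after the fork.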

You can choose not to define this as data corruption because the kernel
is not directly involved, and that's your call.

> Do you
> mean corruption by changing a page already in writeout? If so,
> don't all filesystems have that problem?
> 

No, the problem is different. Backing devices requiring stable pages will
block the write until the IO is complete. For those that do not require
stable pages it's ok to allow the write as long as the page is dirtied so
that it'll be written out again and no data is lost.

> If RDMA to a mmapped file races with write(2) to the same file,
> maybe it is reasonable and expected to lose some data.
> 

In the RDMA case, there is at least application awareness to work around
the problems. Normally it's ok to have both mapped and write() access
to data although userspace might need a lock to co-ordinate updates and
event ordering.

-- 
Mel Gorman
SUSE Labs