Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
From: Jason Gunthorpe
Date: Fri Sep 18 2020 - 20:33:38 EST
On Fri, Sep 18, 2020 at 01:59:41PM -0700, Linus Torvalds wrote:
> Honestly, if we had a completely *reliable* sign of "this page is
> pinned", then I think the much nicer option would be to just say
> "pinned pages will not be copied at all". Kind of an implicit
> VM_DONTCOPY.
It would be simpler to implement, but it makes the programming model
really sketchy. For instance O_DIRECT is using FOLL_PIN, so imagine
this program:
     CPU0                                   CPU1
 a = malloc(1024);
 b = malloc(1024);
 read(fd, a, 1024); // FD is O_DIRECT
 ...                                       fork()
 *b = ...
 read completes
Here a and b got lucky and both come from the same page due to the
allocator.
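To make "come from the same page" concrete, here is a minimal sketch of the check. same_page() is a hypothetical helper written for this illustration, not an existing libc or kernel function, and it assumes the caller passes the real page size.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if both addresses fall inside the same
 * page of page_size bytes. Two addresses share a page exactly when
 * their page frame indexes (address / page_size) are equal.
 */
static bool same_page(const void *a, const void *b, uintptr_t page_size)
{
	return (uintptr_t)a / page_size == (uintptr_t)b / page_size;
}
```

In the program above this would be called as same_page(a, b, sysconf(_SC_PAGESIZE)). Whether it returns true for two small malloc() results depends entirely on the allocator, which is the point: the application has no control over it.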
In this case the fork() child on CPU1 would be very surprised that
'b' was not mapped into the child.
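The surprise can be approximated today with the explicit mechanism. A rough sketch, assuming Linux: madvise(MADV_DONTFORK) sets VM_DONTCOPY on the range, so it stands in for the "implicit VM_DONTCOPY" idea quoted above. This is not the proposed kernel change, no page is actually pinned here, and child_sees_page() is a hypothetical helper made up for this illustration.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* for MADV_DONTFORK and mincore() */
#endif
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if a forked child still has the page
 * holding 'a' and 'b' mapped, 0 if it does not, -1 on a setup error.
 */
static int child_sees_page(int dontfork)
{
	long psz = sysconf(_SC_PAGESIZE);
	char *page = mmap(NULL, (size_t)psz, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int status;
	pid_t pid;

	if (page == MAP_FAILED)
		return -1;

	char *a = page;		/* stands in for the O_DIRECT buffer */
	char *b = page + 1024;	/* unrelated data on the same page */
	memset(a, 0, 1024);
	strcpy(b, "unrelated");

	/* Explicit stand-in for "pinned pages are not copied at all". */
	if (dontfork && madvise(page, (size_t)psz, MADV_DONTFORK) != 0) {
		munmap(page, (size_t)psz);
		return -1;
	}

	pid = fork();
	if (pid < 0) {
		munmap(page, (size_t)psz);
		return -1;
	}
	if (pid == 0) {
		unsigned char vec;
		/* mincore() fails with ENOMEM on an unmapped range. */
		int mapped = mincore(page, (size_t)psz, &vec) == 0;
		_exit(mapped ? 0 : 1);
	}

	if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status)) {
		munmap(page, (size_t)psz);
		return -1;
	}
	munmap(page, (size_t)psz);
	return WEXITSTATUS(status) == 0;
}
```

child_sees_page(0) should report the page as present in the child, while child_sees_page(1) should report it missing, which takes 'b' away along with 'a' even though only 'a' was ever the I/O buffer.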
Similarly, CPU0 would have silent data corruption if the read didn't
deposit data into 'a' - which is a bug we have today. In this race the
COW break of *b might steal the physical page to the child, and *a
won't see the data. For this reason, John is right, fork needs to
eventually do this for O_DIRECT as well.
The copy on fork nicely fixes all of this weird oddball stuff.
Jason