Re: [PATCH v1 0/7] mm: COW fixes part 3: reliable GUP R/W FOLL_GET of anonymous pages

From: David Hildenbrand
Date: Sat Mar 19 2022 - 07:17:27 EST


On 19.03.22 00:48, Jason Gunthorpe wrote:
> On Tue, Mar 15, 2022 at 03:18:30PM +0100, David Hildenbrand wrote:
>> This is just the natural follow-up of part 2, that will also further
>> reduce "wrong COW" on the swapin path, for example, when we cannot remove
>> a page from the swapcache due to concurrent writeback, or if we have two
>> threads faulting on the same swapped-out page. Fixing O_DIRECT is just a
>> nice side-product :)

Hi Jason,

thanks for the review!

>
> I know I would benefit alot from a description of the swap specific
> issue a bit more. Most of this message talks about clear_refs which I
> do understand a bit better.

Patch #1 contains some additional information. In general, it's the same
issue as with any other mechanism that could get the page mapped R/O
while there is a FOLL_GET | FOLL_WRITE reference to it -- for example,
DMA to that page as happens with our O_DIRECT reproducer.

Part 2 essentially fixed the other cases (i.e., clear_refs), but the
remaining swapout+refault from swapcache case is handled in this series.

>
> Is this talking about what happens after a page gets swapped back in?
> eg the exclusive bit is missing when the page is recreated?

Right, try_to_unmap() was the last remaining case where we'd have lost
the exclusivity information -- it wasn't required for reliable GUP pins
in part 2.

Here is what happens without PG_anon_exclusive:

1. The application uses parts of an anonymous base page for direct I/O,
let's assume the first 512 bytes of page.

fd = open(filename, O_DIRECT| ...);
pread(fd, page, 512, 0);

O_DIRECT will take a FOLL_GET|FOLL_WRITE reference on the page.

2. Reclaim kicks in and wants to swap out the page -- mm/vmscan.c

shrink_page_list() first adds the page to the swapcache and then unmaps
it via try_to_unmap().

After the page was successfully unmapped, pageout() will start
triggering writeback but will realize that there are additional
references on the page (via is_page_cache_freeable()) and fail.

3. The application uses unrelated parts of the page for other purposes
while the DMA has not completed yet, e.g., doing a simple

page[4095]++;

The read access will fault in the page readable from the swap cache in
do_swap_page(). The write access will trigger our COW fault handler. As
we have an additional reference on the page, we will create a copy and
map it into our page table. At this point, the page table and the GUP
reference are out of sync.

4. O_DIRECT completes

The read targets the page that is no longer referenced in the page
tables. For the application, it looks like the pread() never happened, as
the data read via DMA never shows up in the page it now has mapped.
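
To make the userspace side of steps 1 and 3 concrete, here is a minimal
sketch of the access pattern. It is not the actual reproducer: the helper
names are made up for illustration, the accesses run sequentially, and
nothing here forces reclaim. A real reproducer needs the write in step 3
to happen while the O_DIRECT read is still in flight and after the page
was unmapped by reclaim. The 512-byte length is taken from the example
above and assumes the backing device accepts that alignment.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE		/* O_DIRECT on Linux/glibc */
#endif
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: one populated, page-aligned anonymous buffer. */
static unsigned char *alloc_page(size_t *page_size)
{
	long ps = sysconf(_SC_PAGESIZE);
	void *p = NULL;

	if (ps <= 0 || posix_memalign(&p, (size_t)ps, (size_t)ps) != 0)
		return NULL;
	memset(p, 0, (size_t)ps);	/* touch it so the page is populated */
	*page_size = (size_t)ps;
	return p;
}

/*
 * Hypothetical helper for step 1: read the first 512 bytes of 'filename'
 * into the start of 'page' using O_DIRECT. Returns the number of bytes
 * read, or -1 on error.
 */
static ssize_t direct_read_head(const char *filename, unsigned char *page)
{
	int fd = open(filename, O_RDONLY | O_DIRECT);
	ssize_t ret;

	if (fd < 0)
		return -1;
	ret = pread(fd, page, 512, 0);
	close(fd);
	return ret;
}

/* Hypothetical helper for step 3: modify an unrelated byte of the page. */
static void touch_tail(unsigned char *page, size_t page_size)
{
	assert(page_size > 0);
	page[page_size - 1]++;
}
```

In the problematic interleaving, touch_tail() would run from another
thread (or the read would be submitted asynchronously) between the start
and the completion of the I/O issued by direct_read_head().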


With PG_anon_exclusive from series part 2, we don't remember exclusivity
information in try_to_unmap() yet. do_swap_page() cannot restore it as
it has to assume the page is possibly shared.

With this series, we remember exclusivity information in try_to_unmap()
in the SWP PTE. do_swap_page() can restore it. Consequently, our COW
fault handler won't create a wrong copy and we won't go out of sync
between GUP and the page mapped into the page table.
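
The difference can be sketched as a small toy model in plain C. This is
not kernel code: the structures, the function names and the simplified
reuse check (reuse if exclusive or if there is only a single reference,
otherwise copy) are assumptions made purely for illustration. It only
tracks which page the PTE points at and which page the GUP reference
points at, with and without exclusivity being remembered across unmap.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: these types do not mirror the kernel's structures. */
struct toy_page {
	int refcount;		/* mapping + swapcache + GUP references */
	bool anon_exclusive;	/* stand-in for PG_anon_exclusive */
};

struct toy_pte {
	struct toy_page *page;	/* page this entry maps or refers to */
	bool present;		/* false: entry is a swap PTE */
	bool writable;
	bool swp_exclusive;	/* exclusivity remembered in the swap PTE */
};

/* Step 2: reclaim adds the page to the swapcache and unmaps it. */
static void toy_unmap(struct toy_pte *pte, bool remember_exclusive)
{
	pte->page->refcount++;	/* swapcache reference */
	pte->page->refcount--;	/* mapping reference is dropped */
	pte->swp_exclusive = remember_exclusive && pte->page->anon_exclusive;
	pte->page->anon_exclusive = false;
	pte->present = false;
	pte->writable = false;
}

/* Step 3, read access: refault from the swapcache, mapped R/O. */
static void toy_swapin(struct toy_pte *pte)
{
	assert(!pte->present);
	pte->page->refcount++;	/* new mapping reference */
	pte->page->anon_exclusive = pte->swp_exclusive;
	pte->present = true;
}

/* Step 3, write access: reuse the page or replace it with a copy. */
static void toy_write_fault(struct toy_pte *pte, struct toy_page *copy)
{
	if (!pte->page->anon_exclusive && pte->page->refcount > 1) {
		/* Possibly shared: map a private copy instead. */
		pte->page->refcount--;
		copy->refcount = 1;
		copy->anon_exclusive = true;
		pte->page = copy;
	}
	pte->writable = true;
}

/* Returns true if the DMA target is still the page mapped in the PTE. */
static bool toy_dma_lands_in_mapped_page(bool remember_exclusive)
{
	struct toy_page orig = { .refcount = 1, .anon_exclusive = true };
	struct toy_page copy = { 0 };
	struct toy_pte pte = { .page = &orig, .present = true, .writable = true };
	struct toy_page *gup_target;

	orig.refcount++;	/* step 1: O_DIRECT takes its reference */
	gup_target = &orig;

	toy_unmap(&pte, remember_exclusive);	/* step 2 */
	toy_swapin(&pte);			/* step 3, read access */
	toy_write_fault(&pte, &copy);		/* step 3, write access */

	return pte.page == gup_target;		/* step 4 */
}
```

In this model, toy_dma_lands_in_mapped_page(false) returns false because
the write fault maps a copy, and toy_dma_lands_in_mapped_page(true)
returns true because the restored exclusive marker lets the fault handler
reuse the page.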


Hope that helps!

--
Thanks,

David / dhildenb