Re: [PATCH 0/1] mm: restore full accuracy in COW page reuse

From: Andrea Arcangeli
Date: Sat Jan 09 2021 - 22:18:57 EST


Hello Linus,

On Sat, Jan 09, 2021 at 05:19:51PM -0800, Linus Torvalds wrote:
> +#define is_cow_mapping(flags) (((flags) & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE)
> +
> +static inline bool pte_is_pinned(struct vm_area_struct *vma, unsigned long addr, pte_t pte)
> +{
> + struct page *page;
> +
> + if (!is_cow_mapping(vma->vm_flags))
> + return false;
> + if (likely(!atomic_read(&vma->vm_mm->has_pinned)))
> + return false;
> + page = vm_normal_page(vma, addr, pte);
> + if (!page)
> + return false;
> + if (page_mapcount(page) != 1)
> + return false;
> + return page_maybe_dma_pinned(page);
> +}

I just don't see the simplification coming from
09854ba94c6aad7886996bfbee2530b3d8a7f4f4. Instead of checking
page_mapcount above as an optimization, to me it looks much simpler to
check it in a single place, in do_wp_page, that in addition of
optimizing away the superfluous copy, would optimize away the above
complexity as well.

And I won't comment if it's actually safe to skip random pages or
not. All I know is for mprotect and uffd-wp, definitely the above
approach wouldn't work.

In addition I dislike has_pinned and FOLL_PIN.

has_pinned 450 include/linux/mm_types.h * for instance during page table copying for fork().
has_pinned 1021 kernel/fork.c atomic64_set(&mm->pinned_vm, 0);
has_pinned 1239 mm/gup.c atomic_set(&mm->has_pinned, 1);
has_pinned 2579 mm/gup.c atomic_set(&current->mm->has_pinned, 1);
has_pinned 1099 mm/huge_memory.c atomic_read(&src_mm->has_pinned) &&
has_pinned 1213 mm/huge_memory.c atomic_read(&src_mm->has_pinned) &&
has_pinned 819 mm/memory.c if (likely(!atomic_read(&src_mm->has_pinned)))

Are we spending 32bit in mm_struct atomic_t just to call atomic_set(1)
on it? Why isn't it a MMF_HAS_PINNED that already can be set
atomically under mmap_read_lock too? There's bit left free there, we
didn't run out yet to justify wasting another 31 bits. I hope I'm
overlooking something.

The existence of FOLL_LONGTERM is good and makes a difference at times
for writeback if it's on a MAP_SHARED, or it makes difference during
GUP to do a page_migrate before taking the pin, but for the whole rest
of the VM it's irrelevant if it's long or short term, so I'm also
concerned from what Jason mentioned about long term pins being treated
differently within the VM. That to me looks fundamentally as flawed as
introducing inaccuracies in do_wp_page, that even ignoring the
performance implications caused by the inaccuracy, simply prevent to
do some useful things.

I obviously agree all common workloads with no GUP pins and no
clear_refs and no uffd, are way more important to be optimal, but I
haven't seen a single benchmark showing the benefit of not taking the
PG_lock before a page copy or any other runtime benefit coming from
page_count in do_wp_page.

To the contrary now I see additional branches in fork and in various
other paths.

The only thing again where I see page_count provides a tangible
benefit, is the vmsplice attack from child to parent, but that has not
been fully fixed and if page_count is added to fix it in all COW
faults, it'll introduce extra inefficiency to the the very common
important workloads, not only to the special GUP/clear_refs/uffd-wp
workloads as your patch above shows.

Thanks,
Andrea