Re: [PATCH] sched,numa,mm: revert to checking pmd/pte_write instead of VMA flags

From: Rik van Riel
Date: Mon Sep 12 2016 - 11:09:57 EST


On Sun, 2016-09-11 at 17:24 +0100, Mel Gorman wrote:
> On Thu, Sep 08, 2016 at 09:30:53PM -0400, Rik van Riel wrote:
> > Commit 4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining
> > page table manipulations") changed NUMA balancing from _PAGE_NUMA
> > to using PROT_NONE, and was quickly found to introduce a regression
> > with NUMA grouping.
> >
> > It was followed up by these changesets:
> >
> > 53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
> > bea66fbd11af ("mm: numa: group related processes based on VMA flags
> > instead of page table flags")
> > b191f9b106ea ("mm: numa: preserve PTE write permissions across a
> > NUMA hinting fault")
> >
> > The first of those two changesets try alternate approaches to NUMA
> > grouping, which apparently do not work as well as looking at the
> > PTE
> > write permissions.
> >
> > The latter patch preserves the PTE write permissions across a NUMA
> > protection fault. However, it forgets to revert the condition for
> > whether or not to group tasks together back to what it was before
> > 3.19, even though the information is now preserved in the page
> > tables
> > once again.
> >
> > This patch brings the NUMA grouping heuristic back to what it was
> > before changeset 4d9424669946, which the changelogs of subsequent
> > changesets suggest worked best.
> >
> > We have all the information again. We should probably use it.
> >
>
> Patch looks ok other than the comment above the second hunk being out
> of
> date. Out of curiousity, what workload benefitted from this? I saw a
> mix
> of marginal results when I ran this on a 2-socket and 4-socket box.

I did not performance test the change, because I believe
the VM_WRITE test has a small logical error.

Specifically, VM_WRITE is also true for VMAs that are
PROT_WRITE|MAP_PRIVATE, which we do NOT want to group
on. Every shared library mapped on my system seems to
have a (small) read-write VMA:

00007f5adacff000ÂÂÂ1764K r-x-- libc-2.23.so
00007f5adaeb8000ÂÂÂ2044K ----- libc-2.23.so
00007f5adb0b7000ÂÂÂÂÂ16K r---- libc-2.23.so
00007f5adb0bb000ÂÂÂÂÂÂ8K rw--- libc-2.23.so

In other words, the code that is currently upstream
could result in programs being grouped into a numa
group due to accesses to libc.so, if they happened
to get started up right at the same time.

This will not catch many programs, since most of them
will have private copies of the pages in the small
read-write segments by the time other programs start
up, but it could catch a few of them.

Testing on VM_WRITE|VM_SHARED would solve that issue,
but at that point it would be essentially identical
to reverting the code to the old pte_write() test
that we had in 3.19 and before.

I do not expect the performance impact to be visible,
except when somebody gets very unlucky with application
startup timing.

--

All Rights Reversed.

Attachment: signature.asc
Description: This is a digitally signed message part