Re: NUMA_BALANCING and Xen PV guest regression in 3.20-rc0

From: Mel Gorman
Date: Thu Feb 19 2015 - 12:01:20 EST

Next message: Morten Rasmussen: "Re: [PATCH RESEND v9 04/10] sched: Make sched entity usage tracking scale-invariant"
Previous message: Jiri Bohac: "Re: [PATCH] time, ntp: Do not update time_state in middle of leap second [v3]"
In reply to: David Vrabel: "NUMA_BALANCING and Xen PV guest regression in 3.20-rc0"
Next in thread: Dario Faggioli: "Re: [Xen-devel] NUMA_BALANCING and Xen PV guest regression in 3.20-rc0"

On Thu, Feb 19, 2015 at 01:06:53PM +0000, David Vrabel wrote:
> Mel,
>
> The NUMA_BALANCING series beginning with 5d833062139d (mm: numa: do not
> dereference pmd outside of the lock during NUMA hinting fault) and
> specifically 8a0516ed8b90 (mm: convert p[te|md]_numa users to
> p[te|md]_protnone_numa) breaks Xen 64-bit PV guests.
>
> Any fault on a present userspace mapping (e.g., a write to a read-only
> mapping) is being misinterpreted as a NUMA hinting fault and not handled
> correctly. All userspace programs end up continuously faulting.
>
> This is because the hypervisor sets _PAGE_GLOBAL (== _PAGE_PROTNONE) on
> all present userspace page table entries.
>

I see, this is a variation of the problem where the NUMA hinted PTE was
treated as special due to the paravirt interfaces not being used.
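
To make the aliasing concrete, here is a minimal userspace sketch, not
kernel code. The bit positions follow x86, but the function names
(protnone_flag_only, protnone_and_not_present, aliasing_demo) are made up
for illustration, and the two checks are only my reading of "before" and
"after" as described in this thread; the actual kernel helpers may differ
in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

/* x86 page table bit positions relevant to this thread. */
#define _PAGE_BIT_PRESENT   0
#define _PAGE_BIT_USER      2
#define _PAGE_BIT_GLOBAL    8
/* PROTNONE reuses the GLOBAL bit; meaningful only when PRESENT is clear. */
#define _PAGE_BIT_PROTNONE  _PAGE_BIT_GLOBAL

#define _PAGE_PRESENT   (1ULL << _PAGE_BIT_PRESENT)
#define _PAGE_USER      (1ULL << _PAGE_BIT_USER)
#define _PAGE_GLOBAL    (1ULL << _PAGE_BIT_GLOBAL)
#define _PAGE_PROTNONE  (1ULL << _PAGE_BIT_PROTNONE)

/* Assumed "before" behaviour: test the PROTNONE flag on its own. */
bool protnone_flag_only(pteval_t v)
{
    return (v & _PAGE_PROTNONE) != 0;
}

/* Assumed "after" behaviour: PROTNONE set and PRESENT clear. */
bool protnone_and_not_present(pteval_t v)
{
    return (v & (_PAGE_PROTNONE | _PAGE_PRESENT)) == _PAGE_PROTNONE;
}

void aliasing_demo(void)
{
    /* Present, read-only user mapping as a 64-bit PV guest sees it:
     * the hypervisor has set _PAGE_GLOBAL. The frame address is arbitrary. */
    pteval_t xen_user_pte = 0x1234000ULL | _PAGE_PRESENT | _PAGE_USER |
                            _PAGE_GLOBAL;

    /* A genuine PROT_NONE / NUMA hinting entry: PROTNONE set, not present. */
    pteval_t hinting_pte = 0x1234000ULL | _PAGE_PROTNONE;

    assert(protnone_flag_only(xen_user_pte));        /* misread as hinting */
    assert(!protnone_and_not_present(xen_user_pte)); /* correctly rejected */

    assert(protnone_flag_only(hinting_pte));         /* both checks agree */
    assert(protnone_and_not_present(hinting_pte));
}
```

With the flag-only check, the present PV guest entry looks like a hinting
entry, which matches the endless faulting described in the report.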

> Note that the comment in asm/pgtable_types.h says that
> _PAGE_BIT_PROTNONE is only valid on non-present entries.
>
> /* If _PAGE_BIT_PRESENT is clear, we use these: */
> /* - if the user mapped it with PROT_NONE; pte_present gives true */
> #define _PAGE_BIT_PROTNONE _PAGE_BIT_GLOBAL
>
> Adjusting pte_protnone() and pmd_protnone() to check for the absence of
> _PAGE_PRESENT allows 64-bit Xen PV guests to work correctly again (see
> following patch), but I'm not sure if NUMA_BALANCING would correctly
> work with this change.
>

Thanks for the analysis and the reminder of some of the details from the
previous discussion.

>
> 8<---------------------------
> x86: pte_protnone() and pmd_protnone() must check entry is
> not present
>
> Since _PAGE_PROTNONE aliases _PAGE_GLOBAL it is only valid if
> _PAGE_PRESENT is clear. Make pte_protnone() and pmd_protnone() check
> for this.
>
> This fixes a 64-bit Xen PV guest regression introduced by
> 8a0516ed8b90c95ffa1363b420caa37418149f21 (mm: convert p[te|md]_numa
> users to p[te|md]_protnone_numa). Any userspace process would
> endlessly fault.
>
> In a 64-bit PV guest, userspace page table entries have _PAGE_GLOBAL
> set by the hypervisor. This meant that any fault on a present
> userspace entry (e.g., a write to a read-only mapping) would be
> misinterpreted as a NUMA hinting fault and the fault would not be
> correctly handled, resulting in the access endlessly faulting.
>
> Signed-off-by: David Vrabel <david.vrabel@xxxxxxxxxx>
> Cc: Mel Gorman <mgorman@xxxxxxx>

I cannot think of a reason why this would fail for NUMA balancing on bare
metal. The PAGE_NONE protection clears the present bit on p[te|md]_modify
so the expectations are matched before or after the patch is applied. So,
for bare metal at least

Acked-by: Mel Gorman <mgorman@xxxxxxx>

I *think* this will work ok with Xen but I cannot 100% convince myself.
I'm adding Wei Liu to the cc who may have a Xen PV setup handy that
supports NUMA and may be able to test the patch to confirm.
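
The bare-metal reasoning above can be sketched in userspace as follows.
This is a heavy simplification and not the kernel's pte_modify():
sketch_pte_modify, SKETCH_PFN_MASK and SKETCH_PAGE_NONE are hypothetical
names, the mask is an assumption that only the frame bits survive a
protection change, and I am assuming PAGE_NONE amounts to
_PAGE_PROTNONE | _PAGE_ACCESSED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t pteval_t;

#define _PAGE_PRESENT   (1ULL << 0)
#define _PAGE_RW        (1ULL << 1)
#define _PAGE_USER      (1ULL << 2)
#define _PAGE_ACCESSED  (1ULL << 5)
#define _PAGE_GLOBAL    (1ULL << 8)
#define _PAGE_PROTNONE  _PAGE_GLOBAL    /* alias, as in the quoted header */

/* Assumption: protection used for PROT_NONE and NUMA hinting entries. */
#define SKETCH_PAGE_NONE  (_PAGE_PROTNONE | _PAGE_ACCESSED)

/* Assumption: bits 12..51 hold the frame; everything else is protection. */
#define SKETCH_PFN_MASK   0x000ffffffffff000ULL

/* Keep the frame, take every protection bit from newprot. */
pteval_t sketch_pte_modify(pteval_t val, pteval_t newprot)
{
    return (val & SKETCH_PFN_MASK) | (newprot & ~SKETCH_PFN_MASK);
}

/* Check without looking at the present bit. */
bool protnone_flag_only(pteval_t v)
{
    return (v & _PAGE_PROTNONE) != 0;
}

/* Check that also requires the present bit to be clear. */
bool protnone_and_not_present(pteval_t v)
{
    return (v & (_PAGE_PROTNONE | _PAGE_PRESENT)) == _PAGE_PROTNONE;
}

void bare_metal_demo(void)
{
    /* Ordinary present user mapping; nothing sets GLOBAL on bare metal. */
    pteval_t mapped = 0x1234000ULL | _PAGE_PRESENT | _PAGE_RW | _PAGE_USER;
    pteval_t hinted = sketch_pte_modify(mapped, SKETCH_PAGE_NONE);

    /* Applying the PAGE_NONE protection drops the present bit ... */
    assert(!(hinted & _PAGE_PRESENT));
    /* ... so both forms of the check agree on the hinting entry ... */
    assert(protnone_flag_only(hinted));
    assert(protnone_and_not_present(hinted));
    /* ... and both agree that the ordinary mapping is not one. */
    assert(!protnone_flag_only(mapped));
    assert(!protnone_and_not_present(mapped));
}
```

Under those assumptions the two checks never disagree on bare metal; they
only diverge when something else sets the GLOBAL bit on a present entry,
which is the PV guest case.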

--
Mel Gorman
SUSE Labs
