[PATCH 04/52] mm/numa: define _PAGE_NUMA

From: Ingo Molnar
Date: Sun Dec 02 2012 - 13:55:44 EST


From: Andrea Arcangeli <aarcange@xxxxxxxxxx>

The objective of _PAGE_NUMA is to be able to trigger NUMA
hinting page faults in order to identify the per-NUMA-node
working set of a thread at runtime.

Arming the NUMA hinting page fault mechanism works similarly to
setting up an mprotect(PROT_NONE) virtual range: the present bit
is cleared at the same time that _PAGE_NUMA is set, so when the
fault triggers we can identify it as a NUMA hinting page fault.
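
For illustration only (nothing below is part of this patch, and the
helper name pteval_mknuma() is made up; the real pte/pmd helpers are
introduced later in the series), a minimal userspace sketch of that
arming step, modelling an x86 pteval as a 64-bit integer with the
present bit at position 0 and the shared NUMA/PROTNONE bit at
position 8, could look like this:

#include <assert.h>
#include <stdint.h>

#define PAGE_PRESENT	((uint64_t)1 << 0)	/* models x86 _PAGE_BIT_PRESENT (0) */
#define PAGE_NUMA	((uint64_t)1 << 8)	/* models bit 8, shared with _PAGE_PROTNONE */

/* Arm a NUMA hinting fault: set the NUMA bit, clear the present bit. */
static uint64_t pteval_mknuma(uint64_t pteval)
{
	return (pteval | PAGE_NUMA) & ~PAGE_PRESENT;
}

int main(void)
{
	uint64_t pte = PAGE_PRESENT | 0x1000;	/* a made-up present pte */

	pte = pteval_mknuma(pte);
	assert(!(pte & PAGE_PRESENT));	/* the next access faults ... */
	assert(pte & PAGE_NUMA);	/* ... and is identifiable as a NUMA hint */
	return 0;
}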

_PAGE_NUMA on x86 shares the same bit number as _PAGE_PROTNONE
(though it could also use a different bitflag; it is up to the
architecture to decide).

It would be confusing to refer to "NUMA hinting page faults" as
"do_prot_none faults": they are different events, and _PAGE_NUMA
does not alter the semantics of mprotect(PROT_NONE) in any way.

Sharing the same bitflag with _PAGE_PROTNONE in fact complicates
things: it requires us to ensure that the code paths executed
for _PAGE_PROTNONE remain mutually exclusive with the code paths
executed for _PAGE_NUMA at all times, so that _PAGE_NUMA and
_PAGE_PROTNONE never step on each other's toes.
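
Again purely as an illustration outside this patch (assuming the
pte_numa()-style test that later patches in the series add), the
mutual exclusion can be sketched in the same userspace model: the
pte-level check only requires "shared bit set, present bit clear",
while PROT_NONE vmas are filtered out earlier by access_error():

#include <assert.h>
#include <stdint.h>

#define PAGE_PRESENT	((uint64_t)1 << 0)
#define PAGE_NUMA	((uint64_t)1 << 8)	/* same bit number as _PAGE_PROTNONE */

/*
 * A pte counts as a NUMA-hinting pte when the shared bit is set and
 * the present bit is clear.  A pte under a PROT_NONE vma looks
 * identical at this level; the two stay mutually exclusive because
 * access_error() rejects faults on PROT_NONE vmas before
 * handle_mm_fault() runs, so a check like this is only reached for
 * vmas whose protection is not PROT_NONE.
 */
static int pteval_numa(uint64_t pteval)
{
	return (pteval & (PAGE_NUMA | PAGE_PRESENT)) == PAGE_NUMA;
}

int main(void)
{
	assert(pteval_numa(PAGE_NUMA));			/* armed NUMA pte */
	assert(!pteval_numa(PAGE_NUMA | PAGE_PRESENT));	/* ordinary present pte */
	return 0;
}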

Because we want to be able to set this bitflag in any established
pte or pmd (while clearing the present bit at the same time)
without losing information, this bitflag must never be set when
the pte or pmd is present, and the bit picked for _PAGE_NUMA
must therefore not be used by the swap entry format.

Signed-off-by: Andrea Arcangeli <aarcange@xxxxxxxxxx>
Signed-off-by: Mel Gorman <mgorman@xxxxxxx>
Reviewed-by: Rik van Riel <riel@xxxxxxxxxx>
Cc: Johannes Weiner <hannes@xxxxxxxxxxx>
Cc: Hugh Dickins <hughd@xxxxxxxxxx>
Cc: Paul Turner <pjt@xxxxxxxxxx>
Cc: Lee Schermerhorn <Lee.Schermerhorn@xxxxxx>
Cc: Alex Shi <lkml.alex@xxxxxxxxx>
Cc: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
Cc: Aneesh Kumar <aneesh.kumar@xxxxxxxxxxxxxxxxxx>
Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
Cc: Peter Zijlstra <a.p.zijlstra@xxxxxxxxx>
Signed-off-by: Ingo Molnar <mingo@xxxxxxxxxx>
---
arch/x86/include/asm/pgtable_types.h | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index ec8a1fc..3c32db8 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -64,6 +64,26 @@
#define _PAGE_FILE (_AT(pteval_t, 1) << _PAGE_BIT_FILE)
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)

+/*
+ * _PAGE_NUMA indicates that this page will trigger a numa hinting
+ * minor page fault to gather numa placement statistics (see
+ * pte_numa()). The bit picked (8) is within the range between
+ * _PAGE_FILE (6) and _PAGE_PROTNONE (8) bits. Therefore, it doesn't
+ * require changes to the swp entry format because that bit is always
+ * zero when the pte is not present.
+ *
+ * The bit picked must always be zero whether the pmd is present or
+ * not present, so that we don't lose information when we set it
+ * while atomically clearing the present bit.
+ *
+ * Because we share the same bit (8) with _PAGE_PROTNONE, this can be
+ * interpreted as _PAGE_NUMA only in places that _PAGE_PROTNONE could
+ * not reach, like handle_mm_fault() (see access_error() in
+ * arch/x86/mm/fault.c: the vma protection must not be PROT_NONE for
+ * handle_mm_fault() to be invoked).
+ */
+#define _PAGE_NUMA _PAGE_PROTNONE
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | \
_PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | \
--
1.7.11.7
