Avoid spurious page faults by avoiding pte_clear -> set pte

From: Christoph Lameter
Date: Thu Mar 31 2005 - 02:00:10 EST


The current way of updating ptes in the Linux vm includes first clearing
a pte before setting it to another value. The clearing is performed while
holding the page_table_lock to insure that the entry will not be modified
by the CPU directly (clearing the pte clears the present bit which
notifies the MMU that this entry is invalid) by an arch specific interrupt
handler or another page fault handler running on another CPU. This approach
is necessary for some architectures that cannot perform atomic updates of
page table entries.

If a page table entry is cleared then a second CPU may generate a page fault
for that entry. The fault handler on the second CPU will then attempt to
acquire the page_table_lock and wait until the first CPU has completed
updating the page table entry. The fault handler on the second CPU will then
discover that everything is ok and simply do nothing (apart from incrementing
the counters for a minor fault and marking the page again as accessed).

However, most architectures actually support atomic operations on page
table entries. The use of atomic operations on page table entries would
allow the update of a page table entry in a single atomic operation instead
of writing to the page table entry twice. There would also be no danger of
generating a spurious page fault on other CPUs.

The following patch introduces two new atomic operations ptep_xchg and
ptep_cmpxchg that may be provided by an architecture. The fallback in
include/asm-generic/pgtable.h is to simulate both operations through the
existing ptep_get_and_clear function. So there is essentially no change if
atomic operations on ptes have not been defined. Architectures that do
not support atomic operations on ptes may continue to use the clearing of
a pte for locking type purposes.

Atomic operations are enabled for i386, ia64 and x86_64 if a suitable
CPU is configured in SMP mode. Generic atomic definitions for ptep_xchg
and ptep_cmpxchg have been provided based on the existing xchg() and
cmpxchg() functions that already work atomically on many platforms.

The provided generic atomic functions may be overridden as usual by defining
the appropriate__HAVE_ARCH_xxx constant and providing a different
implementation.

The attempt to reduce the use of the page_table_lock in the page fault handler
through atomic operations relies on a pte never being clear if the pte is in
use even when the page_table_lock is not held. Clearing a pte before setting
it to another value could result in a situation in which a fault generated by
another cpu could install a pte which is then immediately overwritten by
the first CPU setting the pte to a valid value again. This patch is
important for future work on reducing the use of spinlocks in the vm through
atomic operations on pte's.

AIM7 Benchmark on an 8 processor system:

w/o patch
Tasks jobs/min jti jobs/min/task real cpu
1 471.79 100 471.7899 12.34 2.30 Wed Mar 30 21:29:54 2005
100 18068.92 89 180.6892 32.21 158.37 Wed Mar 30 21:30:27 2005
200 21427.39 84 107.1369 54.32 315.84 Wed Mar 30 21:31:22 2005
300 21500.87 82 71.6696 81.21 473.74 Wed Mar 30 21:32:43 2005
400 24886.42 83 62.2160 93.55 633.73 Wed Mar 30 21:34:23 2005
500 25658.89 81 51.3178 113.41 789.44 Wed Mar 30 21:36:17 2005
600 25693.47 81 42.8225 135.91 949.00 Wed Mar 30 21:38:33 2005
700 26098.32 80 37.2833 156.10 1108.17 Wed Mar 30 21:41:10 2005
800 26334.25 80 32.9178 176.80 1266.73 Wed Mar 30 21:44:07 2005
900 26913.85 80 29.9043 194.62 1422.11 Wed Mar 30 21:47:22 2005
1000 26749.89 80 26.7499 217.57 1583.95 Wed Mar 30 21:51:01 2005

w/patch:
Tasks jobs/min jti jobs/min/task real cpu
1 470.30 100 470.3030 12.38 2.33 Wed Mar 30 21:57:27 2005
100 18465.05 89 184.6505 31.52 158.62 Wed Mar 30 21:57:58 2005
200 22399.26 86 111.9963 51.97 315.95 Wed Mar 30 21:58:51 2005
300 24274.61 84 80.9154 71.93 475.04 Wed Mar 30 22:00:03 2005
400 25120.86 82 62.8021 92.67 634.10 Wed Mar 30 22:01:36 2005
500 25742.87 81 51.4857 113.04 791.13 Wed Mar 30 22:03:30 2005
600 26322.73 82 43.8712 132.66 948.31 Wed Mar 30 22:05:43 2005
700 25718.40 80 36.7406 158.41 1112.30 Wed Mar 30 22:08:22 2005
800 26361.08 80 32.9514 176.62 1269.94 Wed Mar 30 22:11:19 2005
900 26975.67 81 29.9730 194.17 1424.56 Wed Mar 30 22:14:33 2005
1000 26765.51 80 26.7655 217.44 1585.27 Wed Mar 30 22:18:12 2005

There are some minor performance improvements and some minimal losses for other
numbers of tasks. The improvement may be due to the avoidance of one store
and the avoidance of useless page faults.

Signed-off-by: Christoph Lameter <clameter@xxxxxxx>

Index: linux-2.6.11/mm/rmap.c
===================================================================
--- linux-2.6.11.orig/mm/rmap.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/rmap.c 2005-03-30 22:40:47.000000000 -0800
@@ -574,11 +574,6 @@ static int try_to_unmap_one(struct page

/* Nuke the page table entry. */
flush_cache_page(vma, address, page_to_pfn(page));
- pteval = ptep_clear_flush(vma, address, pte);
-
- /* Move the dirty bit to the physical page now the pte is gone. */
- if (pte_dirty(pteval))
- set_page_dirty(page);

if (PageAnon(page)) {
swp_entry_t entry = { .val = page->private };
@@ -593,10 +588,15 @@ static int try_to_unmap_one(struct page
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
+ pteval = ptep_xchg_flush(vma, address, pte, swp_entry_to_pte(entry));
BUG_ON(pte_file(*pte));
dec_mm_counter(mm, anon_rss);
- }
+ } else
+ pteval = ptep_clear_flush(vma, address, pte);
+
+ /* Move the dirty bit to the physical page now the pte is gone. */
+ if (pte_dirty(pteval))
+ set_page_dirty(page);

inc_mm_counter(mm, rss);
page_remove_rmap(page);
@@ -689,15 +689,15 @@ static void try_to_unmap_cluster(unsigne
if (ptep_clear_flush_young(vma, address, pte))
continue;

- /* Nuke the page table entry. */
flush_cache_page(vma, address, pfn);
- pteval = ptep_clear_flush(vma, address, pte);

/* If nonlinear, store the file page offset in the pte. */
if (page->index != linear_page_index(vma, address))
- set_pte_at(mm, address, pte, pgoff_to_pte(page->index));
+ pteval = ptep_xchg_flush(vma, address, pte, pgoff_to_pte(page->index));
+ else
+ pteval = ptep_clear_flush(vma, address, pte);

- /* Move the dirty bit to the physical page now the pte is gone. */
+ /* Move the dirty bit to the physical page now that the pte is gone. */
if (pte_dirty(pteval))
set_page_dirty(page);

Index: linux-2.6.11/mm/mprotect.c
===================================================================
--- linux-2.6.11.orig/mm/mprotect.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/mprotect.c 2005-03-30 22:42:51.000000000 -0800
@@ -32,17 +32,19 @@ static void change_pte_range(struct mm_s

pte = pte_offset_map(pmd, addr);
do {
- if (pte_present(*pte)) {
- pte_t ptent;
+ pte_t ptent;
+redo:
+ ptent = *pte;
+ if (!pte_present(ptent))
+ continue;

- /* Avoid an SMP race with hardware updated dirty/clean
- * bits by wiping the pte and then setting the new pte
- * into place.
- */
- ptent = pte_modify(ptep_get_and_clear(mm, addr, pte), newprot);
- set_pte_at(mm, addr, pte, ptent);
- lazy_mmu_prot_update(ptent);
- }
+ /* Deal with a potential SMP race with hardware/arch
+ * interrupt updating dirty/clean bits through the use
+ * of ptep_cmpxchg.
+ */
+ if (!ptep_cmpxchg(mm, addr, pte, ptent, pte_modify(ptent, newprot)))
+ goto redo;
+ lazy_mmu_prot_update(ptent);
} while (pte++, addr += PAGE_SIZE, addr != end);
pte_unmap(pte - 1);
}
Index: linux-2.6.11/include/asm-generic/pgtable.h
===================================================================
--- linux-2.6.11.orig/include/asm-generic/pgtable.h 2005-03-30 20:32:08.000000000 -0800
+++ linux-2.6.11/include/asm-generic/pgtable.h 2005-03-30 22:40:47.000000000 -0800
@@ -111,6 +111,92 @@ do { \
})
#endif

+#ifdef CONFIG_ATOMIC_TABLE_OPS
+
+/*
+ * The architecture does support atomic table operations.
+ * We may be able to provide atomic ptep_xchg and ptep_cmpxchg using
+ * cmpxchg and xchg.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+ __pte(xchg(&pte_val(*(__ptep)), pte_val(__pteval)))
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep,__oldval,__newval) \
+ (cmpxchg(&pte_val(*(__ptep)), \
+ pte_val(__oldval), \
+ pte_val(__newval) \
+ ) == pte_val(__oldval) \
+ )
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg(__vma, __address, __ptep, __pteval); \
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+
+#else
+
+/*
+ * No support for atomic operations on the page table.
+ * Exchanging of pte values is done by first swapping zeros into
+ * a pte and then putting new content into the pte entry.
+ * However, these functions will generate an empty pte for a
+ * short time frame. This means that the page_table_lock must be held
+ * to avoid a page fault that would install a new entry.
+ */
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg(__mm, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_get_and_clear(__mm, __address, __ptep); \
+ set_pte_at(__mm, __address, __ptep, __pteval); \
+ __pte; \
+})
+#endif
+
+#ifndef __HAVE_ARCH_PTEP_XCHG_FLUSH
+#ifndef __HAVE_ARCH_PTEP_XCHG
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_clear_flush(__vma, __address, __ptep); \
+ set_pte_at((__vma)->mm, __address, __ptep, __pteval); \
+ __pte; \
+})
+#else
+#define ptep_xchg_flush(__vma, __address, __ptep, __pteval) \
+({ \
+ pte_t __pte = ptep_xchg((__vma)->mm, __address, __ptep, __pteval);\
+ flush_tlb_page(__vma, __address); \
+ __pte; \
+})
+#endif
+#endif
+
+/*
+ * The fallback function for ptep_cmpxchg avoids any real use of cmpxchg
+ * since cmpxchg may not be available on certain architectures. Instead
+ * the clearing of a pte is used as a form of locking mechanism.
+ * This approach will only work if the page_table_lock is held to insure
+ * that the pte is not populated by a page fault generated on another
+ * CPU.
+ */
+#ifndef __HAVE_ARCH_PTEP_CMPXCHG
+#define ptep_cmpxchg(__mm, __address, __ptep, __old, __new) \
+({ \
+ pte_t prev = ptep_get_and_clear(__mm, __address, __ptep); \
+ int r = pte_val(prev) == pte_val(__old); \
+ set_pte_at(__mm, __address, __ptep, r ? (__new) : prev); \
+ r; \
+})
+#endif
+#endif
+
#ifndef __HAVE_ARCH_PTEP_SET_WRPROTECT
static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long address, pte_t *ptep)
{
Index: linux-2.6.11/arch/ia64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/ia64/Kconfig 2005-03-01 23:38:26.000000000 -0800
+++ linux-2.6.11/arch/ia64/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -272,6 +272,11 @@ config PREEMPT
Say Y here if you are building a kernel for a desktop, embedded
or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP
+ default y
+
config HAVE_DEC_LOCK
bool
depends on (SMP || PREEMPT)
Index: linux-2.6.11/arch/i386/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/i386/Kconfig 2005-03-30 20:31:27.000000000 -0800
+++ linux-2.6.11/arch/i386/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -886,6 +886,11 @@ config HAVE_DEC_LOCK
depends on (SMP || PREEMPT) && X86_CMPXCHG
default y

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP && X86_CMPXCHG && !X86_PAE
+ default y
+
# turning this on wastes a bunch of space.
# Summit needs it only when NUMA is on
config BOOT_IOREMAP
Index: linux-2.6.11/arch/x86_64/Kconfig
===================================================================
--- linux-2.6.11.orig/arch/x86_64/Kconfig 2005-03-30 20:31:40.000000000 -0800
+++ linux-2.6.11/arch/x86_64/Kconfig 2005-03-30 22:40:47.000000000 -0800
@@ -223,6 +223,11 @@ config PREEMPT
Say Y here if you are feeling brave and building a kernel for a
desktop, embedded or real-time system. Say N if you are unsure.

+config ATOMIC_TABLE_OPS
+ bool
+ depends on SMP
+ default y
+
config PREEMPT_BKL
bool "Preempt The Big Kernel Lock"
depends on PREEMPT
Index: linux-2.6.11/mm/memory.c
===================================================================
--- linux-2.6.11.orig/mm/memory.c 2005-03-30 20:32:21.000000000 -0800
+++ linux-2.6.11/mm/memory.c 2005-03-30 22:40:47.000000000 -0800
@@ -463,15 +463,19 @@ static void zap_pte_range(struct mmu_gat
page->index > details->last_index))
continue;
}
- ptent = ptep_get_and_clear(tlb->mm, addr, pte);
- tlb_remove_tlb_entry(tlb, pte, addr);
- if (unlikely(!page))
+ if (unlikely(!page)) {
+ ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+ tlb_remove_tlb_entry(tlb, pte, addr);
continue;
+ }
if (unlikely(details) && details->nonlinear_vma
&& linear_page_index(details->nonlinear_vma,
addr) != page->index)
- set_pte_at(tlb->mm, addr, pte,
- pgoff_to_pte(page->index));
+ ptent = ptep_xchg(tlb->mm, addr, pte,
+ pgoff_to_pte(page->index));
+ else
+ ptent = ptep_get_and_clear(tlb->mm, addr, pte);
+ tlb_remove_tlb_entry(tlb, pte, addr);
if (pte_dirty(ptent))
set_page_dirty(page);
if (PageAnon(page))
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/