Re: [PATCH v3 00/20] Speculative page faults

From: Laurent Dufour
Date: Mon Sep 18 2017 - 03:15:42 EST


Despite the unprovable lockdep warning raised by Sergey, I didn't get any
feedback on this series.

Is there a chance to get it merged upstream?

Thanks,
Laurent.

On 08/09/2017 20:06, Laurent Dufour wrote:
> This is a port to kernel 4.13 of the work done by Peter Zijlstra to
> handle page faults without holding the mm semaphore [1].
>
> The idea is to try to handle user space page faults without holding the
> mmap_sem. This should allow better concurrency for massively threaded
> processes, since the page fault handler will not wait for other threads'
> memory layout changes to complete, assuming that those changes are done in
> another part of the process's memory space. This type of page fault is
> named speculative page fault. If the speculative page fault fails because
> a concurrent change is detected or because the underlying PMD or PTE tables
> are not yet allocated, its processing is aborted and a classic page fault
> is then tried.
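
To make the fallback described above concrete, here is a minimal user-space
sketch in C. It is not code from the series: the result codes, the fault_ctx
structure and all function names are hypothetical, and the real handlers work
on VM_FAULT_* flags and kernel structures. It only shows the control flow,
where a lockless attempt is made first and the classic path is used when the
attempt asks for a retry.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical result codes; the kernel uses VM_FAULT_* flags instead. */
enum fault_result { FAULT_OK, FAULT_RETRY, FAULT_ERROR };

/* Hypothetical summary of what the fault handler observes. */
struct fault_ctx {
	bool concurrent_change;   /* a layout change raced with the fault */
	bool page_tables_present; /* PMD/PTE tables are already allocated */
	bool address_valid;       /* the address belongs to a mapping */
};

/* Lockless attempt: give up as soon as anything looks unsafe. */
enum fault_result sketch_speculative_fault(const struct fault_ctx *ctx)
{
	if (ctx->concurrent_change || !ctx->page_tables_present)
		return FAULT_RETRY;
	return ctx->address_valid ? FAULT_OK : FAULT_ERROR;
}

/* Classic path: assumed to run with the lock held, so it cannot race. */
enum fault_result sketch_classic_fault(const struct fault_ctx *ctx)
{
	return ctx->address_valid ? FAULT_OK : FAULT_ERROR;
}

/* Try the speculative path first; only a retry request falls back. */
enum fault_result sketch_handle_fault(const struct fault_ctx *ctx,
				      bool *used_classic)
{
	enum fault_result r;

	assert(ctx && used_classic);
	r = sketch_speculative_fault(ctx);
	*used_classic = (r == FAULT_RETRY);
	if (r == FAULT_RETRY)
		r = sketch_classic_fault(ctx);
	return r;
}
```

Note that an error from the speculative attempt is returned as is and does
not trigger the classic path; only a retry request does.
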
>
> The speculative page fault (SPF) has to look for the VMA matching the fault
> address without holding the mmap_sem, so the VMA list is now managed using
> SRCU, allowing lockless walking. The only impact would be the deferred file
> dereferencing in the case of a file mapping, since the file pointer is
> released once the SRCU cleaning is done. This patch relies on the change
> done recently by Paul McKenney in SRCU which now runs a callback per CPU
> instead of per SRCU structure [2].
>
> The VMA's attributes checked during the speculative page fault processing
> have to be protected against parallel changes. This is done by using a per
> VMA sequence lock. This sequence lock allows the speculative page fault
> handler to quickly check for parallel changes in progress and to abort the
> speculative page fault in that case.
>
> Once the VMA is found, the speculative page fault handler checks the
> VMA's attributes to verify whether the page fault can be handled
> correctly or not. Thus the VMA is protected through a sequence lock which
> allows fast detection of concurrent VMA changes. If such a change is
> detected, the speculative page fault is aborted and a *classic* page fault
> is tried. VMA sequence locks are taken wherever VMA attributes which are
> checked during the page fault are modified.
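
For readers not familiar with sequence counts, the detection scheme described
above can be sketched in user space with C11 atomics. This is an illustration
under assumptions, not the kernel seqcount API and not code from the series:
the structure and every sketch_* name are hypothetical. The counter is even
when the VMA is stable and odd while a writer is modifying it; a reader
records the counter before looking at the attributes and compares it again
afterwards.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a VMA carrying its own sequence count. */
struct vma_sketch {
	atomic_uint seq;          /* even: stable, odd: write in progress */
	unsigned long start, end; /* attributes read by the speculative path */
};

void sketch_vma_init(struct vma_sketch *v, unsigned long start,
		     unsigned long end)
{
	atomic_init(&v->seq, 0);
	v->start = start;
	v->end = end;
}

/* Writers are assumed to be serialized by another lock (mmap_sem here). */
void sketch_write_begin(struct vma_sketch *v)
{
	unsigned int old = atomic_fetch_add(&v->seq, 1);

	assert((old & 1) == 0); /* no other write may be in progress */
}

void sketch_write_end(struct vma_sketch *v)
{
	atomic_fetch_add(&v->seq, 1);
}

/* Returns false when a write is in progress: the caller should abort. */
bool sketch_read_begin(struct vma_sketch *v, unsigned int *snap)
{
	*snap = atomic_load(&v->seq);
	return (*snap & 1) == 0;
}

/* True if any write started or completed since the snapshot was taken. */
bool sketch_changed(struct vma_sketch *v, unsigned int snap)
{
	return atomic_load(&v->seq) != snap;
}
```

In real concurrent code the attribute fields themselves would also need
READ_ONCE/WRITE_ONCE style accesses; the sketch leaves them as plain fields
to keep the counting logic visible.
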
>
> When the PTE is fetched, the VMA is checked to see if it has been changed.
> Once the page table is locked, the VMA is known to be valid: any other
> change touching this PTE needs to lock the page table, so no parallel
> change is possible at this time.
>
> Compared to Peter's initial work, this series introduces a spin_trylock
> when dealing with speculative page faults. This is required to avoid a
> deadlock when handling a page fault while a TLB invalidate is requested by
> another CPU holding the PTE. Another change is due to a lock dependency
> issue with mapping->i_mmap_rwsem.
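
The trylock idea can be illustrated with a small user-space sketch. It is an
assumption-laden model, not the pte_spinlock() code of the series: the lock is
a C11 atomic_flag and all names are hypothetical. The point is only that the
speculative caller never waits for the lock; it reports failure so that the
fault can be retried through the classic path, while the classic caller spins
as usual.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a page table lock. */
struct ptl_sketch {
	atomic_flag locked;
};

void sketch_ptl_init(struct ptl_sketch *p)
{
	atomic_flag_clear(&p->locked);
}

/* Single attempt: true if the lock was free and is now held. */
bool sketch_trylock(struct ptl_sketch *p)
{
	return !atomic_flag_test_and_set_explicit(&p->locked,
						  memory_order_acquire);
}

void sketch_pte_unlock(struct ptl_sketch *p)
{
	atomic_flag_clear_explicit(&p->locked, memory_order_release);
}

/*
 * Speculative callers must not wait: a false return tells them to
 * abort and fall back. Classic callers spin until the lock is free.
 */
bool sketch_pte_lock(struct ptl_sketch *p, bool speculative)
{
	assert(p);
	if (speculative)
		return sketch_trylock(p);
	while (!sketch_trylock(p))
		; /* busy wait, as a spinlock would */
	return true;
}
```
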
>
> In addition, some VMA field values which are used once the PTE is unlocked
> at the end of the page fault path are saved into the vm_fault structure, so
> that the values used match the VMA at the time the PTE was locked.
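
A rough picture of this caching, as a user-space sketch: the structures and
field names below are hypothetical and much smaller than the kernel's
vm_area_struct and vm_fault. The values are copied once, and later code reads
the copies, so a change made to the VMA afterwards is not observed by the
fault being completed.

```c
#include <assert.h>

/* Hypothetical, reduced view of the VMA fields of interest. */
struct vma_fields_sketch {
	unsigned long flags;
	unsigned long page_prot;
};

/* Hypothetical, reduced fault descriptor holding cached copies. */
struct fault_sketch {
	const struct vma_fields_sketch *vma;
	unsigned long vma_flags;     /* copy taken at snapshot time */
	unsigned long vma_page_prot; /* copy taken at snapshot time */
};

/* Record the VMA values the rest of the fault path will rely on. */
void sketch_snapshot(struct fault_sketch *vmf,
		     const struct vma_fields_sketch *vma)
{
	assert(vmf && vma);
	vmf->vma = vma;
	vmf->vma_flags = vma->flags;
	vmf->vma_page_prot = vma->page_prot;
}
```
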
>
> This series only supports VMAs with no vm_ops defined, so huge pages and
> mapped files are not handled by the speculative path. In addition,
> transparent huge pages are not supported. Once this series is accepted
> upstream, I'll extend the support to mapped files and transparent huge
> pages.
>
> This series builds on top of v4.13.9-mm1 and is functional on x86 and
> PowerPC.
>
> Tests have been made using a large commercial in-memory database on a
> PowerPC system with 752 CPUs, using RFC v5, a previous version of this
> series. The results are very encouraging since the loading of the 2TB
> database was faster by 14% with the speculative page fault.
>
> Using the ebizzy test [3], which spawns a lot of threads, the results are
> good when running on either a large or a small system. When using
> kernbench [4], the results are quite similar, which is expected as not many
> multithreaded processes are involved. But there is no performance
> degradation either, which is good.
>
> ------------------
> Benchmark results
>
> Note that these tests have been made on top of 4.13.0-mm1.
>
> Ebizzy:
> -------
> The test counts the number of records per second it can manage; higher is
> better. I ran it like this: 'ebizzy -mTRp'. To get consistent results I
> repeated the test 100 times and measured the average result, mean
> deviation, max and min.
>
> - 16 CPUs x86 VM
> Records/s        4.13.0-mm1   4.13.0-mm1-spf   delta
> Average          13217.90     65765.94         +397.55%
> Mean deviation   690.37       2609.36          +277.97%
> Max              16726        77675            +364.40%
> Min              12194        61634            +405.45%
>
> - 80 CPUs Power 8 node:
> Records/s        4.13.0-mm1   4.13.0-mm1-spf   delta
> Average          38175.40     67635.55         +77.17%
> Mean deviation   600.09       2349.66          +291.55%
> Max              39563        74292            +87.78%
> Min              35846        62657            +74.79%
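
For anyone wanting to recompute the columns above, here is a small C sketch
of the arithmetic. The function names are hypothetical, and "mean deviation"
is assumed here to mean the mean absolute deviation from the average, which
the original text does not spell out. The delta is the relative change from
the base kernel to the SPF kernel, in percent.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n samples. */
double sketch_mean(const double *x, size_t n)
{
	double sum = 0.0;
	size_t i;

	assert(n > 0);
	for (i = 0; i < n; i++)
		sum += x[i];
	return sum / (double)n;
}

/* Mean absolute deviation from the mean (assumed definition). */
double sketch_mean_deviation(const double *x, size_t n)
{
	double m = sketch_mean(x, n);
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += (x[i] > m) ? (x[i] - m) : (m - x[i]);
	return sum / (double)n;
}

/* Relative change from base to patched, in percent. */
double sketch_delta_percent(double base, double patched)
{
	assert(base != 0.0);
	return (patched - base) / base * 100.0;
}
```

With the x86 averages from the table, sketch_delta_percent(13217.90,
65765.94) gives a value that rounds to the +397.55% reported there.
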
>
> The number of records per second is far better with the speculative page
> fault.
> The mean deviation is higher with the speculative page fault, maybe
> because sometimes the faults are not handled in a speculative way, leading
> to more variation.
> The numbers for the x86 guest are really insane for the SPF case, but I
> ran the test several times and got this delta each time. I ran the test
> again using the previous version of the patch and got similar numbers. It
> happens that the host running the VM is far less loaded now, leading to
> better results as more threads are eligible to run.
> The tests on Power were done on a badly balanced node where the memory is
> only attached to one core.
>
> Kernbench:
> ----------
> This test builds a 4.12 kernel using the platform default config. The
> build was run 5 times in each case.
>
> - 16 CPUs x86 VM
> Average Half load -j 8 Run (std deviation)
>                    4.13.0-mm1           4.13.0-mm1-spf       delta %
> Elapsed Time       145.968 (0.402206)   145.654 (0.533601)   -0.22
> User Time          1006.58 (2.74729)    1003.7 (4.11294)     -0.29
> System Time        108.464 (0.177567)   111.034 (0.718213)   +2.37
> Percent CPU        763.4 (1.34164)      764.8 (1.30384)      +0.18
> Context Switches   46599.6 (412.013)    63771 (1049.95)      +36.85
> Sleeps             85313.2 (514.456)    85532.2 (681.199)    +0.26
>
> Average Optimal load -j 16 Run (std deviation)
>                    4.13.0-mm1           4.13.0-mm1-spf       delta %
> Elapsed Time       74.292 (0.75998)     74.484 (0.723035)    +0.26
> User Time          959.949 (49.2036)    956.057 (50.2993)    -0.41
> System Time        100.203 (8.7119)     101.984 (9.56099)    +1.78
> Percent CPU        1058 (310.661)       1054.3 (305.263)     -0.35
> Context Switches   65713.8 (20161.7)    86619.4 (24095.4)    +31.81
> Sleeps             90344.9 (5364.74)    90877.4 (5655.87)    +0.59
>
> The elapsed times are similar, but the impact is less significant since
> there are fewer multithreaded processes involved here.
>
> - 80 CPUs Power 8 node:
> Average Half load -j 40 Run (std deviation)
>                    4.13.0-mm1           4.13.0-mm1-spf       delta %
> Elapsed Time       115.342 (0.321668)   115.786 (0.427118)   +0.38
> User Time          4355.08 (10.1778)    4371.77 (14.9715)    +0.38
> System Time        127.612 (0.882083)   130.048 (1.06258)    +1.91
> Percent CPU        3885.8 (11.606)      3887.4 (8.04984)     +0.04
> Context Switches   80907.8 (657.481)    81936.4 (729.538)    +1.27
> Sleeps             162109 (793.331)     162057 (1414.08)     -0.03
>
> Average Optimal load -j 80 Run (std deviation)
>                    4.13.0-mm1           4.13.0-mm1-spf       delta %
> Elapsed Time       110.308 (0.725445)   109.78 (0.826862)    -0.48
> User Time          5893.12 (1621.33)    5923.19 (1635.48)    +0.51
> System Time        162.168 (36.4347)    166.533 (38.4695)    +2.69
> Percent CPU        5400.2 (1596.89)     5440.4 (1637.71)     +0.74
> Context Switches   129372 (51088.2)     144529 (65985.5)     +11.72
> Sleeps             157312 (5113.57)     158696 (4301.48)     +0.88
>
> Here the elapsed times are similar with the SPF release; the differences
> remain within the error margin. It has to be noted that this system is not
> correctly balanced from the NUMA point of view, as all the available memory
> is attached to one core.
>
> ------------------------
> Changes since v2:
>  - Perf event is renamed to PERF_COUNT_SW_SPF
>  - On Power, handle do_page_fault()'s cleanup
>  - On Power, if VM_FAULT_ERROR is returned by
>    handle_speculative_fault(), do not retry but jump to the error path
>  - If the VMA's flags do not match the fault, directly return
>    VM_FAULT_SIGSEGV and not VM_FAULT_RETRY
>  - Check for pud_trans_huge() to avoid the speculative path
>  - Handle _vm_normal_page(), introduced by 6f16211df3bf
>    ("mm/device-public-memory: device memory cache coherent with CPU")
>  - Add and review a few comments in the code
> Changes since v1:
>  - Remove PERF_COUNT_SW_SPF_FAILED perf event.
>  - Add tracing events to detail speculative page fault failures.
>  - Cache VMA field values which are used once the PTE is unlocked at the
>    end of the page fault events.
>  - Ensure that fields read during the speculative path are written and read
>    using WRITE_ONCE and READ_ONCE.
>  - Add checks at the beginning of the speculative path to abort it if the
>    VMA is known to not be supported.
> Changes since RFC V5 [5]:
>  - Port to 4.13 kernel
>  - Merge the patch fixing the lock dependency into the original patch
>  - Replace the 2 parameters of vma_has_changed() with the vmf pointer
>  - In patch 7, don't call __do_fault() in the speculative path as it may
>    want to unlock the mmap_sem.
>  - In patch 11-12, don't check for vma boundaries when
>    page_add_new_anon_rmap() is called during the spf path and protect
>    against anon_vma pointer's update.
>  - In patch 13-16, add performance events to report the number of
>    successful and failed speculative events.
>
> [1] http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=da915ad5cf25b5f5d358dd3670c3378d8ae8c03e
> [3] http://ebizzy.sourceforge.net/
> [4] http://ck.kolivas.org/apps/kernbench/kernbench-0.50/
> [5] https://lwn.net/Articles/725607/
>
> Laurent Dufour (14):
> mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
> mm: Protect VMA modifications using VMA sequence count
> mm: Cache some VMA fields in the vm_fault structure
> mm: Protect SPF handler against anon_vma changes
> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
> mm: Introduce __lru_cache_add_active_or_unevictable
> mm: Introduce __maybe_mkwrite()
> mm: Introduce __vm_normal_page()
> mm: Introduce __page_add_new_anon_rmap()
> mm: Try spin lock in speculative path
> mm: Adding speculative page fault failure trace events
> perf: Add a speculative page fault sw event
> perf tools: Add support for the SPF perf event
> powerpc/mm: Add speculative page fault
>
> Peter Zijlstra (6):
> mm: Dont assume page-table invariance during faults
> mm: Prepare for FAULT_FLAG_SPECULATIVE
> mm: VMA sequence count
> mm: RCU free VMAs
> mm: Provide speculative fault infrastructure
> x86/mm: Add speculative pagefault handling
>
> arch/powerpc/include/asm/book3s/64/pgtable.h | 5 +
> arch/powerpc/mm/fault.c | 15 +
> arch/x86/include/asm/pgtable_types.h | 7 +
> arch/x86/mm/fault.c | 19 ++
> fs/proc/task_mmu.c | 5 +-
> fs/userfaultfd.c | 17 +-
> include/linux/hugetlb_inline.h | 2 +-
> include/linux/migrate.h | 4 +-
> include/linux/mm.h | 28 +-
> include/linux/mm_types.h | 3 +
> include/linux/pagemap.h | 4 +-
> include/linux/rmap.h | 12 +-
> include/linux/swap.h | 11 +-
> include/trace/events/pagefault.h | 87 +++++
> include/uapi/linux/perf_event.h | 1 +
> kernel/fork.c | 1 +
> mm/hugetlb.c | 2 +
> mm/init-mm.c | 1 +
> mm/internal.h | 19 ++
> mm/khugepaged.c | 5 +
> mm/madvise.c | 6 +-
> mm/memory.c | 478 ++++++++++++++++++++++-----
> mm/mempolicy.c | 51 ++-
> mm/migrate.c | 4 +-
> mm/mlock.c | 13 +-
> mm/mmap.c | 138 ++++++--
> mm/mprotect.c | 4 +-
> mm/mremap.c | 7 +
> mm/rmap.c | 5 +-
> mm/swap.c | 12 +-
> tools/include/uapi/linux/perf_event.h | 1 +
> tools/perf/util/evsel.c | 1 +
> tools/perf/util/parse-events.c | 4 +
> tools/perf/util/parse-events.l | 1 +
> tools/perf/util/python.c | 1 +
> 35 files changed, 796 insertions(+), 178 deletions(-)
> create mode 100644 include/trace/events/pagefault.h
>