Re: [PATCH v9 00/24] Speculative page faults
From: Laurent Dufour
Date: Thu Mar 29 2018 - 08:51:31 EST
On 22/03/2018 02:21, Ganesh Mahendran wrote:
> Hi, Laurent
>
> 2018-03-14 1:59 GMT+08:00 Laurent Dufour <ldufour@xxxxxxxxxxxxxxxxxx>:
>> This is a port to kernel 4.16 of the work done by Peter Zijlstra to
>> handle page faults without holding the mm semaphore [1].
>>
>> The idea is to try to handle user space page faults without holding the
>> mmap_sem. This should allow better concurrency for massively threaded
>> processes, since the page fault handler will not wait for other threads'
>> memory layout changes to complete, assuming that these changes are done in
>> another part of the process's memory space. This type of page fault is
>> named speculative page fault. If the speculative page fault fails because
>> a concurrent change is detected or because the underlying PMD or PTE tables
>> are not yet allocated, its processing is aborted and a classic page fault
>> is then tried.
>>
>> The speculative page fault (SPF) has to look for the VMA matching the fault
>> address without holding the mmap_sem. This is done by introducing a rwlock
>> which protects the access to the mm_rb tree. Previously this was done using
>> SRCU, but it was introducing a lot of scheduling to process the VMA's
>> freeing operation, which was hitting the performance by 20% as reported by
>> Kemi Wang [2]. Using a rwlock to protect access to the mm_rb tree limits
>> the locking contention to these operations, which are expected to be of
>> O(log n) order. In addition, to ensure that the VMA is not freed behind our
>> back, a reference count is added and 2 services (get_vma() and put_vma())
>> are introduced to handle the reference count. When a VMA is fetched from
>> the RB tree using get_vma(), it must later be released using put_vma().
>> Furthermore, to allow the VMA to be used again by the classic page fault
>> handler, a service is introduced: can_reuse_spf_vma(). This service is
>> expected to be called with the mmap_sem held. It checks that the VMA still
>> matches the specified address and releases its reference count; as the
>> mmap_sem is held, it is ensured that the VMA will not be freed behind our
>> back. In general, the VMA's reference count could be decremented when
>> holding the mmap_sem but it should not be increased, as holding the
>> mmap_sem ensures that the VMA is stable. I no longer see the overhead I
>> got with the will-it-scale benchmark.
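
To illustrate the get/put pairing described above, here is a minimal userspace
sketch in C11. It is not the code from the series: the names
(vma_ref_sketch, get_vma_sketch(), put_vma_sketch()) are made up for the
example, the RB tree lookup and the rwlock are left out, and "freeing" is
modelled by a flag so the behaviour can be observed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted VMA. */
struct vma_ref_sketch {
	atomic_int refcount;	/* 1 while only the tree references the VMA */
	bool freed;		/* set instead of calling free(), for the demo */
};

/*
 * Take a reference. Assumption: in a real implementation this would be
 * done while the tree's read lock is held, so the VMA cannot go away
 * between the lookup and the increment.
 */
static struct vma_ref_sketch *get_vma_sketch(struct vma_ref_sketch *vma)
{
	if (vma)
		atomic_fetch_add(&vma->refcount, 1);
	return vma;
}

/* Drop a reference; the last holder is the one releasing the VMA. */
static void put_vma_sketch(struct vma_ref_sketch *vma)
{
	/* atomic_fetch_sub() returns the previous value: 1 means last ref. */
	if (atomic_fetch_sub(&vma->refcount, 1) == 1)
		vma->freed = true;
}
```

With this shape, a VMA removed from the tree while a speculative fault still
holds a reference is only released by the final put_vma_sketch() call.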
>>
>> The VMA's attributes checked during the speculative page fault processing
>> have to be protected against parallel changes. This is done by using a
>> per-VMA sequence lock. This sequence lock allows the speculative page fault
>> handler to quickly check for parallel changes in progress and to abort the
>> speculative page fault in that case.
>>
>> Once the VMA is found, the speculative page fault handler checks the VMA's
>> attributes to verify whether the page fault can be handled correctly. The
>> VMA is thus protected through a sequence lock which allows fast detection
>> of concurrent VMA changes. If such a change is detected, the speculative
>> page fault is aborted and a *classic* page fault is tried. VMA sequence
>> locking is added wherever VMA attributes which are checked during the page
>> fault are modified.
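
The sequence count check described above can be sketched in userspace C11 as
follows. This is only an illustration of the general pattern (the counter is
odd while a writer is active, and a reader aborts if the counter was odd or
has moved); the structure and function names are hypothetical and the kernel
helpers used by the series are not reproduced here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical VMA whose attributes are guarded by a sequence count. */
struct vma_seq_sketch {
	atomic_uint seq;	/* odd while a writer is modifying the VMA */
	atomic_ulong vm_start;
	atomic_ulong vm_end;
};

/* Writer side: make the count odd before the change, even after it. */
static void vma_write_begin_sketch(struct vma_seq_sketch *vma)
{
	atomic_fetch_add(&vma->seq, 1);
}

static void vma_write_end_sketch(struct vma_seq_sketch *vma)
{
	atomic_fetch_add(&vma->seq, 1);
}

/* Reader side: true if a change was in progress or happened since snap. */
static bool vma_changed_sketch(struct vma_seq_sketch *vma, unsigned int snap)
{
	return (snap & 1) || atomic_load(&vma->seq) != snap;
}

/*
 * Speculative check: returns true only if addr is inside the VMA and no
 * concurrent change was detected. On false, a caller would fall back to
 * the classic path.
 */
static bool spf_addr_in_vma_sketch(struct vma_seq_sketch *vma,
				   unsigned long addr)
{
	unsigned int snap = atomic_load(&vma->seq);
	bool inside = addr >= atomic_load(&vma->vm_start) &&
		      addr < atomic_load(&vma->vm_end);

	if (vma_changed_sketch(vma, snap))
		return false;	/* abort the speculative attempt */
	return inside;
}
```

The reader never blocks: it either gets a consistent view of the attributes or
gives up, which is what makes the check cheap on the fast path.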
>>
>> When the PTE is fetched, the VMA is checked to see if it has been changed.
>> Once the page table is locked, the VMA is therefore known to be valid, and
>> any other change touching this PTE will need to lock the page table, so no
>> parallel change is possible at this time.
>>
>> The locking of the PTE is done with interrupts disabled; this allows
>> checking the PMD to ensure that there is no ongoing collapsing operation.
>> Since khugepaged first sets the PMD to pmd_none and then waits for the
>> other CPUs to have caught the IPI, if the pmd is valid at the time the PTE
>> is locked, we have the guarantee that the collapsing operation will have to
>> wait on the PTE lock to move forward. This allows the SPF handler to map
>> the PTE safely. If the PMD value is different from the one recorded at the
>> beginning of the SPF operation, the classic page fault handler will be
>> called to handle the operation while holding the mmap_sem. As the PTE lock
>> is taken with interrupts disabled, the lock is taken using spin_trylock()
>> to avoid deadlock when handling a page fault while a TLB invalidate is
>> requested by another CPU holding the PTE lock.
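
The "compare the PMD, then try the lock, else fall back" logic above can be
sketched as below. This is a userspace approximation under stated
assumptions: the lock is a C11 atomic_flag rather than a kernel spinlock,
interrupt disabling is only indicated by a comment, and all names
(ptl_sketch, handle_fault_sketch(), ...) are invented for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical page table lock, modelled with an atomic flag. */
struct ptl_sketch {
	atomic_flag locked;
};

/* Never spins: returns false at once if the lock is already held. */
static bool ptl_trylock_sketch(struct ptl_sketch *l)
{
	/* test_and_set returns the previous state: false means we own it. */
	return !atomic_flag_test_and_set_explicit(&l->locked,
						  memory_order_acquire);
}

static void ptl_unlock_sketch(struct ptl_sketch *l)
{
	atomic_flag_clear_explicit(&l->locked, memory_order_release);
}

enum fault_path_sketch { FAULT_SPECULATIVE, FAULT_CLASSIC };

/*
 * pmd_seen is the value recorded at the start of the speculative attempt,
 * pmd_now the value read just before locking. In the kernel this would
 * run with interrupts disabled (not modelled here).
 */
static enum fault_path_sketch handle_fault_sketch(struct ptl_sketch *ptl,
						  unsigned long pmd_seen,
						  unsigned long pmd_now)
{
	if (pmd_seen != pmd_now)
		return FAULT_CLASSIC;	/* PMD changed, e.g. a collapse */
	if (!ptl_trylock_sketch(ptl))
		return FAULT_CLASSIC;	/* contended: do not spin */

	/* ... the PTE would be mapped and updated here ... */

	ptl_unlock_sketch(ptl);
	return FAULT_SPECULATIVE;
}
```

Using a trylock instead of a blocking lock is what avoids the deadlock case:
the speculative path gives up rather than waiting with interrupts off.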
>>
>> Support for THP is not done because when checking for the PMD, we can be
>> confused by an in-progress collapsing operation done by khugepaged. The
>> issue is that pmd_none() could be true either if the PMD is not already
>> populated or if the underlying PTEs are about to be collapsed. So we
>> cannot safely allocate a PMD if pmd_none() is true.
>>
>> This series adds a new software performance event named
>> 'speculative-faults' or 'spf'. It counts the number of page fault events
>> successfully handled in a speculative way. When recording 'faults,spf'
>> events, 'faults' counts the total number of page fault events while 'spf'
>> only counts the part of the faults processed in a speculative way.
>>
>> There are some trace events introduced by this series. They allow
>> identifying why page faults were not processed in a speculative way. This
>> doesn't take into account the faults generated by a monothreaded process,
>> which are directly processed while holding the mmap_sem. These trace
>> events are grouped in a system named 'pagefault'; they are:
>> - pagefault:spf_pte_lock : if the pte was already locked by another thread
>> - pagefault:spf_vma_changed : if the VMA has been changed behind our back
>> - pagefault:spf_vma_noanon : the vma->anon_vma field was not yet set.
>> - pagefault:spf_vma_notsup : the VMA's type is not supported
>> - pagefault:spf_vma_access : the VMA's access rights are not respected
>> - pagefault:spf_pmd_changed : the upper PMD pointer has changed behind our
>> back.
>>
>> To record all the related events, the easiest way is to run perf with the
>> following arguments:
>> $ perf stat -e 'faults,spf,pagefault:*' <command>
>>
>> This series builds on top of v4.16-rc2-mmotm-2018-02-21-14-48 and is
>> functional on x86 and PowerPC.
>>
>> ---------------------
>> Real Workload results
>>
>> As mentioned in a previous email, we did unofficial runs using a "popular
>> in-memory multithreaded database product" on a 176-core SMT8 Power system,
>> which showed a 30% improvement in the number of transactions processed per
>> second. This run was done on the v6 series, but changes introduced in
>> this new version should not impact the performance boost seen.
>>
>> Here are the perf data captured during 2 of these runs on top of the v8
>> series:
>>                 vanilla         spf
>> faults          89.418          101.364
>> spf             n/a             97.989
>>
>> With the SPF kernel, most of the page faults were processed in a
>> speculative way.
>>
>> ------------------
>> Benchmarks results
>>
>> Base kernel is v4.16-rc4-mmotm-2018-03-09-16-34
>> SPF is BASE + this series
>>
>> Kernbench:
>> ----------
>> Here are the results on a 16 CPUs X86 guest using kernbench on a 4.13-rc4
>> kernel (kernel is built 5 times):
>>
>> Average Half load -j 8
>>                   Run (std deviation)
>>                   BASE                  SPF
>> Elapsed Time      151.36 (1.40139)      151.748 (1.09716)      0.26%
>> User Time         1023.19 (3.58972)     1027.35 (2.30396)      0.41%
>> System Time       125.026 (1.8547)      124.504 (0.980015)    -0.42%
>> Percent CPU       758.2 (5.54076)       758.6 (3.97492)        0.05%
>> Context Switches  54924 (453.634)       54851 (382.293)       -0.13%
>> Sleeps            105589 (704.581)      105282 (435.502)      -0.29%
>>
>> Average Optimal load -j 16
>>                   Run (std deviation)
>>                   BASE                  SPF
>> Elapsed Time      74.804 (1.25139)      74.368 (0.406288)     -0.58%
>> User Time         962.033 (64.5125)     963.93 (66.8797)       0.20%
>> System Time       110.771 (15.0817)     110.387 (14.8989)     -0.35%
>> Percent CPU       1045.7 (303.387)      1049.1 (306.255)       0.33%
>> Context Switches  76201.8 (22433.1)     76170.4 (22482.9)     -0.04%
>> Sleeps            110289 (5024.05)      110220 (5248.58)      -0.06%
>>
>> During a run on the SPF, perf events were captured:
>> Performance counter stats for '../kernbench -M':
>>    510334017  faults
>>          200  spf
>>            0  pagefault:spf_pte_lock
>>            0  pagefault:spf_vma_changed
>>            0  pagefault:spf_vma_noanon
>>         2174  pagefault:spf_vma_notsup
>>            0  pagefault:spf_vma_access
>>            0  pagefault:spf_pmd_changed
>>
>> Very few speculative page faults were recorded as most of the processes
>> involved are monothreaded (it seems that on this architecture some threads
>> were created during the kernel build processing).
>>
>> Here are the kernbench results on an 80 CPUs Power8 system:
>>
>> Average Half load -j 40
>>                   Run (std deviation)
>>                   BASE                  SPF
>> Elapsed Time      116.958 (0.73401)     117.43 (0.927497)      0.40%
>> User Time         4472.35 (7.85792)     4480.16 (19.4909)      0.17%
>> System Time       136.248 (0.587639)    136.922 (1.09058)      0.49%
>> Percent CPU       3939.8 (20.6567)      3931.2 (17.2829)      -0.22%
>> Context Switches  92445.8 (236.672)     92720.8 (270.118)      0.30%
>> Sleeps            318475 (1412.6)       317996 (1819.07)      -0.15%
>>
>> Average Optimal load -j 80
>>                   Run (std deviation)
>>                   BASE                  SPF
>> Elapsed Time      106.976 (0.406731)    107.72 (0.329014)      0.70%
>> User Time         5863.47 (1466.45)     5865.38 (1460.27)      0.03%
>> System Time       159.995 (25.0393)     160.329 (24.6921)      0.21%
>> Percent CPU       5446.2 (1588.23)      5416 (1565.34)        -0.55%
>> Context Switches  223018 (137637)       224867 (139305)        0.83%
>> Sleeps            330846 (13127.3)      332348 (15556.9)       0.45%
>>
>> During a run on the SPF, perf events were captured:
>> Performance counter stats for '../kernbench -M':
>>    116612488  faults
>>            0  spf
>>            0  pagefault:spf_pte_lock
>>            0  pagefault:spf_vma_changed
>>            0  pagefault:spf_vma_noanon
>>          473  pagefault:spf_vma_notsup
>>            0  pagefault:spf_vma_access
>>            0  pagefault:spf_pmd_changed
>>
>> Most of the processes involved are monothreaded so SPF is not activated but
>> there is no impact on the performance.
>>
>> Ebizzy:
>> -------
>> The test counts the number of records per second it can manage; the
>> higher the better. I ran it like this: 'ebizzy -mTRp'. To get consistent
>> results I repeated the test 100 times and measured the average result.
>>
>>                     BASE        SPF         delta
>> 16 CPUs x86 VM      14902.6     95905.16    543.55%
>> 80 CPUs P8 node     37240.24    78185.67    109.95%
>>
>> Here are the performance counters read during a run on a 16 CPUs x86 VM:
>> Performance counter stats for './ebizzy -mRTp':
>>       888157  faults
>>       884773  spf
>>           92  pagefault:spf_pte_lock
>>         2379  pagefault:spf_vma_changed
>>            0  pagefault:spf_vma_noanon
>>           80  pagefault:spf_vma_notsup
>>            0  pagefault:spf_vma_access
>>            0  pagefault:spf_pmd_changed
>>
>> And the ones captured during a run on an 80 CPUs Power node:
>> Performance counter stats for './ebizzy -mRTp':
>>       762134  faults
>>       728663  spf
>>        19101  pagefault:spf_pte_lock
>>        13969  pagefault:spf_vma_changed
>>            0  pagefault:spf_vma_noanon
>>          272  pagefault:spf_vma_notsup
>>            0  pagefault:spf_vma_access
>>            0  pagefault:spf_pmd_changed
>>
>> In ebizzy's case most of the page faults were handled in a speculative way,
>> leading to the ebizzy performance boost.
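
For reference, the share of faults handled speculatively and the delta column
can be recomputed from the numbers quoted above with a few lines of C. The
helper names are made up for this example; only the input figures come from
the ebizzy runs.

```c
#include <assert.h>

/* Percentage of page faults that were handled speculatively. */
static double spf_share_percent(unsigned long faults, unsigned long spf)
{
	if (faults == 0)
		return 0.0;	/* avoid dividing by zero on an empty run */
	return 100.0 * (double)spf / (double)faults;
}

/* Relative change of the SPF result against the BASE result, in percent. */
static double delta_percent(double base, double spf)
{
	return 100.0 * (spf - base) / base;
}
```

Calling spf_share_percent(888157, 884773) gives about 99.62% for the x86 VM
and spf_share_percent(762134, 728663) about 95.61% for the Power node, while
delta_percent(14902.6, 95905.16) and delta_percent(37240.24, 78185.67)
reproduce the 543.55% and 109.95% deltas of the table.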
>
> We ported the SPF to kernel 4.9 on Android devices.
> App launch time improves by about 15% on average. For apps
> which have hundreds of threads, the improvement is about 20%.
Hi Ganesh,
Thanks for sharing these great and encouraging results.
Could you please give a bit more detail about your system configuration and
application?
Laurent.
> Thanks.
>
>>
>> ------------------
>> Changes since v8:
>> - Don't check PMD when locking the pte when THP is disabled
>> Thanks to Daniel Jordan for reporting this.
>> - Rebase on 4.16
>> Changes since v7:
>> - move pte_map_lock() and pte_spinlock() up in mm/memory.c (patch 4 & 5)
>> - make pte_unmap_same() compatible with the speculative page fault (patch 6)
>> Changes since v6:
>> - Rename config variable to CONFIG_SPECULATIVE_PAGE_FAULT (patch 1)
>> - Review the way the config variable is set (patch 1 to 3)
>> - Introduce mm_rb_write_*lock() in mm/mmap.c (patch 18)
>> - Merge patch introducing pte try locking in the patch 18.
>> Changes since v5:
>> - use rwlock against the mm RB tree in place of SRCU
>> - add a VMA's reference count to protect VMA while using it without
>> holding the mmap_sem.
>> - check PMD value to detect collapsing operation
>> - don't try speculative page fault for mono threaded processes
>> - try to reuse the fetched VMA if VM_RETRY is returned
>> - go directly to the error path if an error is detected during the SPF
>> path
>> - fix race window when moving VMA in move_vma()
>> Changes since v4:
>> - As requested by Andrew Morton, use CONFIG_SPF and define it earlier in
>> the series to ease bisection.
>> Changes since v3:
>> - Don't build when CONFIG_SMP is not set
>> - Fixed a lock dependency warning in __vma_adjust()
>> - Use READ_ONCE to access p*d values in handle_speculative_fault()
>> - Call memcp_oom() service in handle_speculative_fault()
>> Changes since v2:
>> - Perf event is renamed to PERF_COUNT_SW_SPF
>> - On Power handle do_page_fault()'s cleaning
>> - On Power if the VM_FAULT_ERROR is returned by
>> handle_speculative_fault(), do not retry but jump to the error path
>> - If VMA's flags are not matching the fault, directly return
>> VM_FAULT_SIGSEGV and not VM_FAULT_RETRY
>> - Check for pud_trans_huge() to avoid speculative path
>> - Handles _vm_normal_page()'s introduced by 6f16211df3bf
>> ("mm/device-public-memory: device memory cache coherent with CPU")
>> - add and review few comments in the code
>> Changes since v1:
>> - Remove PERF_COUNT_SW_SPF_FAILED perf event.
>> - Add tracing events to detail speculative page fault failures.
>> - Cache VMA fields values which are used once the PTE is unlocked at the
>> end of the page fault events.
>> - Ensure that fields read during the speculative path are written and read
>> using WRITE_ONCE and READ_ONCE.
>> - Add checks at the beginning of the speculative path to abort it if the
>> VMA is known to not be supported.
>> Changes since RFC V5 [5]
>> - Port to 4.13 kernel
>> - Merging patch fixing lock dependency into the original patch
>> - Replace the 2 parameters of vma_has_changed() with the vmf pointer
>> - In patch 7, don't call __do_fault() in the speculative path as it may
>> want to unlock the mmap_sem.
>> - In patch 11-12, don't check for vma boundaries when
>> page_add_new_anon_rmap() is called during the spf path and protect against
>> anon_vma pointer's update.
>> - In patch 13-16, add performance events to report number of successful
>> and failed speculative events.
>>
>> [1]
>> http://linux-kernel.2935.n7.nabble.com/RFC-PATCH-0-6-Another-go-at-speculative-page-faults-tt965642.html#none
>> [2] https://patchwork.kernel.org/patch/9999687/
>>
>>
>> Laurent Dufour (20):
>> mm: Introduce CONFIG_SPECULATIVE_PAGE_FAULT
>> x86/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT
>> powerpc/mm: Define CONFIG_SPECULATIVE_PAGE_FAULT
>> mm: Introduce pte_spinlock for FAULT_FLAG_SPECULATIVE
>> mm: make pte_unmap_same compatible with SPF
>> mm: Protect VMA modifications using VMA sequence count
>> mm: protect mremap() against SPF handler
>> mm: Protect SPF handler against anon_vma changes
>> mm: Cache some VMA fields in the vm_fault structure
>> mm/migrate: Pass vm_fault pointer to migrate_misplaced_page()
>> mm: Introduce __lru_cache_add_active_or_unevictable
>> mm: Introduce __maybe_mkwrite()
>> mm: Introduce __vm_normal_page()
>> mm: Introduce __page_add_new_anon_rmap()
>> mm: Protect mm_rb tree with a rwlock
>> mm: Adding speculative page fault failure trace events
>> perf: Add a speculative page fault sw event
>> perf tools: Add support for the SPF perf event
>> mm: Speculative page fault handler return VMA
>> powerpc/mm: Add speculative page fault
>>
>> Peter Zijlstra (4):
>> mm: Prepare for FAULT_FLAG_SPECULATIVE
>> mm: VMA sequence count
>> mm: Provide speculative fault infrastructure
>> x86/mm: Add speculative pagefault handling
>>
>> arch/powerpc/Kconfig | 1 +
>> arch/powerpc/mm/fault.c | 31 +-
>> arch/x86/Kconfig | 1 +
>> arch/x86/mm/fault.c | 38 ++-
>> fs/proc/task_mmu.c | 5 +-
>> fs/userfaultfd.c | 17 +-
>> include/linux/hugetlb_inline.h | 2 +-
>> include/linux/migrate.h | 4 +-
>> include/linux/mm.h | 92 +++++-
>> include/linux/mm_types.h | 7 +
>> include/linux/pagemap.h | 4 +-
>> include/linux/rmap.h | 12 +-
>> include/linux/swap.h | 10 +-
>> include/trace/events/pagefault.h | 87 +++++
>> include/uapi/linux/perf_event.h | 1 +
>> kernel/fork.c | 3 +
>> mm/Kconfig | 3 +
>> mm/hugetlb.c | 2 +
>> mm/init-mm.c | 3 +
>> mm/internal.h | 20 ++
>> mm/khugepaged.c | 5 +
>> mm/madvise.c | 6 +-
>> mm/memory.c | 594 ++++++++++++++++++++++++++++++----
>> mm/mempolicy.c | 51 ++-
>> mm/migrate.c | 4 +-
>> mm/mlock.c | 13 +-
>> mm/mmap.c | 211 +++++++++---
>> mm/mprotect.c | 4 +-
>> mm/mremap.c | 13 +
>> mm/rmap.c | 5 +-
>> mm/swap.c | 6 +-
>> mm/swap_state.c | 8 +-
>> tools/include/uapi/linux/perf_event.h | 1 +
>> tools/perf/util/evsel.c | 1 +
>> tools/perf/util/parse-events.c | 4 +
>> tools/perf/util/parse-events.l | 1 +
>> tools/perf/util/python.c | 1 +
>> 37 files changed, 1097 insertions(+), 174 deletions(-)
>> create mode 100644 include/trace/events/pagefault.h
>>
>> --
>> 2.7.4
>>
>