Re: [PATCH v9 0/8] Reduce tlb and interrupt numbers over 90% by improving folio migration
From: Byungchul Park
Date: Wed Apr 17 2024 - 06:33:54 EST
On Wed, Apr 17, 2024 at 04:18:39PM +0900, Byungchul Park wrote:
> Hi everyone,
>
> While working with a tiered memory system, e.g. CXL memory, I have
> been facing migration overhead, especially tlb shootdowns on promotion
> or demotion between different tiers. Most tlb shootdowns on migration
> through hinting faults can already be avoided thanks to Huang Ying's
> work, commit 4d4b6d66db ("mm,unmap: avoid flushing TLB in batch if PTE
> is inaccessible"). See the following link for more information:
>
> https://lore.kernel.org/lkml/20231115025755.GA29979@xxxxxxxxxxxxxxxxxxx/
>
> However, that work only covers migrations triggered by hinting faults.
> It would be much better to have a general mechanism that reduces the
> number of tlb flushes and that can ultimately be applied to any type
> of migration.
>
> I'm suggesting a mechanism called MIGRC, which stands for 'Migration
> Read Copy', that reduces the number of tlb flushes by deferring them
> until the source folios of a migration actually get reused, which is
> of course only possible if the target PTEs don't have write
> permission.
I consistently obtain a performance improvement of over 5% from the end
user's point of view, which I think is not negligible. Thoughts?
Byungchul
> To achieve that:
>
> 1. For folios that are mapped only by non-writable tlb entries, skip
> the tlb flush during migration and perform it just before the source
> folios actually get reused out of buddy or pcp.
>
> 2. When any such non-writable tlb entry changes to writable, e.g.
> through the fault handler, give up the migrc mechanism and perform the
> required tlb flush right away.
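>
> The following is a toy userspace model of the bookkeeping described
> above, just to illustrate the idea with made-up names; it is not the
> kernel code in the patches, which hooks into the unmap batching and
> page allocation paths instead:
>
>     /* Toy model of migrc's deferred-flush bookkeeping (not kernel code). */
>     #include <stdio.h>
>
>     static unsigned long migrc_gen;    /* generation of deferred flushes */
>     static unsigned long flushed_gen;  /* generations already flushed    */
>
>     struct toy_folio { unsigned long mgen; };  /* 0: nothing pending */
>
>     /* 1. Read-only-mapped source folio migrated: skip the flush, record it. */
>     static void defer_flush(struct toy_folio *src)
>     {
>             src->mgen = ++migrc_gen;
>     }
>
>     /* Flush every generation up to the folio's, if not done already. */
>     static void flush_if_pending(struct toy_folio *f)
>     {
>             if (f->mgen && f->mgen > flushed_gen) {
>                     printf("tlb flush covering generations %lu..%lu\n",
>                            flushed_gen + 1, f->mgen);
>                     flushed_gen = f->mgen;  /* stands in for the real flush */
>             }
>             f->mgen = 0;
>     }
>
>     /* 2. A mapping became writable in the fault path: give up, flush now. */
>     static void give_up(struct toy_folio *f)
>     {
>             flush_if_pending(f);
>     }
>
>     int main(void)
>     {
>             struct toy_folio a = { 0 }, b = { 0 };
>
>             defer_flush(&a);        /* two migrations, still no flush      */
>             defer_flush(&b);
>             flush_if_pending(&b);   /* 'b' leaves buddy: one flush, 1..2   */
>             give_up(&a);            /* 'a' turns writable: nothing left    */
>             return 0;
>     }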
>
> No matter what type of workload is used for performance evaluation,
> the result should be positive thanks to the unconditional reduction of
> tlb flushes, tlb misses and interrupts. For the test, I picked
> XSBench, which is widely used for performance analysis on high
> performance computing architectures - https://github.com/ANL-CESAR/XSBench.
>
> The result depends on the memory latency and on how often reclaim
> runs, which determine the tlb miss overhead and how many migrations
> happen. The slower the memory and the more often reclaim runs, the
> better migrc works and the better the result. On my system, the
> result shows:
>
> 1. itlb flushes are reduced by over 90%.
> 2. itlb misses are reduced by over 30%.
> 3. All the other tlb numbers improve as well.
> 4. tlb shootdown interrupts are reduced by over 90%.
> 5. The test program runtime is reduced by over 5%.
>
> The test environment:
>
> Architecture - x86_64
> QEMU - kvm enabled, host cpu
> Numa - 2 nodes (node 0: 16 CPUs, 1GB; node 1: no CPUs, 99GB)
> Linux Kernel - v6.9-rc4, numa balancing tiering on, demotion enabled
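>
> For reference, a topology like the above can be created roughly as
> follows; the exact invocation and knob values here are a
> reconstruction for illustration, not copied from the original setup:
>
> $ qemu-system-x86_64 -enable-kvm -cpu host -smp 16 -m 100G \
>       -object memory-backend-ram,id=m0,size=1G \
>       -object memory-backend-ram,id=m1,size=99G \
>       -numa node,nodeid=0,cpus=0-15,memdev=m0 \
>       -numa node,nodeid=1,memdev=m1 \
>       ...
>
> # in the guest (as root): numa balancing tiering mode and demotion
> $ echo 2 > /proc/sys/kernel/numa_balancing
> $ echo true > /sys/kernel/mm/numa/demotion_enabled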
>
> < measurement: raw data - tlb and interrupt numbers >
>
> $ perf stat -a \
> -e itlb.itlb_flush \
> -e tlb_flush.dtlb_thread \
> -e tlb_flush.stlb_any \
> -e dtlb-load-misses \
> -e dtlb-store-misses \
> -e itlb-load-misses \
> XSBench -t 16 -p 50000000
>
> $ grep "TLB shootdowns" /proc/interrupts
>
> BEFORE
> ------
> 40417078 itlb.itlb_flush
> 234852566 tlb_flush.dtlb_thread
> 153192357 tlb_flush.stlb_any
> 119001107892 dTLB-load-misses
> 307921167 dTLB-store-misses
> 1355272118 iTLB-load-misses
>
> TLB: 1364803 1303670 1333921 1349607
> 1356934 1354216 1332972 1342842
> 1350265 1316443 1355928 1360793
> 1298239 1326358 1343006 1340971
> TLB shootdowns
>
> AFTER
> -----
> 3316495 itlb.itlb_flush
> 138912511 tlb_flush.dtlb_thread
> 115199341 tlb_flush.stlb_any
> 117610390021 dTLB-load-misses
> 198042233 dTLB-store-misses
> 840066984 iTLB-load-misses
>
> TLB: 117257 119219 117178 115737
> 117967 118948 117508 116079
> 116962 117266 117320 117215
> 105808 103934 115672 117610
> TLB shootdowns
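>
> To spell out the arithmetic behind the headline claims from the raw
> numbers above: itlb.itlb_flush drops from 40417078 to 3316495, i.e.
> 1 - 3316495/40417078 ~= 0.92, about a 92% reduction, and the per-CPU
> TLB shootdown counts drop from roughly 1.30M-1.36M to roughly
> 104K-119K each, again over 90%. iTLB-load-misses drop from 1355272118
> to 840066984, about 38%.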
>
> < measurement: user experience - runtime >
>
> $ time XSBench -t 16 -p 50000000
>
> BEFORE
> ------
> Threads: 16
> Runtime: 968.783 seconds
> Lookups: 1,700,000,000
> Lookups/s: 1,754,778
>
> 15208.91s user 141.44s system 1564% cpu 16:20.98 total
>
> AFTER
> -----
> Threads: 16
> Runtime: 913.210 seconds
> Lookups: 1,700,000,000
> Lookups/s: 1,861,565
>
> 14351.69s user 138.23s system 1565% cpu 15:25.47 total
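>
> In other words, (968.783 - 913.210) / 968.783 ~= 0.057, i.e. roughly a
> 5.7% reduction in runtime, or equivalently 1861565 / 1754778 ~= 1.06,
> about 6% more lookups per second.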
>
> ---
>
> Changes from v8:
> 1. Rebase on the latest, v6.9-rc4.
> 2. Supplement comments and commit message.
> 3. Change which folios the migrc mechanism applies to:
>
> BEFORE - The source folios at demotion and promotion.
> AFTER - The source folios at any type of migration.
>
> 4. Change how the migrc mechanism works:
>
> BEFORE - Reduce tlb flushes by deferring folio_free() for
> source folios during demotion and promotion.
> AFTER - Reduce tlb flushes by deferring the tlb flush until
> the source folios actually get reused, out of pcp or
> buddy. The current version of migrc does *not* defer
> calling folio_free() but lets it proceed exactly as
> in the vanilla kernel, with the folios marked as
> 'tlb flush needed'. The flush is then handled when
> the page exits from pcp or buddy, so that vm stats,
> e.g. free pages, are not disturbed.
>
> Changes from v7:
> 1. Rewrite the cover letter to explain what the 'migrc' mechanism is.
> (feedback from Andrew Morton)
> 2. Supplement the commit message of the patch 'mm: Add APIs to
> free a folio directly to the buddy bypassing pcp'.
> (feedback from Andrew Morton)
>
> Changes from v6:
> 1. Fix build errors when
> CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH is disabled, by moving
> the migrc_flush_{start,end}() calls from arch code to
> try_to_unmap_flush() in mm/rmap.c.
>
> Changes from v5:
> 1. Fix build errors when CONFIG_MIGRATION is disabled or
> CONFIG_HWPOISON_INJECT is built as a module. (feedback from the
> kernel test robot and Raymond Jay Golo)
> 2. Organize the migrc code under two kconfigs, CONFIG_MIGRATION
> and CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH, roughly as
> sketched below.
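>
> A minimal sketch of what that gating could look like, with the helper
> signatures assumed for illustration (the real patches may differ):
>
>     /* mm/internal.h, hypothetical shape of the stubs */
>     #if defined(CONFIG_MIGRATION) && \
>         defined(CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH)
>     void migrc_flush_start(void);
>     void migrc_flush_end(struct tlbflush_unmap_batch *batch);
>     #else
>     static inline void migrc_flush_start(void) {}
>     static inline void migrc_flush_end(struct tlbflush_unmap_batch *batch) {}
>     #endif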
>
> Changes from v4:
>
> 1. Rebase on v6.7.
> 2. Fix build errors on arm64, which does nothing for the tlb flush
> itself but has CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH. (reported
> by the kernel test robot)
> 3. Don't use any page flag. The system gives up the migrc
> mechanism more often as a result, but that's okay; the final
> improvement is still good enough.
> 4. Instead, optimize the full tlb flush (arch_tlbbatch_flush()) by
> skipping CPUs for which the flush would be redundant.
>
> Changes from v3:
>
> 1. Don't use the kconfig CONFIG_MIGRC, and remove the sysctl knob
> migrc_enable. (feedback from Nadav)
> 2. Remove the optimization that skipped CPUs that had already
> performed the needed tlb flushes for whatever reason when migrc
> performs its tlb flushes, because I couldn't measure any
> performance difference with and without it.
> (feedback from Nadav)
> 3. Minimize arch-specific code. While at it, move all the migrc
> declarations and inline functions from include/linux/mm.h to
> mm/internal.h. (feedback from Dave Hansen, Nadav)
> 4. Separate the part that pauses migrc when the system is under
> high memory pressure into another patch. (feedback from Nadav)
> 5. Rename:
> a. arch_tlbbatch_clean() to arch_tlbbatch_clear(),
> b. tlb_ubc_nowr to tlb_ubc_ro,
> c. migrc_try_flush_free_folios() to migrc_flush_free_folios(),
> d. migrc_stop to migrc_pause.
> (feedback from Nadav)
> 6. Use the ->lru list_head instead of introducing a new llist_head.
> (feedback from Nadav)
> 7. Use non-atomic page-flag operations where it's safe.
> (feedback from Nadav)
> 8. Use the stack instead of keeping a pointer to 'struct migrc_req'
> in struct task, since it is only manipulated locally.
> (feedback from Nadav)
> 9. Convert a lot of simple functions into inline functions placed
> in a header, mm/internal.h. (feedback from Nadav)
> 10. Add additional sufficient comments. (feedback from Nadav)
> 11. Remove a lot of wrapper functions. (feedback from Nadav)
>
> Changes from RFC v2:
>
> 1. Remove the additional field in struct page. To do that, union
> migrc's list with the lru field and add a page flag instead. I
> know a page flag is something we don't like to add, but there
> was no choice because migrc has to distinguish folios under
> its control from others. To keep you guys from getting too
> angry, migrc is restricted to 64-bit systems.
> 2. Remove the internal object allocator that I had introduced to
> minimize the impact on the system; a ton of tests showed it
> made no difference.
> 3. Stop migrc from working when the system is under high memory
> pressure, e.g. about to perform direct reclaim. Under
> conditions where the swap mechanism was heavily used, I found
> the system suffered a regression without this control.
> 4. Exclude folios with pte_dirty() == true from migrc's interest
> so that migrc can work more simply.
> 5. Combine several tightly coupled patches into one.
> 6. Add sufficient comments for better review.
> 7. Manage migrc's requests per node (instead of globally).
> 8. Add the tlb miss improvement to the commit message.
> 9. Test with more CPUs (4 -> 16) to see a bigger improvement.
>
> Changes from RFC:
>
> 1. Fix a bug triggered when a destination folio of a previous
> migration becomes a source folio of the next migration before
> the folio has been handled properly, so that it can take part
> in another migration. Its state was inconsistent; that is now
> fixed.
> 2. Split the patch set into more pieces so that folks can review
> it more easily. (Feedback from Nadav Amit)
> 3. Fix a wrong usage of a barrier, e.g. smp_mb__after_atomic().
> (Feedback from Nadav Amit)
> 4. Tried to add sufficient comments to explain the patch set
> better. (Feedback from Nadav Amit)
>
> Byungchul Park (8):
> x86/tlb: add APIs manipulating tlb batch's arch data
> arm64: tlbflush: add APIs manipulating tlb batch's arch data
> mm/rmap: recognize read-only tlb entries during batched tlb flush
> x86/tlb, mm/rmap: separate arch_tlbbatch_clear() out of
> arch_tlbbatch_flush()
> mm: separate move/undo parts from migrate_pages_batch()
> mm: buddy: make room for a new variable, mgen, in struct page
> mm: add folio_put_mgen() to deliver migrc's generation number to pcp
> or buddy
> mm: defer tlb flush until the source folios at migration actually get
> used
>
> arch/arm64/include/asm/tlbflush.h | 18 ++
> arch/x86/include/asm/tlbflush.h | 18 ++
> arch/x86/mm/tlb.c | 2 -
> include/linux/mm.h | 22 ++
> include/linux/mm_types.h | 39 ++-
> include/linux/sched.h | 10 +
> mm/compaction.c | 10 +
> mm/internal.h | 85 +++++-
> mm/memory.c | 8 +
> mm/migrate.c | 487 ++++++++++++++++++++++++++----
> mm/page_alloc.c | 125 ++++++--
> mm/page_reporting.c | 10 +
> mm/rmap.c | 40 ++-
> mm/swap.c | 19 +-
> 14 files changed, 795 insertions(+), 98 deletions(-)
>
>
> base-commit: 0bbac3facb5d6cc0171c45c9873a2dc96bea9680
> --
> 2.17.1