Re: [v3 2/3] mm: Defer TLB flush by keeping both src and dst folios at migration

From: Byungchul Park
Date: Mon Oct 30 2023 - 05:59:08 EST


On Mon, Oct 30, 2023 at 09:00:56AM +0100, David Hildenbrand wrote:
> On 30.10.23 08:25, Byungchul Park wrote:
> > Implementation of CONFIG_MIGRC that stands for 'Migration Read Copy'.
> > We always face migration overhead at either promotion or demotion
> > while working with tiered memory, e.g. CXL memory, and found that TLB
> > shootdown is a large part of it that should be eliminated if possible.
> >
> > Fortunately, the TLB flush can be deferred or even skipped if both the
> > source and destination folios of a migration are kept until all
> > required TLB flushes have been done. Of course, this only works if the
> > target PTE entries are read-only or, more precisely, do not have write
> > permission. Otherwise, the folio might get corrupted.
> >
> > To achieve that:
> >
> > 1. For the folios that map only to non-writable TLB entries, prevent
> > TLB flush at migration by keeping both source and destination
> > folios, which will be handled later at a better time.
> >
> > 2. When any non-writable TLB entry changes to writable, e.g. through
> > the fault handler, give up the CONFIG_MIGRC mechanism and perform
> > the required TLB flush right away.
> >
> > 3. Temporarily stop migrc from working when the system is in very
> > high memory pressure e.g. direct reclaim needed.
> >
> > The measurement result:
> >
> > Architecture - x86_64
> > QEMU - kvm enabled, host cpu
> > Numa - 2 nodes (16 CPUs 1GB, no CPUs 8GB)
> > Linux Kernel - v6.6-rc5, numa balancing tiering on, demotion enabled
> > Benchmark - XSBench -p 50000000 (-p option makes the runtime longer)
> >
> > run 'perf stat' using events:
> > 1) itlb.itlb_flush
> > 2) tlb_flush.dtlb_thread
> > 3) tlb_flush.stlb_any
> > 4) dTLB-load-misses
> > 5) dTLB-store-misses
> > 6) iTLB-load-misses
> >
> > run 'cat /proc/vmstat' and pick:
> > 1) numa_pages_migrated
> > 2) pgmigrate_success
> > 3) nr_tlb_remote_flush
> > 4) nr_tlb_remote_flush_received
> > 5) nr_tlb_local_flush_all
> > 6) nr_tlb_local_flush_one
> >
> > BEFORE - mainline v6.6-rc5
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 20953405 itlb.itlb_flush
> > 114886593 tlb_flush.dtlb_thread
> > 88267015 tlb_flush.stlb_any
> > 115304095543 dTLB-load-misses
> > 163904743 dTLB-store-misses
> > 608486259 iTLB-load-misses
> >
> > 556.787113849 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
> > ...
> > numa_pages_migrated 3378748
> > pgmigrate_success 7720310
> > nr_tlb_remote_flush 751464
> > nr_tlb_remote_flush_received 10742115
> > nr_tlb_local_flush_all 21899
> > nr_tlb_local_flush_one 740157
> > ...
> >
> > AFTER - mainline v6.6-rc5 + CONFIG_MIGRC
> > ------------------------------------------
> > $ perf stat -a \
> > -e itlb.itlb_flush \
> > -e tlb_flush.dtlb_thread \
> > -e tlb_flush.stlb_any \
> > -e dTLB-load-misses \
> > -e dTLB-store-misses \
> > -e iTLB-load-misses \
> > ./XSBench -p 50000000
> >
> > Performance counter stats for 'system wide':
> >
> > 4353555 itlb.itlb_flush
> > 72482780 tlb_flush.dtlb_thread
> > 68226458 tlb_flush.stlb_any
> > 114331610808 dTLB-load-misses
> > 116084771 dTLB-store-misses
> > 377180518 iTLB-load-misses
> >
> > 552.667718220 seconds time elapsed
> >
> > $ cat /proc/vmstat
> >
>
> So, an improvement of 0.74% ? How stable are the results? Serious question:

I'm getting very stable results.

> worth the churn?

Yes. Ultimately, an improvement in total run time should be observed.
However, I've been focusing on the number of TLB flushes and TLB misses,
because how much the total time improves depends on the test conditions.
The gain should become visible when testing on a system that:

1. has more CPUs, which would induce a huge number of IPIs.
2. has slow memory, which makes the TLB miss overhead bigger.
3. runs workloads that suffer from TLB misses and IPI storms.
4. runs workloads that cause heavier NUMA migrations.
5. runs workloads that have a lot of read-only mappings.
6. and so on.

I will share the results once I manage to meet the conditions.
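
As a rough illustration of condition 1, the BEFORE vmstat counters quoted
above already hint at how a single remote flush fans out across CPUs. The
sketch below is not part of the patch. It assumes nr_tlb_remote_flush
counts flush requests sent and nr_tlb_remote_flush_received counts per-CPU
receipts, and the helper name avg_fanout is made up for this example.

```python
# Counters copied from the BEFORE /proc/vmstat output quoted above.
nr_tlb_remote_flush = 751464
nr_tlb_remote_flush_received = 10742115
nr_cpus = 16  # CPUs in the test VM, from the "Numa" line of the setup


def avg_fanout(sent, received):
    """Average number of CPUs that received each remote flush request.

    Hypothetical helper: assumes 'sent' counts requests and 'received'
    counts one event per CPU that handled a request.
    """
    return received / sent


fanout = avg_fanout(nr_tlb_remote_flush, nr_tlb_remote_flush_received)
print(f"avg CPUs interrupted per remote flush: {fanout:.2f} "
      f"(at most {nr_cpus - 1} other CPUs)")
```

With 16 CPUs the average is already close to the maximum of 15 other
CPUs, so under the same assumption the cost per flush would be expected
to grow with the CPU count.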

By the way, I should have included the IPI reduction as well, because it
also shows a very big delta :)
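
For reference, the relative changes in the perf stat numbers quoted above
can be reproduced with a few lines of Python. This is only a sketch over
the figures already in this thread. The helper name pct_reduction is made
up here, and the AFTER vmstat counters are trimmed from the quote, so the
IPI delta itself is not computed.

```python
# (before, after) pairs copied from the perf stat output quoted above.
counters = {
    "itlb.itlb_flush": (20953405, 4353555),
    "tlb_flush.dtlb_thread": (114886593, 72482780),
    "tlb_flush.stlb_any": (88267015, 68226458),
    "dTLB-load-misses": (115304095543, 114331610808),
    "dTLB-store-misses": (163904743, 116084771),
    "iTLB-load-misses": (608486259, 377180518),
    "seconds elapsed": (556.787113849, 552.667718220),
}


def pct_reduction(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return (before - after) / before * 100.0


reductions = {name: pct_reduction(b, a) for name, (b, a) in counters.items()}
for name, reduction in reductions.items():
    print(f"{name:24s} {reduction:6.2f}% lower")
```

The elapsed-time entry corresponds to the 0.74% figure mentioned above,
while the flush counters drop by a much larger fraction.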

> Or did I get the numbers wrong?
>
> > #define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index 5c02720c53a5..1ca2ac91aa14 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -135,6 +135,9 @@ enum pageflags {
> > #ifdef CONFIG_ARCH_USES_PG_ARCH_X
> > PG_arch_2,
> > PG_arch_3,
> > +#endif
> > +#ifdef CONFIG_MIGRC
> > + PG_migrc, /* Page has its copy under migrc's control */
> > #endif
> > __NR_PAGEFLAGS,
> > @@ -589,6 +592,10 @@ TESTCLEARFLAG(Young, young, PF_ANY)
> > PAGEFLAG(Idle, idle, PF_ANY)
> > #endif
> > +#ifdef CONFIG_MIGRC
> > +PAGEFLAG(Migrc, migrc, PF_ANY)
> > +#endif
>
> I assume you know this: new pageflags are frowned upon.

Sorry for that. I really didn't want to add a new headache.

Byungchul