Re: [mm, thp] 85b9f46e8e: vm-scalability.throughput -8.7% regression

From: David Rientjes
Date: Tue Oct 20 2020 - 14:19:54 EST


On Tue, 20 Oct 2020, Huang, Ying wrote:

> >> =========================================================================================
> >> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
> >> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/1T/lkp-skl-fpga01/lru-shm/vm-scalability/0x2006906
> >>
> >> commit:
> >> dcdf11ee14 ("mm, shmem: add vmstat for hugepage fallback")
> >> 85b9f46e8e ("mm, thp: track fallbacks due to failed memcg charges separately")
> >>
> >> dcdf11ee14413332 85b9f46e8ea451633ccd60a7d8c
> >> ---------------- ---------------------------
> >> fail:runs %reproduction fail:runs
> >> | | |
> >> 1:4 24% 2:4 perf-profile.calltrace.cycles-pp.sync_regs.error_entry.do_access
> >> 3:4 53% 5:4 perf-profile.calltrace.cycles-pp.error_entry.do_access
> >> 9:4 -27% 8:4 perf-profile.children.cycles-pp.error_entry
> >> 4:4 -10% 4:4 perf-profile.self.cycles-pp.error_entry
> >> %stddev %change %stddev
> >> \ | \
> >> 477291 -9.1% 434041 vm-scalability.median
> >> 49791027 -8.7% 45476799 vm-scalability.throughput
> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time
> >> 223.67 +1.6% 227.36 vm-scalability.time.elapsed_time.max
> >> 50364 ± 6% +24.1% 62482 ± 10% vm-scalability.time.involuntary_context_switches
> >> 2237 +7.8% 2412 vm-scalability.time.percent_of_cpu_this_job_got
> >> 3084 +18.2% 3646 vm-scalability.time.system_time
> >> 1921 -4.2% 1839 vm-scalability.time.user_time
> >> 13.68 +2.2 15.86 mpstat.cpu.all.sys%
> >> 28535 ± 30% -47.0% 15114 ± 79% numa-numastat.node0.other_node
> >> 142734 ± 11% -19.4% 115000 ± 17% numa-meminfo.node0.AnonPages
> >> 11168 ± 3% +8.8% 12150 ± 5% numa-meminfo.node1.PageTables
> >> 76.00 -1.6% 74.75 vmstat.cpu.id
> >> 3626 -1.9% 3555 vmstat.system.cs
> >> 2214928 ±166% -96.6% 75321 ± 7% cpuidle.C1.usage
> >> 200981 ± 7% -18.0% 164861 ± 7% cpuidle.POLL.time
> >> 52675 ± 3% -16.7% 43866 ± 10% cpuidle.POLL.usage
> >> 35659 ± 11% -19.4% 28754 ± 17% numa-vmstat.node0.nr_anon_pages
> >> 1248014 ± 3% +10.9% 1384236 numa-vmstat.node1.nr_mapped
> >> 2722 ± 4% +10.6% 3011 ± 5% numa-vmstat.node1.nr_page_table_pages
> >
> > I'm not sure that I'm reading this correctly, but I suspect that this
> > just happens because of NUMA: memory affinity will obviously impact
> > vm-scalability.throughput quite substantially, but I don't think the
> > bisected commit can be to blame. Commit 85b9f46e8ea4 ("mm, thp: track
> > fallbacks due to failed memcg charges separately") simply adds new
> > count_vm_event() calls in a couple of areas to track thp fallback due
> > to memcg limits separately from fragmentation.
> >
> > It's likely a question about the testing methodology in general: for
> > memory-intensive benchmarks, I suggest they be configured so that we
> > can expect consistent memory access latency at the hardware level when
> > running on a NUMA system.
>
> So you think it's better to bind processes to a NUMA node or CPU? But we
> want to use this test case to capture NUMA/CPU placement/balancing issues
> too.
>

No, because binding to a specific socket may cause other performance
"improvements" or "degradations" depending on how fragmented local memory
is, or whether or not it's under memory pressure. Is the system rebooted
before testing so that we have a consistent state of memory availability
and fragmentation across sockets?
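
To be concrete about why I don't think the commit itself can be
responsible: it only adds bookkeeping on the path where the memcg charge
fails and we fall back from a hugepage. From memory, not quoting the
literal diff, the new code in the anonymous THP fault path is of this
form:

	if (mem_cgroup_try_charge_delay(page, vma->vm_mm, gfp, &memcg, true)) {
		put_page(page);
		count_vm_event(THP_FAULT_FALLBACK);
		/* new in 85b9f46e8ea4: memcg fallbacks counted separately */
		count_vm_event(THP_FAULT_FALLBACK_CHARGE);
		return VM_FAULT_FALLBACK;
	}

There is no change to allocation or reclaim behavior, only to the
counters exported through /proc/vmstat.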

> 0day solves the problem in another way: we run the test case multiple
> times, calculate the average and standard deviation, and then compare.
>

Depending on fragmentation or memory availability at the time of each
run, any benchmark whose results can be impacted by hugepage backing may
be adversely affected, so averaging across runs doesn't remove that
source of variance.
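
If it would help to confirm that, one thing that could be recorded around
each run is the set of thp_* counters in /proc/vmstat (thp_fault_fallback,
thp_fault_fallback_charge, thp_file_alloc, and friends, assuming the
kernel under test exposes them); if their per-run deltas vary
significantly, that points to differences in hugepage backing rather than
to the bisected commit. A minimal sketch of such a snapshot tool:

	/*
	 * Dump the thp_* counters from /proc/vmstat; run it before and
	 * after the benchmark and diff the two snapshots.
	 */
	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/vmstat", "r");

		if (!f) {
			perror("/proc/vmstat");
			return 1;
		}
		while (fgets(line, sizeof(line), f))
			if (!strncmp(line, "thp_", 4))
				fputs(line, stdout);
		fclose(f);
		return 0;
	}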