Re: [LKP] [mm] 23047a96d7: vm-scalability.throughput -23.8% regression

From: Ye Xiaolong
Date: Fri May 27 2016 - 04:13:12 EST


On Wed, May 25, 2016 at 02:06:17PM +0800, Ye Xiaolong wrote:
>On Mon, May 23, 2016 at 04:46:05PM -0400, Johannes Weiner wrote:
>>Hi,
>>
>>thanks for your report.
>>
>>On Tue, May 17, 2016 at 12:58:05PM +0800, kernel test robot wrote:
>>> FYI, we noticed vm-scalability.throughput -23.8% regression due to commit:
>>>
>>> commit 23047a96d7cfcfca1a6d026ecaec526ea4803e9e ("mm: workingset: per-cgroup cache thrash detection")
>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>>
>>> in testcase: vm-scalability
>>> on test machine: lkp-hsw01: 56 threads Grantley Haswell-EP with 64G memory
>>> with following conditions: cpufreq_governor=performance/runtime=300s/test=lru-file-readtwice
>>
>>That test hammers the LRU activation path, to which this patch added
>>the cgroup lookup and pinning code. Does the following patch help?
>>

Hi, Johannes

FYI, I have run more tests with your fix patch.

1) Applied it on top of the latest kernel (head commit: 478a1469 ("Merge tag 'dax-locking-for-4.7' of
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm")).

The following is the comparison among the first bad commit's parent, the first
bad commit, the head commit of Linus' master branch, and your fix commit (a7abed95):

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability

commit:
612e44939c3c77245ac80843c0c7876c8cf97282
23047a96d7cfcfca1a6d026ecaec526ea4803e9e
478a1469a7d27fe6b2f85fc801ecdeb8afc836e6
a7abed950afdc1186d4eaf442b7eb296ff04c947

612e44939c3c7724 23047a96d7cfcfca1a6d026eca 478a1469a7d27fe6b2f85fc801 a7abed950afdc1186d4eaf442b
---------------- -------------------------- -------------------------- --------------------------
%stddev %change %stddev %change %stddev %change %stddev
28384711 ± 0% -23.8% 21621405 ± 0% -12.4% 24865101 ± 4% -8.1% 26076417 ± 3% vm-scalability.throughput
1854112 ± 0% -7.7% 1711141 ± 0% +6.4% 1973257 ± 4% +9.2% 2025214 ± 3% vm-scalability.time.involuntary_context_switches
5279 ± 0% -0.7% 5243 ± 0% -2.6% 5143 ± 0% -2.4% 5153 ± 0% vm-scalability.time.percent_of_cpu_this_job_got
16267 ± 0% -0.6% 16173 ± 0% -2.0% 15934 ± 0% -1.8% 15978 ± 0% vm-scalability.time.system_time
176.03 ± 0% -22.2% 136.95 ± 1% -10.4% 157.66 ± 1% -11.2% 156.32 ± 0% vm-scalability.time.user_time
302905 ± 2% -31.2% 208386 ± 0% +5.8% 320618 ± 47% -36.0% 193991 ± 22% vm-scalability.time.voluntary_context_switches
0.92 ± 2% +51.0% 1.38 ± 2% +96.5% 1.80 ± 0% +97.3% 1.81 ± 0% perf-profile.cycles-pp.kswapd
2585 ± 1% -1.9% 2536 ± 0% +9.6% 2834 ± 1% +10.7% 2862 ± 1% uptime.idle
754212 ± 1% -29.2% 533832 ± 2% -34.8% 491397 ± 2% -27.5% 546666 ± 8% softirqs.RCU
151918 ± 8% +5.7% 160522 ± 5% -17.4% 125419 ± 18% -22.7% 117409 ± 7% softirqs.SCHED
176.03 ± 0% -22.2% 136.95 ± 1% -10.4% 157.66 ± 1% -11.2% 156.32 ± 0% time.user_time
302905 ± 2% -31.2% 208386 ± 0% +5.8% 320618 ± 47% -36.0% 193991 ± 22% time.voluntary_context_switches
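
(Note on reading these tables: each %change column is relative to the first
commit, 612e4493. For example, for vm-scalability.throughput above,
21621405/28384711 - 1 ~= -23.8%, and 26076417/28384711 - 1 ~= -8.1%.)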


2) Applied it on top of v4.6 (head commit: 2dcd0af5 ("Linux 4.6")).

The following is the comparison among the first bad commit's parent, the first
bad commit, v4.6, and your fix commit (c05f8814):

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability

commit:
612e44939c3c77245ac80843c0c7876c8cf97282
23047a96d7cfcfca1a6d026ecaec526ea4803e9e
v4.6
c05f8814641ceabbc628cd4edc7f64ff58498d5a

612e44939c3c7724 23047a96d7cfcfca1a6d026eca v4.6 c05f8814641ceabbc628cd4edc
---------------- -------------------------- -------------------------- --------------------------
%stddev %change %stddev %change %stddev %change %stddev
28384711 ± 0% -23.8% 21621405 ± 0% -18.9% 23013011 ± 0% -19.2% 22937943 ± 0% vm-scalability.throughput
1854112 ± 0% -7.7% 1711141 ± 0% -5.2% 1757124 ± 0% -4.9% 1762398 ± 0% vm-scalability.time.involuntary_context_switches
66021 ± 0% -0.4% 65745 ± 0% +1.8% 67231 ± 1% +3.4% 68291 ± 0% vm-scalability.time.minor_page_faults
5279 ± 0% -0.7% 5243 ± 0% -1.6% 5197 ± 0% -1.5% 5198 ± 0% vm-scalability.time.percent_of_cpu_this_job_got
16267 ± 0% -0.6% 16173 ± 0% -1.5% 16030 ± 0% -1.4% 16032 ± 0% vm-scalability.time.system_time
176.03 ± 0% -22.2% 136.95 ± 1% -17.8% 144.66 ± 1% -17.0% 146.15 ± 1% vm-scalability.time.user_time
302905 ± 2% -31.2% 208386 ± 0% -19.4% 244167 ± 1% -19.3% 244584 ± 1% vm-scalability.time.voluntary_context_switches
23999 ± 2% -5.9% 22575 ± 2% -7.1% 22291 ± 3% -7.0% 22319 ± 1% meminfo.Mapped
0.92 ± 2% +51.0% 1.38 ± 2% +96.8% 1.81 ± 0% +95.5% 1.79 ± 0% perf-profile.cycles-pp.kswapd
754212 ± 1% -29.2% 533832 ± 2% -41.1% 444019 ± 4% -42.8% 431345 ± 0% softirqs.RCU
20518 ± 2% -8.1% 18866 ± 2% -2.6% 19980 ± 3% -2.9% 19926 ± 7% vmstat.system.cs
10574 ± 19% +29.9% 13737 ± 8% +1.6% 10740 ± 33% +16.9% 12359 ± 17% numa-meminfo.node0.Mapped
13490 ± 13% -36.6% 8549 ± 17% -13.3% 11689 ± 29% -26.5% 9912 ± 22% numa-meminfo.node1.Mapped
176.03 ± 0% -22.2% 136.95 ± 1% -17.8% 144.66 ± 1% -17.0% 146.15 ± 1% time.user_time
302905 ± 2% -31.2% 208386 ± 0% -19.4% 244167 ± 1% -19.3% 244584 ± 1% time.voluntary_context_switches


Thanks,
Xiaolong
>
>Hi,
>
>Here is the comparison of the original first bad commit (23047a96d) and your new patch (063f6715e);
>vm-scalability.throughput improved by 11.3%.
>
>
>compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
> gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability
>
>commit:
> 23047a96d7cfcfca1a6d026ecaec526ea4803e9e
> 063f6715e77a7be5770d6081fe6d7ca2437ac9f2
>
>23047a96d7cfcfca 063f6715e77a7be5770d6081fe
>---------------- --------------------------
> %stddev %change %stddev
> 21621405 ± 0% +11.3% 24069657 ± 2% vm-scalability.throughput
> 1711141 ± 0% +40.9% 2411083 ± 2% vm-scalability.time.involuntary_context_switches
> 2747 ± 0% +2.4% 2812 ± 0% vm-scalability.time.maximum_resident_set_size
> 5243 ± 0% -1.2% 5180 ± 0% vm-scalability.time.percent_of_cpu_this_job_got
> 136.95 ± 1% +13.6% 155.55 ± 0% vm-scalability.time.user_time
> 208386 ± 0% -71.5% 59394 ± 16% vm-scalability.time.voluntary_context_switches
> 1.38 ± 2% +21.7% 1.69 ± 2% perf-profile.cycles-pp.kswapd
> 160522 ± 5% -30.0% 112342 ± 2% softirqs.SCHED
> 2536 ± 0% +7.3% 2722 ± 2% uptime.idle
> 1711141 ± 0% +40.9% 2411083 ± 2% time.involuntary_context_switches
> 136.95 ± 1% +13.6% 155.55 ± 0% time.user_time
> 208386 ± 0% -71.5% 59394 ± 16% time.voluntary_context_switches
> 1052 ± 13% +1453.8% 16346 ± 39% cpuidle.C1-HSW.usage
> 1045 ± 12% -54.3% 477.50 ± 25% cpuidle.C3-HSW.usage
> 5.719e+08 ± 1% +17.9% 6.743e+08 ± 0% cpuidle.C6-HSW.time
> 40424411 ± 2% -97.3% 1076732 ± 99% cpuidle.POLL.time
> 7179 ± 5% -99.9% 6.50 ± 53% cpuidle.POLL.usage
> 0.51 ± 8% -40.6% 0.30 ± 13% turbostat.CPU%c1
> 2.83 ± 2% +30.5% 3.70 ± 0% turbostat.CPU%c6
> 0.23 ± 79% +493.4% 1.35 ± 2% turbostat.Pkg%pc2
> 255.52 ± 0% +3.3% 263.95 ± 0% turbostat.PkgWatt
> 53.26 ± 0% +14.9% 61.22 ± 0% turbostat.RAMWatt
> 1836104 ± 0% +13.3% 2079934 ± 4% vmstat.memory.free
> 5.00 ± 0% -70.0% 1.50 ± 33% vmstat.procs.b
> 107.00 ± 0% +8.4% 116.00 ± 2% vmstat.procs.r
> 18866 ± 2% +40.1% 26436 ± 13% vmstat.system.cs
> 69056 ± 0% +11.8% 77219 ± 1% vmstat.system.in
> 31628132 ± 0% +80.9% 57224963 ± 0% meminfo.Active
> 31294504 ± 0% +81.7% 56876042 ± 0% meminfo.Active(file)
> 142271 ± 6% +11.2% 158138 ± 5% meminfo.DirectMap4k
> 30612825 ± 0% -87.2% 3915695 ± 0% meminfo.Inactive
> 30562772 ± 0% -87.4% 3862631 ± 0% meminfo.Inactive(file)
> 15635 ± 1% +38.0% 21572 ± 8% meminfo.KernelStack
> 22575 ± 2% +7.7% 24316 ± 4% meminfo.Mapped
> 1762372 ± 3% +12.2% 1976873 ± 3% meminfo.MemFree
> 847557 ± 0% +105.5% 1741958 ± 8% meminfo.SReclaimable
> 946378 ± 0% +95.1% 1846370 ± 8% meminfo.Slab
>
>Thanks,
>Xiaolong
>
>>From b535c630fd8954865b7536c915c3916beb3b4830 Mon Sep 17 00:00:00 2001
>>From: Johannes Weiner <hannes@xxxxxxxxxxx>
>>Date: Mon, 23 May 2016 16:14:24 -0400
>>Subject: [PATCH] mm: fix vm-scalability regression in workingset_activation()
>>
>>23047a96d7cf ("mm: workingset: per-cgroup cache thrash detection")
>>added cgroup lookup and pinning overhead to the LRU activation path,
>>which the vm-scalability benchmark is particularly sensitive to.
>>
>>Inline the lookup functions to eliminate calls. Furthermore, since
>>activations are not moved when pages are moved between memcgs, we
>>don't need the full page->mem_cgroup locking; holding the RCU lock is
>>enough to prevent the memcg from being freed.
>>
>>Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>>---
>> include/linux/memcontrol.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
>> include/linux/mm.h | 8 ++++++++
>> mm/memcontrol.c | 42 ------------------------------------------
>> mm/workingset.c | 10 ++++++----
>> 4 files changed, 56 insertions(+), 47 deletions(-)
>>
>>diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>index a805474df4ab..0bb36cf89bf6 100644
>>--- a/include/linux/memcontrol.h
>>+++ b/include/linux/memcontrol.h
>>@@ -306,7 +306,48 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
>>
>> void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
>>
>>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>>+static inline struct mem_cgroup_per_zone *
>>+mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>>+{
>>+ int nid = zone_to_nid(zone);
>>+ int zid = zone_idx(zone);
>>+
>>+ return &memcg->nodeinfo[nid]->zoneinfo[zid];
>>+}
>>+
>>+/**
>>+ * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>>+ * @zone: zone of the wanted lruvec
>>+ * @memcg: memcg of the wanted lruvec
>>+ *
>>+ * Returns the lru list vector holding pages for the given @zone and
>>+ * @mem. This can be the global zone lruvec, if the memory controller
>>+ * is disabled.
>>+ */
>>+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>>+ struct mem_cgroup *memcg)
>>+{
>>+ struct mem_cgroup_per_zone *mz;
>>+ struct lruvec *lruvec;
>>+
>>+ if (mem_cgroup_disabled()) {
>>+ lruvec = &zone->lruvec;
>>+ goto out;
>>+ }
>>+
>>+ mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>>+ lruvec = &mz->lruvec;
>>+out:
>>+ /*
>>+ * Since a node can be onlined after the mem_cgroup was created,
>>+ * we have to be prepared to initialize lruvec->zone here;
>>+ * and if offlined then reonlined, we need to reinitialize it.
>>+ */
>>+ if (unlikely(lruvec->zone != zone))
>>+ lruvec->zone = zone;
>>+ return lruvec;
>>+}
>>+
>> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>>
>> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>>diff --git a/include/linux/mm.h b/include/linux/mm.h
>>index b530c99e8e81..a9dd54e196a7 100644
>>--- a/include/linux/mm.h
>>+++ b/include/linux/mm.h
>>@@ -943,11 +943,19 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
>> {
>> return page->mem_cgroup;
>> }
>>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>>+{
>>+ return READ_ONCE(page->mem_cgroup);
>>+}
>> #else
>> static inline struct mem_cgroup *page_memcg(struct page *page)
>> {
>> return NULL;
>> }
>>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>>+{
>>+ return NULL;
>>+}
>> #endif
>>
>> /*
>>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>index b3f16ab4b431..f65e5e527864 100644
>>--- a/mm/memcontrol.c
>>+++ b/mm/memcontrol.c
>>@@ -323,15 +323,6 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>>
>> #endif /* !CONFIG_SLOB */
>>
>>-static struct mem_cgroup_per_zone *
>>-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>>-{
>>- int nid = zone_to_nid(zone);
>>- int zid = zone_idx(zone);
>>-
>>- return &memcg->nodeinfo[nid]->zoneinfo[zid];
>>-}
>>-
>> /**
>> * mem_cgroup_css_from_page - css of the memcg associated with a page
>> * @page: page of interest
>>@@ -944,39 +935,6 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
>> iter = mem_cgroup_iter(NULL, iter, NULL))
>>
>> /**
>>- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>>- * @zone: zone of the wanted lruvec
>>- * @memcg: memcg of the wanted lruvec
>>- *
>>- * Returns the lru list vector holding pages for the given @zone and
>>- * @mem. This can be the global zone lruvec, if the memory controller
>>- * is disabled.
>>- */
>>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>>- struct mem_cgroup *memcg)
>>-{
>>- struct mem_cgroup_per_zone *mz;
>>- struct lruvec *lruvec;
>>-
>>- if (mem_cgroup_disabled()) {
>>- lruvec = &zone->lruvec;
>>- goto out;
>>- }
>>-
>>- mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>>- lruvec = &mz->lruvec;
>>-out:
>>- /*
>>- * Since a node can be onlined after the mem_cgroup was created,
>>- * we have to be prepared to initialize lruvec->zone here;
>>- * and if offlined then reonlined, we need to reinitialize it.
>>- */
>>- if (unlikely(lruvec->zone != zone))
>>- lruvec->zone = zone;
>>- return lruvec;
>>-}
>>-
>>-/**
>> * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>> * @page: the page
>> * @zone: zone of the page
>>diff --git a/mm/workingset.c b/mm/workingset.c
>>index 8a75f8d2916a..8252de4566e9 100644
>>--- a/mm/workingset.c
>>+++ b/mm/workingset.c
>>@@ -305,9 +305,10 @@ bool workingset_refault(void *shadow)
>> */
>> void workingset_activation(struct page *page)
>> {
>>+ struct mem_cgroup *memcg;
>> struct lruvec *lruvec;
>>
>>- lock_page_memcg(page);
>>+ rcu_read_lock();
>> /*
>> * Filter non-memcg pages here, e.g. unmap can call
>> * mark_page_accessed() on VDSO pages.
>>@@ -315,12 +316,13 @@ void workingset_activation(struct page *page)
>> * XXX: See workingset_refault() - this should return
>> * root_mem_cgroup even for !CONFIG_MEMCG.
>> */
>>- if (!mem_cgroup_disabled() && !page_memcg(page))
>>+ memcg = page_memcg_rcu(page);
>>+ if (!mem_cgroup_disabled() && !memcg)
>> goto out;
>>- lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
>>+ lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
>> atomic_long_inc(&lruvec->inactive_age);
>> out:
>>- unlock_page_memcg(page);
>>+ rcu_read_unlock();
>> }
>>
>> /*
>>--
>>2.8.2
>>
>_______________________________________________
>LKP mailing list
>LKP@xxxxxxxxxxxx
>https://lists.01.org/mailman/listinfo/lkp