Re: [mm] 23047a96d7: vm-scalability.throughput -23.8% regression

From: Ye Xiaolong
Date: Wed May 25 2016 - 02:07:58 EST


On Mon, May 23, 2016 at 04:46:05PM -0400, Johannes Weiner wrote:
>Hi,
>
>thanks for your report.
>
>On Tue, May 17, 2016 at 12:58:05PM +0800, kernel test robot wrote:
>> FYI, we noticed vm-scalability.throughput -23.8% regression due to commit:
>>
>> commit 23047a96d7cfcfca1a6d026ecaec526ea4803e9e ("mm: workingset: per-cgroup cache thrash detection")
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git master
>>
>> in testcase: vm-scalability
>> on test machine: lkp-hsw01: 56 threads Grantley Haswell-EP with 64G memory
>> with the following conditions: cpufreq_governor=performance/runtime=300s/test=lru-file-readtwice
>
>That test hammers the LRU activation path, to which this patch added
>the cgroup lookup and pinning code. Does the following patch help?
>

Hi,

Here is the comparison between the original first bad commit
(23047a96d) and your new patch (063f6715e):
vm-scalability.throughput improved by 11.3%.
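
For reference, lru-file-readtwice essentially streams large files
through the page cache twice; the second pass re-references every
cached page, so each one goes through mark_page_accessed() and
workingset_activation() on its way from the inactive to the active
list. A minimal userspace sketch of that access pattern (not the
actual vm-scalability harness; the file name is made up):

/*
 * Read a file through the page cache twice. The second pass
 * re-references every cached page, which activates it from the
 * inactive to the active LRU list.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	static char buf[1 << 20];	/* 1MiB read buffer */
	const char *path = argc > 1 ? argv[1] : "testfile";

	for (int pass = 0; pass < 2; pass++) {
		int fd = open(path, O_RDONLY);

		if (fd < 0) {
			perror("open");
			return 1;
		}
		while (read(fd, buf, sizeof(buf)) > 0)
			;	/* data discarded; only the cache effect matters */
		close(fd);
	}
	return 0;
}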


compiler/cpufreq_governor/kconfig/rootfs/runtime/tbox_group/test/testcase:
gcc-4.9/performance/x86_64-rhel/debian-x86_64-2015-02-07.cgz/300s/lkp-hsw01/lru-file-readtwice/vm-scalability

commit:
23047a96d7cfcfca1a6d026ecaec526ea4803e9e
063f6715e77a7be5770d6081fe6d7ca2437ac9f2

23047a96d7cfcfca            063f6715e77a7be5770d6081fe
----------------            --------------------------
  mean ± %stddev   %change    mean ± %stddev   metric
21621405 ± 0% +11.3% 24069657 ± 2% vm-scalability.throughput
1711141 ± 0% +40.9% 2411083 ± 2% vm-scalability.time.involuntary_context_switches
2747 ± 0% +2.4% 2812 ± 0% vm-scalability.time.maximum_resident_set_size
5243 ± 0% -1.2% 5180 ± 0% vm-scalability.time.percent_of_cpu_this_job_got
136.95 ± 1% +13.6% 155.55 ± 0% vm-scalability.time.user_time
208386 ± 0% -71.5% 59394 ± 16% vm-scalability.time.voluntary_context_switches
1.38 ± 2% +21.7% 1.69 ± 2% perf-profile.cycles-pp.kswapd
160522 ± 5% -30.0% 112342 ± 2% softirqs.SCHED
2536 ± 0% +7.3% 2722 ± 2% uptime.idle
1711141 ± 0% +40.9% 2411083 ± 2% time.involuntary_context_switches
136.95 ± 1% +13.6% 155.55 ± 0% time.user_time
208386 ± 0% -71.5% 59394 ± 16% time.voluntary_context_switches
1052 ± 13% +1453.8% 16346 ± 39% cpuidle.C1-HSW.usage
1045 ± 12% -54.3% 477.50 ± 25% cpuidle.C3-HSW.usage
5.719e+08 ± 1% +17.9% 6.743e+08 ± 0% cpuidle.C6-HSW.time
40424411 ± 2% -97.3% 1076732 ± 99% cpuidle.POLL.time
7179 ± 5% -99.9% 6.50 ± 53% cpuidle.POLL.usage
0.51 ± 8% -40.6% 0.30 ± 13% turbostat.CPU%c1
2.83 ± 2% +30.5% 3.70 ± 0% turbostat.CPU%c6
0.23 ± 79% +493.4% 1.35 ± 2% turbostat.Pkg%pc2
255.52 ± 0% +3.3% 263.95 ± 0% turbostat.PkgWatt
53.26 ± 0% +14.9% 61.22 ± 0% turbostat.RAMWatt
1836104 ± 0% +13.3% 2079934 ± 4% vmstat.memory.free
5.00 ± 0% -70.0% 1.50 ± 33% vmstat.procs.b
107.00 ± 0% +8.4% 116.00 ± 2% vmstat.procs.r
18866 ± 2% +40.1% 26436 ± 13% vmstat.system.cs
69056 ± 0% +11.8% 77219 ± 1% vmstat.system.in
31628132 ± 0% +80.9% 57224963 ± 0% meminfo.Active
31294504 ± 0% +81.7% 56876042 ± 0% meminfo.Active(file)
142271 ± 6% +11.2% 158138 ± 5% meminfo.DirectMap4k
30612825 ± 0% -87.2% 3915695 ± 0% meminfo.Inactive
30562772 ± 0% -87.4% 3862631 ± 0% meminfo.Inactive(file)
15635 ± 1% +38.0% 21572 ± 8% meminfo.KernelStack
22575 ± 2% +7.7% 24316 ± 4% meminfo.Mapped
1762372 ± 3% +12.2% 1976873 ± 3% meminfo.MemFree
847557 ± 0% +105.5% 1741958 ± 8% meminfo.SReclaimable
946378 ± 0% +95.1% 1846370 ± 8% meminfo.Slab
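
(The %change column is (new - old) / old; for the throughput line,
(24069657 - 21621405) / 21621405 ~= +11.3%.)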

Thanks,
Xiaolong
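
The key change, condensed from the diff below (same code, with the
comment block trimmed): workingset_activation() now pins the memcg
with a plain RCU read-side section and an inlined lruvec lookup
instead of lock_page_memcg():

void workingset_activation(struct page *page)
{
	struct mem_cgroup *memcg;
	struct lruvec *lruvec;

	rcu_read_lock();			/* pins the memcg; replaces lock_page_memcg() */
	memcg = page_memcg_rcu(page);		/* READ_ONCE(page->mem_cgroup) */
	if (!mem_cgroup_disabled() && !memcg)	/* e.g. VDSO pages have no memcg */
		goto out;
	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
	atomic_long_inc(&lruvec->inactive_age);
out:
	rcu_read_unlock();
}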

>From b535c630fd8954865b7536c915c3916beb3b4830 Mon Sep 17 00:00:00 2001
>From: Johannes Weiner <hannes@xxxxxxxxxxx>
>Date: Mon, 23 May 2016 16:14:24 -0400
>Subject: [PATCH] mm: fix vm-scalability regression in workingset_activation()
>
>23047a96d7cf ("mm: workingset: per-cgroup cache thrash detection")
>added cgroup lookup and pinning overhead to the LRU activation path,
>which the vm-scalability benchmark is particularly sensitive to.
>
>Inline the lookup functions to eliminate calls. Furthermore, since
>activations are not moved when pages are moved between memcgs, we
>don't need the full page->mem_cgroup locking; holding the RCU lock is
>enough to prevent the memcg from being freed.
>
>Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
>---
> include/linux/memcontrol.h | 43 ++++++++++++++++++++++++++++++++++++++++++-
> include/linux/mm.h         |  8 ++++++++
> mm/memcontrol.c            | 42 ------------------------------------------
> mm/workingset.c            | 10 ++++++----
> 4 files changed, 56 insertions(+), 47 deletions(-)
>
>diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>index a805474df4ab..0bb36cf89bf6 100644
>--- a/include/linux/memcontrol.h
>+++ b/include/linux/memcontrol.h
>@@ -306,7 +306,48 @@ void mem_cgroup_uncharge_list(struct list_head *page_list);
>
> void mem_cgroup_migrate(struct page *oldpage, struct page *newpage);
>
>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *, struct mem_cgroup *);
>+static inline struct mem_cgroup_per_zone *
>+mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>+{
>+	int nid = zone_to_nid(zone);
>+	int zid = zone_idx(zone);
>+
>+	return &memcg->nodeinfo[nid]->zoneinfo[zid];
>+}
>+
>+/**
>+ * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>+ * @zone: zone of the wanted lruvec
>+ * @memcg: memcg of the wanted lruvec
>+ *
>+ * Returns the lru list vector holding pages for the given @zone and
>+ * @mem. This can be the global zone lruvec, if the memory controller
>+ * is disabled.
>+ */
>+static inline struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>+						    struct mem_cgroup *memcg)
>+{
>+	struct mem_cgroup_per_zone *mz;
>+	struct lruvec *lruvec;
>+
>+	if (mem_cgroup_disabled()) {
>+		lruvec = &zone->lruvec;
>+		goto out;
>+	}
>+
>+	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>+	lruvec = &mz->lruvec;
>+out:
>+	/*
>+	 * Since a node can be onlined after the mem_cgroup was created,
>+	 * we have to be prepared to initialize lruvec->zone here;
>+	 * and if offlined then reonlined, we need to reinitialize it.
>+	 */
>+	if (unlikely(lruvec->zone != zone))
>+		lruvec->zone = zone;
>+	return lruvec;
>+}
>+
> struct lruvec *mem_cgroup_page_lruvec(struct page *, struct zone *);
>
> bool task_in_mem_cgroup(struct task_struct *task, struct mem_cgroup *memcg);
>diff --git a/include/linux/mm.h b/include/linux/mm.h
>index b530c99e8e81..a9dd54e196a7 100644
>--- a/include/linux/mm.h
>+++ b/include/linux/mm.h
>@@ -943,11 +943,19 @@ static inline struct mem_cgroup *page_memcg(struct page *page)
> {
> 	return page->mem_cgroup;
> }
>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>+{
>+	return READ_ONCE(page->mem_cgroup);
>+}
> #else
> static inline struct mem_cgroup *page_memcg(struct page *page)
> {
> 	return NULL;
> }
>+static inline struct mem_cgroup *page_memcg_rcu(struct page *page)
>+{
>+	return NULL;
>+}
> #endif
>
> /*
>diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>index b3f16ab4b431..f65e5e527864 100644
>--- a/mm/memcontrol.c
>+++ b/mm/memcontrol.c
>@@ -323,15 +323,6 @@ EXPORT_SYMBOL(memcg_kmem_enabled_key);
>
> #endif /* !CONFIG_SLOB */
>
>-static struct mem_cgroup_per_zone *
>-mem_cgroup_zone_zoneinfo(struct mem_cgroup *memcg, struct zone *zone)
>-{
>-	int nid = zone_to_nid(zone);
>-	int zid = zone_idx(zone);
>-
>-	return &memcg->nodeinfo[nid]->zoneinfo[zid];
>-}
>-
> /**
>  * mem_cgroup_css_from_page - css of the memcg associated with a page
>  * @page: page of interest
>@@ -944,39 +935,6 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
> 	     iter = mem_cgroup_iter(NULL, iter, NULL))
>
> /**
>- * mem_cgroup_zone_lruvec - get the lru list vector for a zone and memcg
>- * @zone: zone of the wanted lruvec
>- * @memcg: memcg of the wanted lruvec
>- *
>- * Returns the lru list vector holding pages for the given @zone and
>- * @mem. This can be the global zone lruvec, if the memory controller
>- * is disabled.
>- */
>-struct lruvec *mem_cgroup_zone_lruvec(struct zone *zone,
>-				      struct mem_cgroup *memcg)
>-{
>-	struct mem_cgroup_per_zone *mz;
>-	struct lruvec *lruvec;
>-
>-	if (mem_cgroup_disabled()) {
>-		lruvec = &zone->lruvec;
>-		goto out;
>-	}
>-
>-	mz = mem_cgroup_zone_zoneinfo(memcg, zone);
>-	lruvec = &mz->lruvec;
>-out:
>-	/*
>-	 * Since a node can be onlined after the mem_cgroup was created,
>-	 * we have to be prepared to initialize lruvec->zone here;
>-	 * and if offlined then reonlined, we need to reinitialize it.
>-	 */
>-	if (unlikely(lruvec->zone != zone))
>-		lruvec->zone = zone;
>-	return lruvec;
>-}
>-
>-/**
>  * mem_cgroup_page_lruvec - return lruvec for isolating/putting an LRU page
>  * @page: the page
>  * @zone: zone of the page
>diff --git a/mm/workingset.c b/mm/workingset.c
>index 8a75f8d2916a..8252de4566e9 100644
>--- a/mm/workingset.c
>+++ b/mm/workingset.c
>@@ -305,9 +305,10 @@ bool workingset_refault(void *shadow)
>  */
> void workingset_activation(struct page *page)
> {
>+	struct mem_cgroup *memcg;
> 	struct lruvec *lruvec;
>
>-	lock_page_memcg(page);
>+	rcu_read_lock();
> 	/*
> 	 * Filter non-memcg pages here, e.g. unmap can call
> 	 * mark_page_accessed() on VDSO pages.
>@@ -315,12 +316,13 @@ void workingset_activation(struct page *page)
> 	 * XXX: See workingset_refault() - this should return
> 	 * root_mem_cgroup even for !CONFIG_MEMCG.
> 	 */
>-	if (!mem_cgroup_disabled() && !page_memcg(page))
>+	memcg = page_memcg_rcu(page);
>+	if (!mem_cgroup_disabled() && !memcg)
> 		goto out;
>-	lruvec = mem_cgroup_zone_lruvec(page_zone(page), page_memcg(page));
>+	lruvec = mem_cgroup_zone_lruvec(page_zone(page), memcg);
> 	atomic_long_inc(&lruvec->inactive_age);
> out:
>-	unlock_page_memcg(page);
>+	rcu_read_unlock();
> }
>
> /*
>--
>2.8.2
>