Re: [LKP] Re: [mm/memcg] bd0b230fe1: will-it-scale.per_process_ops -22.7% regression

From: Waiman Long
Date: Thu Nov 12 2020 - 11:44:07 EST


On 11/12/20 9:16 AM, Michal Hocko wrote:
On Thu 12-11-20 20:28:44, Feng Tang wrote:
Hi Michal,

On Wed, Nov 04, 2020 at 09:15:46AM +0100, Michal Hocko wrote:
Hi Michal,

We used the default configuration of cgroups; not sure what configuration you
want, could you give me more details? And here is the cgroup info of the
will-it-scale process:

$ cat /proc/3042/cgroup
12:hugetlb:/
11:memory:/system.slice/lkp-bootstrap.service
OK, this means that memory controller is enabled and in use. Btw. do you
get the original performance if you add one phony page_counter after the
union?

I added one phony page_counter after the union and re-tested; the regression
was reduced to -1.2%. It looks like the regression is caused by the data
structure layout change.
Thanks for double checking. Could you try to cache-align the
page_counter struct? If that helps then we should figure out which counters
act against each other by adding the alignment between the respective
counters.
We tried the patch below to make 'page_counter' cacheline-aligned.
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index bab7e57..9efa6f7 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -26,7 +26,7 @@ struct page_counter {
/* legacy */
unsigned long watermark;
unsigned long failcnt;
-};
+} ____cacheline_internodealigned_in_smp;
and with it, the -22.7% performance change turns into a small -1.7%, which
confirms that the regression is caused by the change in data alignment.
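
For context, '____cacheline_internodealigned_in_smp' normally just means
L1-cache-line alignment. A rough sketch of its definition (abridged from
include/linux/cache.h; worth re-checking in the exact tree, since e.g.
CONFIG_X86_VSMP raises the internode alignment to a full page):

#ifndef INTERNODE_CACHE_SHIFT
#define INTERNODE_CACHE_SHIFT L1_CACHE_SHIFT	/* typically 6, i.e. 64 bytes */
#endif

#if defined(CONFIG_SMP)
#define ____cacheline_internodealigned_in_smp \
	__attribute__((__aligned__(1 << (INTERNODE_CACHE_SHIFT))))
#else
#define ____cacheline_internodealigned_in_smp
#endif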

After the patch, the size of 'page_counter' increases from 104 bytes to 128
bytes, and the size of 'mem_cgroup' increases from 2880 bytes to 3008
bytes (with our kernel config). Another major data structure that
contains 'page_counter' is 'hugetlb_cgroup', whose size grows
from 912 bytes to 1024 bytes.
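
As a quick cross-check of those numbers: a struct aligned to 64 bytes must
also have a size that is a multiple of 64, so 104 bytes rounds up to 128.
The resulting layouts and padding can be inspected with pahole on a vmlinux
built with debug info, e.g.:

$ pahole -C page_counter vmlinux
$ pahole -C mem_cgroup vmlinux
$ pahole -C hugetlb_cgroup vmlinux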

Should we make these page_counters cacheline-aligned to reduce cacheline conflicts?
I would rather focus on a more effective mem_cgroup layout. It is very
likely that we are just stumbling over two counters here.

Could you try to add cache alignment of counters after memory and see
which one makes the difference? I do not expect memsw to be the one
because that one is used together with the main counter. But who knows,
maybe the way it crosses the cache line has exactly this effect. Hard to
tell without other numbers.
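
As a concrete sketch of that experiment (field names assumed from the
post-bd0b230fe1 include/linux/memcontrol.h; one counter would be annotated
at a time and the benchmark re-run):

struct mem_cgroup {
	...
	/* Accounted resources */
	struct page_counter memory;		/* Both v1 & v2 */

	union {
		struct page_counter swap;	/* v2 only */
		struct page_counter memsw;	/* v1 only */
	};

	/* Legacy consumer-oriented counters */
	struct page_counter kmem ____cacheline_aligned_in_smp;	/* candidate 1 */
	struct page_counter tcpmem;	/* candidate 2: move the annotation here instead */
	...
};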

Btw. it would be great to see what the effect is on cgroup v2 as well.

Thanks for pursuing this!

The contention may be in the page counters themselves, or it may be in other
fields below the page counters. The cacheline alignment will cause
"high_work", which sits just after the page counters, to start at a
cacheline boundary. I will try removing the cacheline alignment from the
page counter and adding it to high_work instead, to see whether there is any
change in performance. If there is no change, the performance problem is not
in the page counters themselves.
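
Roughly, that experiment would look like the sketch below (struct
page_counter keeps its default alignment, and the first field after the
counters in struct mem_cgroup is pushed to a cacheline boundary instead;
field names assumed from the post-bd0b230fe1 include/linux/memcontrol.h):

struct mem_cgroup {
	...
	struct page_counter memory;
	union {
		struct page_counter swap;
		struct page_counter memsw;
	};
	struct page_counter kmem;
	struct page_counter tcpmem;

	/* start everything after the counters on a fresh cache line */
	struct work_struct high_work ____cacheline_aligned_in_smp;
	...
};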

Cheers,
Longman