Re: [PATCH 3/5] mm/memcg: Cache vmstat data in percpu memcg_stock_pcp

From: Waiman Long
Date: Mon Apr 12 2021 - 15:30:20 EST


On 4/12/21 2:22 PM, Roman Gushchin wrote:
On Fri, Apr 09, 2021 at 07:18:40PM -0400, Waiman Long wrote:
Before the new slab memory controller with per object byte charging,
charging and vmstat data update happen only when new slab pages are
allocated or freed. Now they are done with every kmem_cache_alloc()
and kmem_cache_free(). This causes additional overhead for workloads
that generate a lot of alloc and free calls.

The memcg_stock_pcp is used to cache byte charge for a specific
obj_cgroup to reduce that overhead. To further reducing it, this patch
makes the vmstat data cached in the memcg_stock_pcp structure as well
until it accumulates a page size worth of update or when other cached
data change.
The idea makes total sense to me and also gives a hope to remove
byte-sized vmstats in the long-term.

On a 2-socket Cascade Lake server with instrumentation enabled and this
patch applied, it was found that about 17% (946796 out of 5515184) of the
time when __mod_obj_stock_state() is called leads to an actual call to
mod_objcg_state() after initial boot. When doing parallel kernel build,
the figure was about 16% (21894614 out of 139780628). So caching the
vmstat data reduces the number of calls to mod_objcg_state() by more
than 80%.

Signed-off-by: Waiman Long <longman@xxxxxxxxxx>
---
mm/memcontrol.c | 78 +++++++++++++++++++++++++++++++++++++++++++------
mm/slab.h | 26 +++++++----------
2 files changed, 79 insertions(+), 25 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b19100c68aa0..539c3b632e47 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2220,7 +2220,10 @@ struct memcg_stock_pcp {
#ifdef CONFIG_MEMCG_KMEM
struct obj_cgroup *cached_objcg;
+ struct pglist_data *cached_pgdat;
unsigned int nr_bytes;
+ int vmstat_idx;
+ int vmstat_bytes;
#endif
Because vmstat_idx can realistically take only 3 values (slab_reclaimable,
slab_unreclaimable and percpu), I wonder if it's better to have
vmstat_bytes[3] and save a bit more on the reduced number of flushes?
It must be an often case when a complex (reclaimable) kernel object has
non-reclaimable parts (e.g. kmallocs) or percpu counters.
If the difference will be too small, maybe the current form is better.

I have thought about that too. However, that will make the code more complex. So I decided to cache just one for now. We can certainly play around with caching more in a later patch.

Cheers,
Longman