Re: [PATCH v3 03/19] mm: memcg: convert vmstat slab counters to bytes

From: Roman Gushchin
Date: Wed May 20 2020 - 15:27:25 EST


On Wed, May 20, 2020 at 02:25:22PM +0200, Vlastimil Babka wrote:
> On 4/22/20 10:46 PM, Roman Gushchin wrote:
> > In order to prepare for per-object slab memory accounting, convert
> > NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.
> >
> > To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
> > NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).
> >
> > Internally global and per-node counters are stored in pages,
> > however memcg and lruvec counters are stored in bytes.
> > This scheme may look weird, but only for now. As soon as slab
> > pages are shared between multiple cgroups, global and
> > node counters will reflect the total number of slab pages.
> > However memcg and lruvec counters will be used for per-memcg
> > slab memory tracking, which will take individual kernel objects
> > into account. Keeping global and node counters in pages helps
> > to avoid additional overhead.
> >
> > The size of slab memory shouldn't exceed 4GB on 32-bit machines,
> > so it will fit into atomic_long_t we use for vmstats.
> >
> > Signed-off-by: Roman Gushchin <guro@xxxxxx>
> > ---
> > drivers/base/node.c | 4 ++--
> > fs/proc/meminfo.c | 4 ++--
> > include/linux/mmzone.h | 16 +++++++++++++---
> > kernel/power/snapshot.c | 2 +-
> > mm/memcontrol.c | 11 ++++-------
> > mm/oom_kill.c | 2 +-
> > mm/page_alloc.c | 8 ++++----
> > mm/slab.h | 15 ++++++++-------
> > mm/slab_common.c | 4 ++--
> > mm/slob.c | 12 ++++++------
> > mm/slub.c | 8 ++++----
> > mm/vmscan.c | 3 ++-
> > mm/workingset.c | 6 ++++--
> > 13 files changed, 53 insertions(+), 42 deletions(-)
>
>
> > @@ -206,7 +206,17 @@ enum node_stat_item {
> >
> > static __always_inline bool vmstat_item_in_bytes(enum node_stat_item item)
> > {
> > - return false;
> > + /*
> > + * Global and per-node slab counters track slab pages.
> > + * It's expected that changes are multiples of PAGE_SIZE.
> > + * Internally values are stored in pages.
> > + *
> > + * Per-memcg and per-lruvec counters track memory, consumed
> > + * by individual slab objects. These counters are actually
> > + * byte-precise.
> > + */
> > + return (item == NR_SLAB_RECLAIMABLE_B ||
> > + item == NR_SLAB_UNRECLAIMABLE_B);

Hello, Vlastimil!

Thank you for looking into the patchset, appreciate it.
In the next version I'll add some comments based on the suggestions in your
earlier emails.

> > }
>
> Ok, so this is no longer a no-op, but __always_inline here and inline in
> global_node_page_state() should hopefully mean that for all users of
> global_node_page_state(<constant>) the compiler will eliminate the branch for
> non-slab counters. But there are also functions such as si_mem_available() that
> use a non-constant item. Maybe the compiler is smart enough anyway, but perhaps
> it's better to use global_node_page_state_pages() in such callers?

I'll take a look, thanks for the idea.
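
To make the suggestion above concrete, here is a small userspace sketch, not
kernel code. All sketch_* names are hypothetical, and the shape of the
converting reader is an assumption about how such a helper could look. The
point is only that a caller looping over page-counted items with a run-time
index can use the raw "pages" reader and skip the per-item unit check.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SHIFT 12	/* assumed 4 KiB pages */

/* Hypothetical stand-ins for a few node_stat_item values. */
enum sketch_item {
	SKETCH_ACTIVE_FILE,
	SKETCH_INACTIVE_FILE,
	SKETCH_SLAB_RECLAIMABLE_B,
	SKETCH_NR_ITEMS
};

/* Global counters: always stored in pages in this sketch. */
static unsigned long sketch_counters[SKETCH_NR_ITEMS];

static inline bool sketch_item_in_bytes(enum sketch_item item)
{
	return item == SKETCH_SLAB_RECLAIMABLE_B;
}

/* Raw reader: returns the stored value in pages, no unit check. */
static inline unsigned long sketch_state_pages(enum sketch_item item)
{
	return sketch_counters[item];
}

/* Converting reader: byte-named items are reported in bytes. */
static inline unsigned long sketch_state(enum sketch_item item)
{
	unsigned long x = sketch_state_pages(item);

	if (sketch_item_in_bytes(item))
		return x << SKETCH_PAGE_SHIFT;
	return x;
}

/* Run-time index: the raw reader avoids a branch on every iteration. */
static unsigned long sketch_sum_file_pages(void)
{
	unsigned long sum = 0;
	int i;

	for (i = SKETCH_ACTIVE_FILE; i <= SKETCH_INACTIVE_FILE; i++)
		sum += sketch_state_pages((enum sketch_item)i);
	return sum;
}

static void sketch_demo(void)
{
	sketch_counters[SKETCH_ACTIVE_FILE] = 10;
	sketch_counters[SKETCH_INACTIVE_FILE] = 20;
	sketch_counters[SKETCH_SLAB_RECLAIMABLE_B] = 3;

	assert(sketch_sum_file_pages() == 30);
	assert(sketch_state(SKETCH_ACTIVE_FILE) == 10);
	assert(sketch_state(SKETCH_SLAB_RECLAIMABLE_B) == (3UL << SKETCH_PAGE_SHIFT));
}
```

With a constant argument the compiler can fold the unit check in
sketch_state() away; the loop above is the case where it cannot be relied on.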

>
> However __mod_node_page_state() and mod_node_state() will now always branch. I
> wonder if the "API clean" goal is worth it...

Do you mean adding a special write-side helper that performs the conversion,
and putting a VM_WARN_ON_ONCE() into the generic write-side helpers?
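
A userspace sketch of that idea, for illustration only. The sketch_* names
are hypothetical, the page size is assumed to be 4 KiB, and a plain flag
stands in for VM_WARN_ON_ONCE(). The generic helper does no conversion and
only carries a debug check; the byte-aware helper is the single place that
converts.

```c
#include <assert.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096L	/* assumed page size */

/* Hypothetical stand-ins for a few node_stat_item values. */
enum sketch_item {
	SKETCH_FILE_PAGES,
	SKETCH_SLAB_RECLAIMABLE_B,
	SKETCH_SLAB_UNRECLAIMABLE_B,
	SKETCH_NR_ITEMS
};

static long sketch_node_stat[SKETCH_NR_ITEMS];	/* stored in pages */
static int sketch_warned;			/* models a one-shot warning */

static bool sketch_item_in_bytes(enum sketch_item item)
{
	return item == SKETCH_SLAB_RECLAIMABLE_B ||
	       item == SKETCH_SLAB_UNRECLAIMABLE_B;
}

/*
 * Generic write side: the delta is taken as-is. The check models a
 * debug-only warning and would compile out in a non-debug build.
 */
static void sketch_mod_node_state(enum sketch_item item, long delta)
{
	assert(item < SKETCH_NR_ITEMS);
	if (sketch_item_in_bytes(item))
		sketch_warned = 1;
	sketch_node_stat[item] += delta;
}

/* Special write side: takes bytes, stores pages. */
static void sketch_mod_node_state_bytes(enum sketch_item item, long bytes)
{
	assert(sketch_item_in_bytes(item));
	/* Changes are expected to be multiples of the page size. */
	assert(bytes % SKETCH_PAGE_SIZE == 0);
	sketch_node_stat[item] += bytes / SKETCH_PAGE_SIZE;
}
```

The cost of the unit check then moves out of the hot generic path and into
the few slab call sites that actually pass byte values.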

>
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1409,9 +1409,8 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> > (u64)memcg_page_state(memcg, MEMCG_KERNEL_STACK_KB) *
> > 1024);
> > seq_buf_printf(&s, "slab %llu\n",
> > - (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) +
> > - memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE)) *
> > - PAGE_SIZE);
> > + (u64)(memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B) +
> > + memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B)));
> > seq_buf_printf(&s, "sock %llu\n",
> > (u64)memcg_page_state(memcg, MEMCG_SOCK) *
> > PAGE_SIZE);
> > @@ -1445,11 +1444,9 @@ static char *memory_stat_format(struct mem_cgroup *memcg)
> > PAGE_SIZE);
> >
> > seq_buf_printf(&s, "slab_reclaimable %llu\n",
> > - (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE) *
> > - PAGE_SIZE);
> > + (u64)memcg_page_state(memcg, NR_SLAB_RECLAIMABLE_B));
> > seq_buf_printf(&s, "slab_unreclaimable %llu\n",
> > - (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE) *
> > - PAGE_SIZE);
> > + (u64)memcg_page_state(memcg, NR_SLAB_UNRECLAIMABLE_B));
>
> So here we are now printing in bytes instead of pages, right? That's fine for
> OOM report, but in sysfs aren't we breaking existing users?
>

Hm, but it was already in bytes before: look at the x * PAGE_SIZE in the
removed lines. Or do you mean that the value may no longer be a multiple of
PAGE_SIZE? If so, I don't think it breaks any expectations.
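
As a small illustration of the difference (userspace sketch, hypothetical
sketch_* names, 4 KiB pages assumed, object size picked arbitrarily): both
schemes export bytes, but only the old one is guaranteed to be a multiple of
the page size.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumed page size */

/* Old scheme: the memcg counter holds pages and is scaled on output. */
static uint64_t sketch_slab_bytes_old(uint64_t nr_pages)
{
	return nr_pages * SKETCH_PAGE_SIZE;
}

/* New scheme: the memcg counter already holds bytes, summed per object. */
static uint64_t sketch_slab_bytes_new(uint64_t nr_objects, uint64_t obj_size)
{
	return nr_objects * obj_size;
}

static void sketch_demo(void)
{
	/* 3 slab pages charged as whole pages. */
	uint64_t old_val = sketch_slab_bytes_old(3);
	/* 100 objects of 192 bytes each, charged individually. */
	uint64_t new_val = sketch_slab_bytes_new(100, 192);

	assert(old_val == 12288);
	assert(old_val % SKETCH_PAGE_SIZE == 0);
	assert(new_val == 19200);
	assert(new_val % SKETCH_PAGE_SIZE != 0);
}
```

The unit seen by readers of the file stays the same; only the granularity
changes.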

Thanks!