Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API

From: Roman Gushchin
Date: Thu May 07 2020 - 18:26:48 EST

Next message: Roman Gushchin: "Re: [PATCH] mm: vmscan: consistent update to pgsteal and pgscan"
Previous message: Casey Schaufler: "Re: [PATCH 1/1] Smack:- Remove redundant inode_smack cache since now new cache is being used for inode_smack structure allocations"
In reply to: Johannes Weiner: "Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > }
> >
> > #ifdef CONFIG_MEMCG_KMEM
> > +extern spinlock_t css_set_lock;
> > +
> > +static void obj_cgroup_release(struct percpu_ref *ref)
> > +{
> > + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > + struct mem_cgroup *memcg;
> > + unsigned int nr_bytes;
> > + unsigned int nr_pages;
> > + unsigned long flags;
> > +
> > + nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > + WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > + nr_pages = nr_bytes >> PAGE_SHIFT;
>
> What guarantees that we don't have a partial page in there at this
> point? I guess any outstanding allocations would pin the objcg, so
> when it's released all objects have been freed.

Right, this is exactly the reason why there can't be a partial page
at this point.

>
> But if that's true, how can we have full pages remaining in there now?

Imagine the following sequence:
1) CPU0: objcg == stock->cached_objcg
2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged
3) CPU1: a process from another memcg is allocating something, stock if flushed,
objcg->nr_charged_bytes = PAGE_SIZE - 92
5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes
6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes

In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged
in obj_cgroup_release().

I've double checked this, it's actually pretty easy to trigger in the real life.

>
> > @@ -2723,6 +2820,30 @@ struct mem_cgroup *mem_cgroup_from_obj(void *p)
> > return page->mem_cgroup;
> > }
> >
> > +__always_inline struct obj_cgroup *get_obj_cgroup_from_current(void)
> > +{
> > + struct obj_cgroup *objcg = NULL;
> > + struct mem_cgroup *memcg;
> > +
> > + if (unlikely(!current->mm))
> > + return NULL;
> > +
> > + rcu_read_lock();
> > + if (unlikely(current->active_memcg))
> > + memcg = rcu_dereference(current->active_memcg);
> > + else
> > + memcg = mem_cgroup_from_task(current);
> > +
> > + for (; memcg != root_mem_cgroup; memcg = parent_mem_cgroup(memcg)) {
> > + objcg = rcu_dereference(memcg->objcg);
> > + if (objcg && obj_cgroup_tryget(objcg))
> > + break;
> > + }
> > + rcu_read_unlock();
> > +
> > + return objcg;
> > +}
>
> Thanks for moving this here from one of the later patches, it helps
> understanding the life cycle of obj_cgroup better.
>
> > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > +{
> > + struct mem_cgroup *memcg;
> > + unsigned int nr_pages, nr_bytes;
> > + int ret;
> > +
> > + if (consume_obj_stock(objcg, size))
> > + return 0;
> > +
> > + rcu_read_lock();
> > + memcg = obj_cgroup_memcg(objcg);
> > + css_get(&memcg->css);
> > + rcu_read_unlock();
> > +
> > + nr_pages = size >> PAGE_SHIFT;
> > + nr_bytes = size & (PAGE_SIZE - 1);
> > +
> > + if (nr_bytes)
> > + nr_pages += 1;
> > +
> > + ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
>
> If consume_obj_stock() fails because some other memcg is cached,
> should this try to consume the partial page in objcg->nr_charged_bytes
> before getting more pages?

We can definitely do it, but I'm not sure if it's good for the performance.

Dealing with nr_charged_bytes will require up to two atomic writes,
so calling __memcg_kmem_charge() can be faster if memcg is cached
on percpu stock.

>
> > + if (!ret && nr_bytes)
> > + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
>
> This will put the cgroup into the cache if the allocation resulted in
> a partially free page.
>
> But if this was a page allocation, we may have objcg->nr_cache_bytes
> from a previous subpage allocation that we should probably put back
> into the stock.

Yeah, we can do this, but I don't know if there will be any benefits.

Actually we don't wanna to touch objcg->nr_cache_bytes too often, as
it can become a contention point if there are many threads allocating
in the memory cgroup.

So maybe we want to do the opposite: relax it a bit and stop flushing
it on every stock refill and flush only if it exceeds a certain value.

>
> It's not a common case, I'm just trying to understand what
> objcg->nr_cache_bytes holds and when it does so.

So it's actually a centralized leftover from the rounding of the actual
charge to the page size.

>
> The rest of this patch looks good to me!

Great!

Thank you very much for the review!

Next message: Roman Gushchin: "Re: [PATCH] mm: vmscan: consistent update to pgsteal and pgscan"
Previous message: Casey Schaufler: "Re: [PATCH 1/1] Smack:- Remove redundant inode_smack cache since now new cache is being used for inode_smack structure allocations"
In reply to: Johannes Weiner: "Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]