Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API

From: Roman Gushchin
Date: Fri May 15 2020 - 18:02:08 EST

Next message: Ian Rogers: "Re: [PATCH v2 7/7] perf expr: Migrate expr ids table to a hashmap"
Previous message: Stephane Eranian: "[PATCH 3/3] perf/x86/rapl: add AMD Fam17h RAPL support"
In reply to: Johannes Weiner: "Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Tue, May 12, 2020 at 06:56:45PM -0400, Johannes Weiner wrote:
> On Thu, May 07, 2020 at 03:26:31PM -0700, Roman Gushchin wrote:
> > On Thu, May 07, 2020 at 05:03:14PM -0400, Johannes Weiner wrote:
> > > On Wed, Apr 22, 2020 at 01:46:55PM -0700, Roman Gushchin wrote:
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -257,6 +257,78 @@ struct cgroup_subsys_state *vmpressure_to_css(struct vmpressure *vmpr)
> > > > }
> > > >
> > > > #ifdef CONFIG_MEMCG_KMEM
> > > > +extern spinlock_t css_set_lock;
> > > > +
> > > > +static void obj_cgroup_release(struct percpu_ref *ref)
> > > > +{
> > > > + struct obj_cgroup *objcg = container_of(ref, struct obj_cgroup, refcnt);
> > > > + struct mem_cgroup *memcg;
> > > > + unsigned int nr_bytes;
> > > > + unsigned int nr_pages;
> > > > + unsigned long flags;
> > > > +
> > > > + nr_bytes = atomic_read(&objcg->nr_charged_bytes);
> > > > + WARN_ON_ONCE(nr_bytes & (PAGE_SIZE - 1));
> > > > + nr_pages = nr_bytes >> PAGE_SHIFT;
> > >
> > > What guarantees that we don't have a partial page in there at this
> > > point? I guess any outstanding allocations would pin the objcg, so
> > > when it's released all objects have been freed.
> >
> > Right, this is exactly the reason why there can't be a partial page
> > at this point.
> >
> > >
> > > But if that's true, how can we have full pages remaining in there now?
> >
> > Imagine the following sequence:
> > 1) CPU0: objcg == stock->cached_objcg
> > 2) CPU1: we do a small allocation (e.g. 92 bytes), page is charged
> > 3) CPU1: a process from another memcg is allocating something, stock if flushed,
> > objcg->nr_charged_bytes = PAGE_SIZE - 92
> > 5) CPU0: we do release this object, 92 bytes are added to stock->nr_bytes
> > 6) CPU0: stock is flushed, 92 bytes are added to objcg->nr_charged_bytes
> >
> > In the result, nr_charged_bytes == PAGE_SIZE. This PAGE will be uncharged
> > in obj_cgroup_release().
> >
> > I've double checked this, it's actually pretty easy to trigger in the real life.
>
> Ah, so no outstanding allocations, but a full page split between the
> percpu cache and objcg->nr_charged_bytes.
>
> Would it simplify things if refill_obj_stock() drained on >= PAGE_SIZE
> stock instead of > PAGE_SIZE?
>
> Otherwise, the scenario above would be good to have as a comment as
> the drain on release is not self-explanatory.
>
> > > > +int obj_cgroup_charge(struct obj_cgroup *objcg, gfp_t gfp, size_t size)
> > > > +{
> > > > + struct mem_cgroup *memcg;
> > > > + unsigned int nr_pages, nr_bytes;
> > > > + int ret;
> > > > +
> > > > + if (consume_obj_stock(objcg, size))
> > > > + return 0;
> > > > +
> > > > + rcu_read_lock();
> > > > + memcg = obj_cgroup_memcg(objcg);
> > > > + css_get(&memcg->css);
> > > > + rcu_read_unlock();
> > > > +
> > > > + nr_pages = size >> PAGE_SHIFT;
> > > > + nr_bytes = size & (PAGE_SIZE - 1);
> > > > +
> > > > + if (nr_bytes)
> > > > + nr_pages += 1;
> > > > +
> > > > + ret = __memcg_kmem_charge(memcg, gfp, nr_pages);
> > >
> > > If consume_obj_stock() fails because some other memcg is cached,
> > > should this try to consume the partial page in objcg->nr_charged_bytes
> > > before getting more pages?
> >
> > We can definitely do it, but I'm not sure if it's good for the performance.
> >
> > Dealing with nr_charged_bytes will require up to two atomic writes,
> > so calling __memcg_kmem_charge() can be faster if memcg is cached
> > on percpu stock.
>
> Hm, but it's the slowpath. And sooner or later somebody has to deal
> with the remaining memory in there.
>
> > > > + if (!ret && nr_bytes)
> > > > + refill_obj_stock(objcg, PAGE_SIZE - nr_bytes);
> > >
> > > This will put the cgroup into the cache if the allocation resulted in
> > > a partially free page.
> > >
> > > But if this was a page allocation, we may have objcg->nr_cache_bytes
> > > from a previous subpage allocation that we should probably put back
> > > into the stock.
> >
> > Yeah, we can do this, but I don't know if there will be any benefits.
>
> It's mostly about understanding the code.
>
> > Actually we don't wanna to touch objcg->nr_cache_bytes too often, as
> > it can become a contention point if there are many threads allocating
> > in the memory cgroup.
> >
> > So maybe we want to do the opposite: relax it a bit and stop flushing
> > it on every stock refill and flush only if it exceeds a certain value.
>
> That could be useful, yes.
>
> > > It's not a common case, I'm just trying to understand what
> > > objcg->nr_cache_bytes holds and when it does so.
> >
> > So it's actually a centralized leftover from the rounding of the actual
> > charge to the page size.
>
> It would be good to add code comments explaining this.

Thanks you for comments! Please, find the updated version with some added
explanations below. No functional changes, just added some comments here
and there. Thanks!

--

Next message: Ian Rogers: "Re: [PATCH v2 7/7] perf expr: Migrate expr ids table to a hashmap"
Previous message: Stephane Eranian: "[PATCH 3/3] perf/x86/rapl: add AMD Fam17h RAPL support"
In reply to: Johannes Weiner: "Re: [PATCH v3 06/19] mm: memcg/slab: obj_cgroup API"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]