Re: [QUESTION] What memcg lifetime is required by list_lru_add?
From: Alice Ryhl
Date: Thu Nov 28 2024 - 07:27:28 EST
On Wed, Nov 27, 2024 at 11:05 PM Dave Chinner <david@xxxxxxxxxxxxx> wrote:
>
> On Wed, Nov 27, 2024 at 10:04:51PM +0100, Alice Ryhl wrote:
> > Dear SHRINKER and MEMCG experts,
> >
> > When using list_lru_add() and list_lru_del(), it seems to be required
> > that you pass the same value of nid and memcg to both calls, since
> > list_lru_del() might otherwise try to delete it from the wrong list /
> > delete it while holding the wrong spinlock. I'm trying to understand
> > the implications of this requirement on the lifetime of the memcg.
> >
> > Now, looking at list_lru_add_obj() I noticed that it uses rcu locking
> > to keep the memcg object alive for the duration of list_lru_add().
> > That rcu locking is used here seems to imply that without it, the
> > memcg could be deallocated during the list_lru_add() call, which is of
> > course bad. But rcu is not enough on its own to keep the memcg alive
> > all the way until the list_lru_del_obj() call, so how does it ensure
> > that the memcg stays valid for that long?
>
> We don't care if the memcg goes away whilst there are objects on the
> LRU. memcg destruction will reparent the objects to a different
> memcg via memcg_reparent_list_lrus() before the memcg is torn down.
> New objects should not be added to the memcg LRUs once the memcg
> teardown process starts, so there should never be add vs reparent
> races during teardown.
>
> Hence all the list_lru_add_obj() function needs to do is ensure that
> the locking/lifecycle rules for the memcg object that
> mem_cgroup_from_slab_obj() returns are obeyed.
>
> > And if there is a mechanism
> > to keep the memcg alive for the entire duration between add and del,
>
> It's enforced by the -complex- state machine used to tear down
> control groups.
>
> tl;dr: If the memcg gets torn down, it will reparent the objects on
> the LRU to its parent memcg during the teardown process.
>
> This reparenting happens in the cgroup ->css_offline() method, which
> only happens after the cgroup reference count goes to zero and is
> waited on via:
>
> kill_css
>   percpu_ref_kill_and_confirm(css_killed_ref_fn)
>   <wait>
>   css_killed_ref_fn
>     offline_css
>       mem_cgroup_css_offline
>         memcg_offline_kmem
>         {
>                 .....
>                 memcg_reparent_objcgs(memcg, parent);
>
>                 /*
>                  * After we have finished memcg_reparent_objcgs(), all list_lrus
>                  * corresponding to this cgroup are guaranteed to remain empty.
>                  * The ordering is imposed by list_lru_node->lock taken by
>                  * memcg_reparent_list_lrus().
>                  */
>                 memcg_reparent_list_lrus(memcg, parent)
>         }
>
> The cgroup teardown control code then schedules the freeing
> of the memcg container via an RCU work callback when the reference
> count is globally visible as killed and the reference count has gone
> to zero.
>
> Hence the cgroup infrastructure requires RCU protection for the
> duration of unreferenced cgroup object accesses. This allows
> subsystems to perform operations on the cgroup object without
> needing to hold a cgroup reference for every access. The complex,
> multi-stage teardown process allows a cgroup to release the
> objects that it tracks, avoiding the need for every object the
> cgroup tracks to hold a reference count on the cgroup.
>
> See the comment above css_free_rwork_fn() for more details about the
> teardown process:
>
> /*
>  * css destruction is four-stage process.
>  *
>  * 1. Destruction starts. Killing of the percpu_ref is initiated.
>  *    Implemented in kill_css().
>  *
>  * 2. When the percpu_ref is confirmed to be visible as killed on all CPUs
>  *    and thus css_tryget_online() is guaranteed to fail, the css can be
>  *    offlined by invoking offline_css(). After offlining, the base ref is
>  *    put. Implemented in css_killed_work_fn().
>  *
>  * 3. When the percpu_ref reaches zero, the only possible remaining
>  *    accessors are inside RCU read sections. css_release() schedules the
>  *    RCU callback.
>  *
>  * 4. After the grace period, the css can be freed. Implemented in
>  *    css_free_rwork_fn().
>  *
>  * It is actually hairier because both step 2 and 4 require process context
>  * and thus involve punting to css->destroy_work adding two additional
>  * steps to the already complex sequence.
>  */
Thanks a lot Dave, this clears it up for me.
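
The reparenting described above can be modelled in userspace roughly as
follows. This is only a sketch under stated assumptions, not kernel code:
every name in it (struct cg, cg_live_owner(), model_lru_add() and so on) is
hypothetical, there is no locking or RCU, and the rule that a lookup on a
dead group falls through to the nearest live ancestor is an assumption of
the model rather than a statement about any particular kernel version. It
only illustrates why a delete can still find the object after the group it
was added under has been offlined.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace model only: all names here are hypothetical, not kernel APIs. */
struct item {
	struct item *prev, *next;
};

struct cg {
	struct cg *parent;	/* NULL for the root group */
	int dead;		/* set once this group's list was reparented */
	struct item head;	/* sentinel of a circular doubly linked list */
	long nr_items;
};

static void cg_init(struct cg *cg, struct cg *parent)
{
	cg->parent = parent;
	cg->dead = 0;
	cg->head.prev = cg->head.next = &cg->head;
	cg->nr_items = 0;
}

/* Model assumption: a dead group resolves to its nearest live ancestor. */
static struct cg *cg_live_owner(struct cg *cg)
{
	while (cg->dead && cg->parent)
		cg = cg->parent;
	return cg;
}

/* Append the item to the tail of the owning group's list. */
static void model_lru_add(struct cg *cg, struct item *it)
{
	struct cg *owner = cg_live_owner(cg);

	it->prev = owner->head.prev;
	it->next = &owner->head;
	owner->head.prev->next = it;
	owner->head.prev = it;
	owner->nr_items++;
}

/* Unlink the item; the caller passes the same group it used for the add. */
static void model_lru_del(struct cg *cg, struct item *it)
{
	struct cg *owner = cg_live_owner(cg);

	assert(owner->nr_items > 0);
	it->prev->next = it->next;
	it->next->prev = it->prev;
	it->prev = it->next = it;
	owner->nr_items--;
}

/* Offline step: splice every item onto the parent, then mark the list dead. */
static void model_reparent(struct cg *cg)
{
	struct cg *parent = cg->parent;

	if (!parent)
		return;		/* the root group is never reparented */
	if (cg->nr_items) {
		struct item *first = cg->head.next;
		struct item *last = cg->head.prev;

		first->prev = parent->head.prev;
		parent->head.prev->next = first;
		last->next = &parent->head;
		parent->head.prev = last;
		parent->nr_items += cg->nr_items;
	}
	cg->head.prev = cg->head.next = &cg->head;
	cg->nr_items = 0;
	cg->dead = 1;
}
```

In this model, adding an item under a child group, calling model_reparent()
on that child, and then calling model_lru_del() with the original child
pointer removes the item from the parent's list. The model needs the struct
cg to stay allocated for the lookup; in the kernel that part is what the RCU
read section around the access is for, and the ordering between add and
reparent comes from the per-node list lock, neither of which is modelled
here.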
I sent a patch containing some additional docs for list_lru:
https://lore.kernel.org/all/20241128-list_lru_memcg_docs-v1-1-7e4568978f4e@xxxxxxxxxx/
Alice