Re: [PATCH v2] cgroup: serialize css kill and release paths
From: Tadeusz Struk
Date: Tue Jun 07 2022 - 17:56:15 EST
On 6/6/22 05:39, Michal Koutný wrote:
On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk<tadeusz.struk@xxxxxxxxxx> wrote:
In such scenario the css_killed_work_fn will be en-queued via
cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
css_release_work_fn for the same css instance, causing a list_add
corruption bug, as can be seen in the syzkaller report [1].
This hypothesis doesn't add up to me (I am sorry).
The kill_css(css) would be a css associated with a subsys (css.ss !=
NULL) whereas css_put(&cgrp->self) is a different css just for the
cgroup (css.ss == NULL).
Yes, you are right. I couldn't figure it out where the extra css_put()
is called from, and the only place that fitted into my theory was from
the cgroup_kn_unlock() in cgroup_apply_control_disable().
After some more debugging I can see that, as you said, the cgrp->self
is a different css. The offending _put() is actually called by the
percpu_ref_kill_and_confirm(), as it not only calls the passed confirm_kill
percpu_ref_func_t, but also it puts the refcnt iself.
Because the cgroup_apply_control_disable() will loop for_each_live_descendant,
and call css_kill() on all css'es, and css_killed_work_fn() will also loop
and call css_put() on all parents, the css_release() will be called on the
first parent prematurely, causing the BUG(). What I think should be done
to balance put/get is to call css_get() for all the parents in kill_css():
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
*/
static void kill_css(struct cgroup_subsys_state *css)
{
+ struct cgroup_subsys_state *_css = css;
+
lockdep_assert_held(&cgroup_mutex);
if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
css_clear_dir(css);
/*
- * Killing would put the base ref, but we need to keep it alive
- * until after ->css_offline().
+ * Killing would put the base ref, but we need to keep it alive,
+ * and all its parents, until after ->css_offline().
*/
- css_get(css);
+ do {
+ css_get(_css);
+ _css = _css->parent;
+ } while (_css && atomic_read(&_css->online_cnt));
/*
* cgroup core guarantees that, by the time ->css_offline() is
This will be then "reverted" in css_killed_work_fn()
Please let me know if it makes sense to you.
I'm still testing it, but syzbot is very slow today.
--
Thanks,
Tadeusz