Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

From: Tadeusz Struk
Date: Wed Jun 01 2022 - 19:37:24 EST


On 6/1/22 16:20, Tejun Heo wrote:
On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
// ref=A: initial state
kill_css()
css_get // ref+=F == A+F: fuse
percpu_ref_kill_and_confirm
__percpu_ref_switch_to_atomic
percpu_ref_get
// ref += 1 == A+F+1: atomic mode, self-protection
percpu_ref_put
// ref -= 1 == A+F: kill the base reference
[via rcu]
percpu_ref_switch_to_atomic_rcu
percpu_ref_call_confirm_rcu
css_killed_ref_fn == refcnt.confirm_switch
queue_work(css->destroy_work) (1)
[via css->destroy_work]
css_killed_work_fn == wq.func
offline_css() // needs fuse
css_put // ref -= F == A: de-fuse
percpu_ref_put
// ref -= 1 == A-1: remove self-protection
css_release // A <= 1 -> 2nd queue_work explodes!

I'm not sure I'm following it but it's perfectly fine to re-use the work
item at this point. The work item actually can be re-cycled from the very
beginning of the work function. The only thing we need to make sure is that
we don't css_put() prematurely to avoid it being freed while we're using it.

Yes, it is ok to reuse a work struct, but it's not ok to have the same
work struct enqueued twice on the same WQ when list debug is enabled.
That's why we are getting this "BUG: corrupted list.."

The above scenario isn't that tho. Once the work item starts executing, wq
doesn't care about what happens to it and as killed_work_fn is holding a
reference, the release scheduling shouldn't happen before it starts
executing unless somebody is screwing up the refcnting.

That's right. Michal was on the right track for the kill_css() part.
What I think is going on is that once css_create() fails then
cgroup_subtree_control_write() ends up calling first kill_css() and
then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
The each_live_descendant_post() also iterates on the root.
Here is the call flow (sorry for long lines):

cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
|
|-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
|
| css_killed_ref_fn() <- first css->destroy_work enqueue
| |
| |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
|
|
|-> goto out_unlock;
| |
| |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
|
|
css_release(percpu_ref) <- second css->destroy_work enqueue
|
|-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.


What seems to work for me as the simplest fix is to prevent enqueuing a dying
css in css_release() as below. Please let me know if that makes sense to you.

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..5618211487cc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);
- INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ if (!(css->flags & CSS_DYING)) {
+ INIT_WORK(&css->destroy_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
+ }

When the problem is ref imbalance, how can above be the solution? Of course
release path won't cause an issue if they don't run, but we still need to
free the thing, right?

Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
on it. Only css_get() & css_put() modify the refcnt value.
And for the "free the thing" the css_killed_work_fn() does that.
It calls offline_css(css) and css_put(css) for the whole css hierarchy.

--
Thanks,
Tadeusz