Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

From: Tejun Heo
Date: Wed Jun 01 2022 - 19:21:02 EST


On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
> > On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> > > // ref=A: initial state
> > > kill_css()
> > >   css_get                              // ref+=F == A+F: fuse
> > >   percpu_ref_kill_and_confirm
> > >     __percpu_ref_switch_to_atomic
> > >       percpu_ref_get
> > >       // ref += 1 == A+F+1: atomic mode, self-protection
> > >     percpu_ref_put
> > >     // ref -= 1 == A+F: kill the base reference
> > >
> > > [via rcu]
> > > percpu_ref_switch_to_atomic_rcu
> > >   percpu_ref_call_confirm_rcu
> > >     css_killed_ref_fn == refcnt.confirm_switch
> > >       queue_work(css->destroy_work)    (1)
> > >
> > > [via css->destroy_work]
> > > css_killed_work_fn == wq.func
> > >   offline_css()                        // needs fuse
> > >   css_put                              // ref -= F == A: de-fuse
> > >
> > > percpu_ref_put
> > >   // ref -= 1 == A-1: remove self-protection
> > >   css_release                          // A <= 1 -> 2nd queue_work explodes!
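
(Side note: the pairing the trace above walks through is just the stock
percpu_ref kill/confirm usage - the release callback only fires once the
count really hits zero, while the confirm callback fires after the ref has
switched to atomic mode. A minimal sketch with made-up my_obj/my_* names,
not the cgroup code, and simplified - real users usually defer the final
free through RCU:

#include <linux/percpu-refcount.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct my_obj {
        struct percpu_ref refcnt;
        struct work_struct destroy_work;
};

/* runs from the workqueue, still owning the extra "fuse" reference */
static void my_destroy_workfn(struct work_struct *work)
{
        struct my_obj *obj = container_of(work, struct my_obj, destroy_work);

        /* ... offline / teardown here (cf. offline_css) ... */

        percpu_ref_put(&obj->refcnt);   /* drop the fuse, may trigger release */
}

/* confirm_kill: called once the ref has switched to atomic mode */
static void my_confirm_kill(struct percpu_ref *ref)
{
        struct my_obj *obj = container_of(ref, struct my_obj, refcnt);

        INIT_WORK(&obj->destroy_work, my_destroy_workfn);
        queue_work(system_wq, &obj->destroy_work);
}

/* release: called only when the count actually drops to zero */
static void my_release(struct percpu_ref *ref)
{
        struct my_obj *obj = container_of(ref, struct my_obj, refcnt);

        percpu_ref_exit(&obj->refcnt);
        kfree(obj);
}

static struct my_obj *my_create(void)
{
        struct my_obj *obj = kzalloc(sizeof(*obj), GFP_KERNEL);

        if (!obj)
                return NULL;
        if (percpu_ref_init(&obj->refcnt, my_release, 0, GFP_KERNEL)) {
                kfree(obj);
                return NULL;
        }
        return obj;
}

static void my_kill(struct my_obj *obj)
{
        percpu_ref_get(&obj->refcnt);   /* "fuse": keeps release away until teardown ran */
        percpu_ref_kill_and_confirm(&obj->refcnt, my_confirm_kill);
}

As long as the gets and puts are balanced, my_release() cannot run before
my_destroy_workfn() has dropped the fuse, which is the same property the
trace is checking for css_release().)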
> >
> > I'm not sure I'm following it, but it's perfectly fine to re-use the work
> > item at this point. The work item can actually be recycled from the very
> > beginning of the work function. The only thing we need to make sure of is
> > that we don't css_put() prematurely, so the css isn't freed while we're
> > still using it.
>
> Yes, it is ok to reuse a work struct, but it's not ok to have the same
> work struct enqueued twice on the same WQ when list debug is enabled.
> That's why we are getting this "BUG: corrupted list.."

The above scenario isn't that tho. Once the work item starts executing, the
wq doesn't care what happens to it, and as css_killed_work_fn() is holding a
reference, scheduling of the release work shouldn't be possible before the
killed work starts executing, unless somebody is screwing up the
refcounting.
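
To illustrate the ordering, here's a simplified sketch of the kill path (the
names follow kernel/cgroup/cgroup.c but the online_cnt / parent handling is
elided, so check the actual tree for details):

static void css_killed_work_fn(struct work_struct *work)
{
        struct cgroup_subsys_state *css =
                container_of(work, struct cgroup_subsys_state, destroy_work);

        /* once we're here, the wq is done with the item and it can be reused */
        offline_css(css);

        /*
         * Drop the reference taken in kill_css().  Only after this put
         * (plus whatever other references are gone) can the count hit
         * zero and css_release() queue destroy_work a second time.
         */
        css_put(css);
}

static void css_killed_ref_fn(struct percpu_ref *ref)
{
        struct cgroup_subsys_state *css =
                container_of(ref, struct cgroup_subsys_state, refcnt);

        /* first user of css->destroy_work: the offline/kill teardown */
        INIT_WORK(&css->destroy_work, css_killed_work_fn);
        queue_work(cgroup_destroy_wq, &css->destroy_work);
}

So if css_release() fires while css_killed_work_fn() hasn't even started,
something dropped a reference it didn't own.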

> That's right. Michal was on the right track for the kill_css() part.
> What I think is going on is that once css_create() fails,
> cgroup_subtree_control_write() ends up calling first kill_css() and then
> css_put() on the same css; I think it's the &cgrp->self of the
> kernfs_node. each_live_descendant_post() also iterates on the root.
> Here is the call flow (sorry for long lines):
>
> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
> |
> |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
> |
> | css_killed_ref_fn() <- first css->destroy_work enqueue
> | |
> | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
> |
> |
> |-> goto out_unlock;
> | |
> | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
> |
> |
> css_release(percpu_ref) <- second css->destroy_work enqueue
> |
> |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>
>
> What seems to work for me as the simplest fix is to prevent enqueuing a dying
> css in css_release() as below. Please let me know if that makes sense to you.
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1779ccddb734..5618211487cc 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>  	struct cgroup_subsys_state *css =
>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>
> -	INIT_WORK(&css->destroy_work, css_release_work_fn);
> -	queue_work(cgroup_destroy_wq, &css->destroy_work);
> +	if (!(css->flags & CSS_DYING)) {
> +		INIT_WORK(&css->destroy_work, css_release_work_fn);
> +		queue_work(cgroup_destroy_wq, &css->destroy_work);
> +	}

When the problem is a ref imbalance, how can the above be the solution? Of
course the release path won't cause an issue if it doesn't run, but we
still need to free the thing, right?
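
Just to spell out what would be skipped: the release work is the path that
eventually frees the css. Very roughly (heavily simplified, not the
verbatim kernel/cgroup/cgroup.c code):

static void css_release_work_fn(struct work_struct *work)
{
        struct cgroup_subsys_state *css =
                container_of(work, struct cgroup_subsys_state, destroy_work);

        /* unlink the css from its parent, drop IDs, notify the subsystem ... (elided) */

        /*
         * ... and hand the css off to the free path (the real code defers
         * the actual free through RCU from here).  If css_release() never
         * queues this work for a dying css, none of this runs and the css
         * is simply leaked.
         */
}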

Thanks.

--
tejun