Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

From: Tadeusz Struk
Date: Thu Jun 02 2022 - 10:28:34 EST

Next message: Santosh Shukla: "[PATCH 6/7] KVM: nSVM: implement nested VNMI"
Previous message: Santosh Shukla: "[PATCH 5/7] KVM: SVM: Add VNMI support in inject_nmi"
In reply to: Michal Koutný: "Re: [PATCH 2/2] cgroup: Use separate work structs on css release path"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 6/2/22 04:47, Michal Koutný wrote:

On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <tadeusz.struk@xxxxxxxxxx> wrote:

css_killed_ref_fn() will be called regardless of the value of refcnt (via percpu_ref_kill_and_confirm()),
and it will only enqueue css_killed_work_fn() to be called later.
Then css_put()->css_release() will be called before css_killed_work_fn() even
gets a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
The problem happens on the second enqueue, because both jobs are queued through the same
css->destroy_work. So there needs to be something in place that will make sure that
css_killed_work_fn() is done before css_release() can enqueue the second job.
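To make the double-enqueue problem concrete, here is a small single-threaded userspace model. It is a sketch under stated assumptions, not kernel code: `struct work_model`, `init_work_model()` and `queue_work_model()` are hypothetical stand-ins for `struct work_struct`, `INIT_WORK()` and `queue_work()`, and `list_add_tail_checked()` only approximates the CONFIG_DEBUG_LIST sanity checks. It assumes that each path re-initializes the shared work item before queuing it, so the pending state of the first, still-queued job is forgotten and the second enqueue tries to link an entry that is already on the queue.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal doubly linked list in the style of the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *l)
{
	l->next = l;
	l->prev = l;
}

/*
 * Adds @entry at the tail of @head after a sanity check in the spirit of
 * CONFIG_DEBUG_LIST: returns false (and links nothing) if the current tail
 * does not point back at @head, or if @entry is @head itself or already
 * the tail entry.
 */
static bool list_add_tail_checked(struct list_head *entry, struct list_head *head)
{
	struct list_head *prev = head->prev;

	if (prev->next != head || entry == prev || entry == head)
		return false;
	entry->next = head;
	entry->prev = prev;
	prev->next = entry;
	head->prev = entry;
	return true;
}

/* Hypothetical stand-in for struct work_struct. */
struct work_model {
	struct list_head entry;
	bool pending;
	int fn_id;		/* which work function would run */
};

/* Like INIT_WORK(): drops any pending state and points entry at itself. */
static void init_work_model(struct work_model *w, int fn_id)
{
	init_list_head(&w->entry);
	w->pending = false;
	w->fn_id = fn_id;
}

/* Like queue_work(): 1 = queued, 0 = already pending, -1 = list check failed. */
static int queue_work_model(struct list_head *wq, struct work_model *w)
{
	if (w->pending)
		return 0;
	w->pending = true;
	return list_add_tail_checked(&w->entry, wq) ? 1 : -1;
}

/* Kill and release paths share one work item (one css->destroy_work). */
static int shared_item_second_enqueue(void)
{
	struct list_head wq;
	struct work_model destroy;

	init_list_head(&wq);
	init_work_model(&destroy, 1);	/* kill path */
	queue_work_model(&wq, &destroy);
	init_work_model(&destroy, 2);	/* release path, first job not yet run */
	return queue_work_model(&wq, &destroy);
}

/* Each path has its own work item, as the patch subject proposes. */
static int separate_items_second_enqueue(void)
{
	struct list_head wq;
	struct work_model killed, release;

	init_list_head(&wq);
	init_work_model(&killed, 1);
	init_work_model(&release, 2);
	queue_work_model(&wq, &killed);
	return queue_work_model(&wq, &release);
}
```

In this model `shared_item_second_enqueue()` reports a failed list check (-1), while `separate_items_second_enqueue()` queues both jobs (1), because each job then owns its own list entry.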

IIUC, here you describe the same scenario I broke down at [1].

Right, except for the last css_put(), which I think is called from cgroup_kn_unlock().
See below.

Does it sound right?

I added a parameter A there (that is, the sum of the base and percpu references
before kill_css()).
I thought it fails because A == 1 (i.e. the base reference is the only one, so
killing it drops the count to zero); however, that seems an unlikely situation,
because the cgroup code uses a "fuse" reference to pin the css for offline_css().

So the remaining option (at least I find it more likely now) is that
A == 0 (A < 0 would trigger the warning in
percpu_ref_switch_to_atomic_rcu()), i.e. a reference imbalance. I hope we can
get to the bottom of this with sufficiently detailed tracing of gets/puts.
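The case analysis of A can be illustrated with a single-threaded counter model. This is a hypothetical sketch rather than the percpu_ref implementation: it ignores the per-cpu/atomic mode switch and RCU, and `struct css_ref_model` and the `model_*()` helpers are invented names. It assumes the "fuse" reference is taken before the base reference is killed and dropped only after offline_css() has run.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-threaded model of a css reference count. */
struct css_ref_model {
	long refs;		/* current count; equals A right before the kill */
	bool offlined;		/* offline_css() has run */
	bool released;		/* release was triggered */
	bool early_release;	/* release was triggered before offlining */
};

/* Drops one reference; hitting zero triggers the release. */
static void model_put(struct css_ref_model *c)
{
	if (--c->refs == 0) {
		c->released = true;
		c->early_release = !c->offlined;
	}
}

/* Kill path: pin with the "fuse" reference, then drop the base reference. */
static void model_kill(struct css_ref_model *c)
{
	c->refs++;
	model_put(c);
}

/* Killed work: offline the css, then drop the fuse reference. */
static void model_killed_work(struct css_ref_model *c)
{
	c->offlined = true;
	model_put(c);
}
```

With A == 1 the fuse keeps the count above zero until the killed work has run, so the release comes after offlining; with A == 0 the kill itself drops the count to zero and the release is triggered before offlining. The model does not cover A < 0.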

Splitting the work struct contradicts the existing approach with
the "fuse" reference.

(BTW, you also wrote: On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk <tadeusz.struk@xxxxxxxxxx> wrote:

The fact that css_release() is called (via cgroup_kn_unlock()) just after
kill_css() causes css->destroy_work to be enqueued twice on the same WQ
(cgroup_destroy_wq), just with a different function. This results in the
"BUG: corrupted list in insert_work" issue.

Where do you see a critical css_release() called from cgroup_kn_unlock()?
I always observed the css_release() being called via
percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot

It goes like this:
cgroup_kn_unlock(kn)->cgroup_put(cgrp)->css_put(&cgrp->self), which
brings the refcnt to zero and triggers css_release().
I think what's missing is something that will serialize the kill
and release paths. I will try to put something together today.
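One hypothetical way to serialize the kill and release paths while keeping a single work item is to let the release path defer itself while the shared item is still queued, and to have the worker queue the deferred release after it has dequeued and run the kill job. The single-threaded model below illustrates only that ordering; all names are invented, it is not a proposed patch, and a real implementation would also need the check-and-defer step to be atomic with respect to the worker, for example under a lock.

```c
#include <assert.h>
#include <stdbool.h>

enum job { JOB_NONE, JOB_KILLED, JOB_RELEASE };

/* Hypothetical model of a css with one shared work item. */
struct css_model {
	enum job queued;	/* job held by the shared work item, if any */
	bool release_deferred;	/* release arrived while the item was busy */
	bool offlined;
	bool released;
};

/* Queues @j on the shared item; false means it was still queued. */
static bool queue_shared(struct css_model *c, enum job j)
{
	if (c->queued != JOB_NONE)
		return false;
	c->queued = j;
	return true;
}

/* Kill confirmation path (cf. css_killed_ref_fn()). */
static bool on_kill_confirmed(struct css_model *c)
{
	return queue_shared(c, JOB_KILLED);
}

/* Release path (cf. css_release()): defer instead of re-queuing a busy item. */
static void on_release(struct css_model *c)
{
	if (!queue_shared(c, JOB_RELEASE))
		c->release_deferred = true;
}

/* Worker: dequeue the item, run its job, then queue any deferred release. */
static void run_worker(struct css_model *c)
{
	enum job j = c->queued;

	c->queued = JOB_NONE;
	if (j == JOB_KILLED)
		c->offlined = true;
	else if (j == JOB_RELEASE)
		c->released = true;

	if (c->release_deferred && queue_shared(c, JOB_RELEASE))
		c->release_deferred = false;
}
```

In the model, a release that arrives before the killed work has run is recorded in `release_deferred` and queued by `run_worker()` once the shared item is free, so the second enqueue never targets an item that is still queued.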

--
Thanks,
Tadeusz
