Re: [Phishing Risk] [External] Re: [RFC PATCH] perf/core: fix cpuctx cgrp warning

From: Chengming Zhou
Date: Thu Mar 10 2022 - 07:02:04 EST


Hello,

On 2022/3/10 5:25 下午, Namhyung Kim wrote:
> Hello,
>
> On Tue, Mar 8, 2022 at 6:00 AM Chengming Zhou
> <zhouchengming@xxxxxxxxxxxxx> wrote:
>>
>> There is a race problem that can trigger WARN_ON_ONCE(cpuctx->cgrp)
>> in perf_cgroup_switch().
>>
>> CPU1 CPU2
>> (in context_switch) (attach running task)
>> perf_cgroup_sched_out(task, next)
>> if (cgrp1 != cgrp2) True
>> task->cgroups = xxx
>> perf_cgroup_attach()
>> perf_cgroup_sched_in(prev, task)
>> if (cgrp1 != cgrp2) False
>
> But perf_cgroup_switch will be synchronized as the context switch
> disables the interrupt. And right, it still can see the task->cgroups
> is changing in the middle.
>
>>
>> The commit a8d757ef076f ("perf events: Fix slow and broken cgroup
>> context switch code") would save cpuctx switch in/out when the
>> perf_cgroup of "prev" and "next" are the same.
>>
>> But perf_cgroup of task can change in concurrent with context_switch.
>> If cgrp1 == cgrp2 in sched_out(), cpuctx won't do switch out, then
>> task perf_cgroup changed cause cgrp1 != cgrp2 in sched_in(), cpuctx
>> will do switch in, and trigger WARN_ON_ONCE(cpuctx->cgrp).
>>
>> The perf_cgroup of "prev" and "next" can be changed at any time, so we
>> first have to combine perf_cgroup_sched_in() into perf_cgroup_sched_out(),
>> so we can get a consistent value of condition (cgrp1 == cgrp2).
>>
>> And we introduce a percpu "cpu_perf_cgroups" to track the current used
>> perf_cgroup, instead of using the unstable perf_cgroup of "prev", which
>> maybe not the cpuctx->cgrp we used to schedule cgroup events on cpu.
>
> Is this really needed? I think the warning comes because the two
> cgroups were the same when in sched-out, but they became
> different when in sched-in. So just combining sched-in/out should
> be ok, isn't it?

If we get perf_cgroup from prev->cgroups that can be changed in the
context_switch(), make the condition (cgrp1 == cgrp2) is true, then
we won't do sched_out/in. So the events of prev's previous cgrp will
still be on the CPU.

Even that CPU would receive IPI from perf_cgroup_attach() after
context_switch(), remote_function() will do nothing because prev task
is not current running anymore.

>
>>
>> Fixes: a8d757ef076f ("perf events: Fix slow and broken cgroup context
>> switch code")
>> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
>> ---
>> kernel/events/core.c | 95 +++++++++++---------------------------------
>> 1 file changed, 23 insertions(+), 72 deletions(-)
>>
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 6859229497b1..f3bc2841141f 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -826,6 +826,7 @@ perf_cgroup_set_timestamp(struct task_struct *task,
>> }
>> }
>>
>> +static DEFINE_PER_CPU(struct perf_cgroup *, cpu_perf_cgroups);
>> static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>>
>> #define PERF_CGROUP_SWOUT 0x1 /* cgroup switch out every event */
>> @@ -837,8 +838,9 @@ static DEFINE_PER_CPU(struct list_head, cgrp_cpuctx_list);
>> * mode SWOUT : schedule out everything
>> * mode SWIN : schedule in based on cgroup for next
>
> You can remove this comment now.

Ok, will do.

>
>> */
>> -static void perf_cgroup_switch(struct task_struct *task, int mode)
>> +static void perf_cgroup_switch(struct task_struct *task)
>> {
>> + struct perf_cgroup *cgrp;
>> struct perf_cpu_context *cpuctx, *tmp;
>> struct list_head *list;
>> unsigned long flags;
>> @@ -849,6 +851,9 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>> */
>> local_irq_save(flags);
>>
>> + cgrp = perf_cgroup_from_task(task, NULL);
>> + __this_cpu_write(cpu_perf_cgroups, cgrp);
>> +
>> list = this_cpu_ptr(&cgrp_cpuctx_list);
>> list_for_each_entry_safe(cpuctx, tmp, list, cgrp_cpuctx_entry) {
>> WARN_ON_ONCE(cpuctx->ctx.nr_cgroups == 0);
>> @@ -856,28 +861,15 @@ static void perf_cgroup_switch(struct task_struct *task, int mode)
>> perf_ctx_lock(cpuctx, cpuctx->task_ctx);
>> perf_pmu_disable(cpuctx->ctx.pmu);
>>
>> - if (mode & PERF_CGROUP_SWOUT) {
>> - cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>> - /*
>> - * must not be done before ctxswout due
>> - * to event_filter_match() in event_sched_out()
>
> Unrelated, but I don't see the event_filter_match() in
> event_sched_out() anymore. Does it sched-out all
> non-cgroup cpu events here?

Yes, I review the code and don't find event_filter_match(),
so cpu_ctx_sched_out() will sched-out all cpu events.

And I find event_filter_match() won't work here too,
because perf_cgroup_match() return matched for any
non-cgroup event. Maybe we can add another function
like perf_cgroup_match_sched_out() to use when sched-out.

>
>> - */
>> - cpuctx->cgrp = NULL;
>> - }
>> + cpu_ctx_sched_out(cpuctx, EVENT_ALL);
>> + /*
>> + * must not be done before ctxswout due
>> + * to event_filter_match() in event_sched_out()
>> + */
>> + cpuctx->cgrp = cgrp;
>
> Maybe we can check cpuctx->cgrp is the same as task's
> cgroup before accessing the pmu. As in the commit message
> it can call perf_cgroup_switch() after the context switch so
> the cgroup events might be scheduled already.

Good point, will do.

Thanks.

>
> Thanks,
> Namhyung
>
>
>> +
>> + cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>>
>> - if (mode & PERF_CGROUP_SWIN) {
>> - WARN_ON_ONCE(cpuctx->cgrp);
>> - /*
>> - * set cgrp before ctxsw in to allow
>> - * event_filter_match() to not have to pass
>> - * task around
>> - * we pass the cpuctx->ctx to perf_cgroup_from_task()
>> - * because cgorup events are only per-cpu
>> - */
>> - cpuctx->cgrp = perf_cgroup_from_task(task,
>> - &cpuctx->ctx);
>> - cpu_ctx_sched_in(cpuctx, EVENT_ALL, task);
>> - }
>> perf_pmu_enable(cpuctx->ctx.pmu);
>> perf_ctx_unlock(cpuctx, cpuctx->task_ctx);
>> }