Re: [tip:perf/core] perf: Add cgroup support

From: Stephane Eranian
Date: Thu Feb 17 2011 - 09:45:15 EST


On Thu, Feb 17, 2011 at 12:36 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
> On Thu, 2011-02-17 at 12:16 +0100, Stephane Eranian wrote:
>> Peter,
>>
>> On Wed, Feb 16, 2011 at 5:57 PM, Peter Zijlstra <a.p.zijlstra@xxxxxxxxx> wrote:
>> > On Wed, 2011-02-16 at 13:46 +0000, tip-bot for Stephane Eranian wrote:
>> >> +static inline struct perf_cgroup *
>> >> +perf_cgroup_from_task(struct task_struct *task)
>> >> +{
>> >> +       return container_of(task_subsys_state(task, perf_subsys_id),
>> >> +                       struct perf_cgroup, css);
>> >> +}
>> >
>> > ===================================================
>> > [ INFO: suspicious rcu_dereference_check() usage. ]
>> > ---------------------------------------------------
>> > include/linux/cgroup.h:547 invoked rcu_dereference_check() without protection!
>> > other info that might help us debug this:
>> > rcu_scheduler_active = 1, debug_locks = 1
>> > 1 lock held by perf/1774:
>> >  #0:  (&ctx->lock){......}, at: [<ffffffff810afb91>] ctx_sched_in+0x2a/0x37b
>> > stack backtrace:
>> > Pid: 1774, comm: perf Not tainted 2.6.38-rc5-tip+ #94017
>> > Call Trace:
>> >  [<ffffffff81070932>] ? lockdep_rcu_dereference+0x9d/0xa5
>> >  [<ffffffff810afc4e>] ? ctx_sched_in+0xe7/0x37b
>> >  [<ffffffff810aff37>] ? perf_event_context_sched_in+0x55/0xa3
>> >  [<ffffffff810b0203>] ? __perf_event_task_sched_in+0x20/0x5b
>> >  [<ffffffff81035714>] ? finish_task_switch+0x49/0xf4
>> >  [<ffffffff81340d60>] ? schedule+0x9cc/0xa85
>> >  [<ffffffff8110a84c>] ? vfsmount_lock_global_unlock_online+0x9e/0xb0
>> >  [<ffffffff8110b556>] ? mntput_no_expire+0x4e/0xc1
>> >  [<ffffffff8110b5ef>] ? mntput+0x26/0x28
>> >  [<ffffffff810f2add>] ? fput+0x1a0/0x1af
>> >  [<ffffffff81002eb9>] ? int_careful+0xb/0x2c
>> >  [<ffffffff813432bf>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>> >  [<ffffffff81002ec7>] ? int_careful+0x19/0x2c
>> >
>> >
>> I have lockdep enabled in my kernel and during all my tests
>> I never saw this warning. How did you trigger this?
>
> CONFIG_PROVE_RCU=y, its a bit of a shiny feature but most of the false
> positives are gone these days I think.
>
I have this one enabled, yet no message.

>> > The simple fix seemed to be to add:
>> >
>> > diff --git a/kernel/perf_event.c b/kernel/perf_event.c
>> > index a0a6987..e739e6f 100644
>> > --- a/kernel/perf_event.c
>> > +++ b/kernel/perf_event.c
>> > @@ -204,7 +204,8 @@ __get_cpu_context(struct perf_event_context *ctx)
>> >  static inline struct perf_cgroup *
>> >  perf_cgroup_from_task(struct task_struct *task)
>> >  {
>> > -       return container_of(task_subsys_state(task, perf_subsys_id),
>> > +       return container_of(task_subsys_state_check(task, perf_subsys_id,
>> > +                               lockdep_is_held(&ctx->lock)),
>> >                         struct perf_cgroup, css);
>> >  }
>> >
>> > For all callers _should_ hold ctx->lock and ctx->lock is acquired during
>> > ->attach/->exit so holding that lock will pin the cgroup.
>> >
>> I am not sure I follow you here. Are you talking about cgroup_attach()
>> and cgroup_exit()? perf_cgroup_switch() does eventually grab ctx->lock
>> when it gets to the actual save and restore functions. But
>> perf_cgroup_from_task()
>> is called outside of those sections in perf_cgroup_switch().
>
> Right, but there we hold rcu_read_lock().
>
> So what we're saying here is that its ok to dereference the variable
> provided we hold either:
>  - rcu_read_lock
>  - task->alloc_lock
>  - cgroup_lock
>
> or
>
>  - ctx->lock
>
> task->alloc_lock and cgroup_lock both avoid any changes to the current
> task's cgroup due to kernel/cgroup.c locking. ctx->lock avoids this due
> to us taking that lock in perf_cgroup_attach() and perf_cgroup_exit()
> when this task is active.
>
We do not take ctx->lock in those functions (at least not directly).
Both functions end up in perf_cgroup_switch() which does rcu_read_lock()
for all its operations. ctx->lock becomes held once you get into ctx_sched_out()
or ctx_sched_in(). But according to what you're saying above, that should
cover it.
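
To make the rule above concrete, here is a minimal userspace sketch of the
"any one of these protections is enough" check. It is only a model: the
struct and function names (held_locks, base_protection, deref_allowed) are
hypothetical and are not kernel APIs. It merely mirrors the idea of passing
an extra condition, such as ctx->lock being held, to a checked dereference.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical model: which protections the caller currently holds. */
struct held_locks {
	bool rcu_read_lock;
	bool task_alloc_lock;
	bool cgroup_lock;
	bool ctx_lock;
};

/* Base condition: the three protections listed first in the thread. */
static bool base_protection(const struct held_locks *h)
{
	return h->rcu_read_lock || h->task_alloc_lock || h->cgroup_lock;
}

/*
 * Model of a checked dereference: allowed if the base condition holds
 * or if the caller-supplied extra condition (here: ctx->lock held) does.
 * Otherwise print a warning, loosely imitating the lockdep splat.
 */
static bool deref_allowed(const struct held_locks *h, bool extra_cond)
{
	if (base_protection(h) || extra_cond)
		return true;

	fprintf(stderr, "suspicious dereference: no protection held\n");
	return false;
}
```

In this model, a caller holding only ctx->lock passes the check because it
supplies that fact as the extra condition, while a caller holding nothing
gets the warning.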

>> > However, not all update_context_time()/update_cgrp_time_from_event()
>> > callers actually hold ctx->lock, which is a bug because that lock also
>> > serializes the timestamps.
>> >
>> > Most notably, task_clock_event_read(), which leads us to:
>> >
>>
>> If the warning comes from invoking perf_cgroup_from_task(), then there is also
>> perf_cgroup_switch(). That one is not grabbing any ctx->lock either, but maybe
>> not on all paths.
>>
>> > @@ -5794,9 +5795,14 @@ static void task_clock_event_read(struct perf_event *event)
>> >         u64 time;
>> >
>> >         if (!in_nmi()) {
>> > -               update_context_time(event->ctx);
>> > +               struct perf_event_context *ctx = event->ctx;
>> > +               unsigned long flags;
>> > +
>> > +               spin_lock_irqsave(&ctx->lock, flags);
>> > +               update_context_time(ctx);
>> >                 update_cgrp_time_from_event(event);
>> > -               time = event->ctx->time;
>> > +               time = ctx->time;
>> > +               spin_unlock_irqrestore(&ctx->lock, flags);
>> >         } else {
>> >                 u64 now = perf_clock();
>> >                 u64 delta = now - event->ctx->timestamp;
>
> I just thought we should probably kill the !in_nmi branch, I'm not quite
> sure why that exists..

I don't quite understand what this event is supposed to count in system-wide
mode. This function adds a time delta. It may be using the wrong time source
in cgroup mode.

Having said that, it seems to me like we may not even need the call to
update_cgrp_time_from_event() there. It is not even used to compute
the time delta in that function. Yet, we do get correct timings in cgroup
mode. Thus, I suspect the timing is taken care of by callers already whenever
needed. I looked at the pmu->read() callers, and it seems they do exactly
that. In summary, I believe we may be able to drop this call.

>
>> > I then realized that the events themselves pin the cgroup, so its all
>> > cosmetic at best, but then I already had the below patch...
>> >
>> I assume by 'pin the group' you mean the cgroup cannot disappear
>> while there is at least one event pointing to it. That is indeed true
>> thanks to refcounting (css_get()).
>
> Right, that's what I was thinking, but now I think that's not
> sufficient, we can have cgroups without events but with tasks in for
> which the races are still valid.
>
But in that case, no perf_event code should be fiddling with cgroups.
I think there are guards for that, either is_cgroup_event() or ctx->nr_cgroups.

But it seems update_cgrp_time_from_event() is the one exception. So maybe
we could rewrite it:

static inline void update_cgrp_time_from_event(struct perf_event *event)
{
        struct perf_cgroup *cgrp;

        if (!is_cgroup_event(event))
                return;

        cgrp = perf_cgroup_from_task(current);
        /*
         * do not update time when cgroup is not active
         */
        if (cgrp != event->cgrp)
                return;

        __update_cgrp_time(event->cgrp);
}
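
For illustration only, the effect of that early guard can be modeled in
userspace. Everything prefixed model_ below, plus the task_lookups counter
and the current_cgrp variable, is a hypothetical stand-in and not kernel
code. The sketch assumes is_cgroup_event() amounts to a non-NULL test on
event->cgrp, and it shows that a non-cgroup event never reaches the task
lookup.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel structures. */
struct perf_cgroup { int updates; };
struct perf_event { struct perf_cgroup *cgrp; };

static int task_lookups;                  /* counts modeled task lookups */
static struct perf_cgroup *current_cgrp;  /* cgroup of the "current" task */

/* Stand-in for perf_cgroup_from_task(current). */
static struct perf_cgroup *model_cgroup_from_task(void)
{
	task_lookups++;
	return current_cgrp;
}

/* Assumed semantics: an event is a cgroup event if it has a cgroup. */
static int model_is_cgroup_event(const struct perf_event *event)
{
	return event->cgrp != NULL;
}

static void model_update_cgrp_time_from_event(struct perf_event *event)
{
	struct perf_cgroup *cgrp;

	/* Guard first: non-cgroup events never touch the task's cgroup. */
	if (!model_is_cgroup_event(event))
		return;

	cgrp = model_cgroup_from_task();
	/* Do not update time when the event's cgroup is not active. */
	if (cgrp != event->cgrp)
		return;

	event->cgrp->updates++;  /* stands in for __update_cgrp_time() */
}
```

With the guard placed first, the lookup counter only moves for events that
are actually attached to a cgroup.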


> Also:
>
> ---
> diff --git a/kernel/perf_event.c b/kernel/perf_event.c
> index a0a6987..ab28e56 100644
> --- a/kernel/perf_event.c
> +++ b/kernel/perf_event.c
> @@ -7330,12 +7330,10 @@ static struct cgroup_subsys_state *perf_cgroup_create(
>         struct perf_cgroup_info *t;
>         int c;
>
> -       jc = kmalloc(sizeof(*jc), GFP_KERNEL);
> +       jc = kzalloc(sizeof(*jc), GFP_KERNEL);
>         if (!jc)
>                 return ERR_PTR(-ENOMEM);
>
> -       memset(jc, 0, sizeof(*jc));
> -
>         jc->info = alloc_percpu(struct perf_cgroup_info);
>         if (!jc->info) {
>                 kfree(jc);
>
Yep.
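
As a rough userspace analogy for that cleanup (not the kernel code itself),
malloc() followed by memset() can be folded into a single calloc(), much as
kmalloc() plus memset() becomes kzalloc(). The struct model_cgroup and both
helper names below are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the allocated structure. */
struct model_cgroup {
	void *info;
	int flags;
};

/* Two-step form: allocate, then clear by hand. */
static struct model_cgroup *alloc_two_step(void)
{
	struct model_cgroup *jc = malloc(sizeof(*jc));

	if (!jc)
		return NULL;
	memset(jc, 0, sizeof(*jc));
	return jc;
}

/* One-step form: the allocator returns zeroed memory. */
static struct model_cgroup *alloc_zeroed(void)
{
	return calloc(1, sizeof(struct model_cgroup));
}
```

Both helpers hand back an all-zero object; the one-step form simply leaves
no window in which the memory is allocated but not yet cleared.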