Re: Possible race between CPU hotplug and perf_pmu_migrate_context

From: Linus Torvalds
Date: Fri Sep 05 2014 - 13:31:55 EST


On Fri, Sep 5, 2014 at 9:59 AM, Mark Rutland <mark.rutland@xxxxxxx> wrote:
>
> As you point out below, the race on event->ctx is the fundamental issue. That
> is what results in decrementing the refcount twice (once on a stale event->ctx
> pointer).

So quite frankly, the whole perf_pmu_migrate_context() thing looks
completely and fundamentally broken.

Your patch doesn't really solve anything in general, and would need to
be extended to do that get_online_cpus() around every single case
where we read it. Which isn't really acceptable.

perf_pmu_migrate_context() is just f*cked up. It needs to be reverted.
It seems to think that "the event->ctx" pointer is a RCU-protected
pointer, but it really isn't. Its lifetime is protected by
refcounting, and the pointer access itself isn't protected by
anything, because "event->ctx" isn't supposed to change over the
lifetime of "event". Nobody does all the necessary rcu_read_[un]lock()
around it, nor access it with rcu_dereference(), because all users
know that it's just fixed.

Quite frankly, I think commit 0cda4c023132 ("perf: Introduce
perf_pmu_migrate_context()") is unfixably broken. The whole thing
needs to be re-thought. It violates everything that everybody else
does with the context. No amount of (sane) locking can fix the
breakage, I suspect.

A possible (but unfortunate) real fix is to try to keep the broken
code around, and make its broken assumptions be actually true. Make
"event->ctx" truly be an RCU pointer. Add the RCU annotations, add the
required rcu read-locking, and make everything really do what that
currently completely broken perf_pmu_migrate_context() thinks it does.

That implies, for example, that any time you use the thing in a
sleeping context, you'd need to do something like this (possibly with
a helper function to do that: "get_event_ctx()"

rcu_read_lock();
ctx = rcu_dereference(event->rcu);
get_ctx(ctx);
rcu_read_unlock();

.. you can now sleep here and use 'ctx' ...

put_ctx(ctx);

and if you don't need to sleep, you can just do

rcu_read_lock();
ctx = rcu_dereference(event->ctx);
.. use ctx in an atomic context ...
rcu_read_unlock();

but that is a much bigger patch than either of the "introduce a
completely broken ad-hoc lock around just one random use case"
patches.

Alternatively (and this sounds like the better fix), use some way to
avoid that broken perf_pmu_migrate_context() model entirely. Never
"migrate" events. Create completely new ones on the new CPU, and leave
the old stale ones alone even on dead CPU's (they will presumably not
actually be activated, and who the hell cares?).

Or even just say: "if somebody takes down a CPU with existing uncore
events, those events are now effectively dead". Don't try to migrate
them, don't try to create new events on another CPU, just let it go.
The CPU is down, the events are inactive.

Keep it simple. Not the current completely broken thing that clearly
doesn't honor the actual real rules for "event->ctx". Because the
current rules are effectively that event->ctx is a constant over the
whole lifetime of the event. Agreed?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/