Re: [perf] more perf_fuzzer memory corruption

From: Peter Zijlstra
Date: Tue Apr 29 2014 - 05:46:41 EST


On Mon, Apr 28, 2014 at 03:38:38PM -0400, Vince Weaver wrote:
>
> OK, this is my current theory as to what's going on. I'd appreciate any
> comments.
>
>
> We have an event, let's call it #16.
>
> Event #16 is a SW event created and running in the parent on CPU0.

A regular software one, right? Not a timer one.

> CPU0 (parent): calls fork()
>
> CPU6 (child): SW Event #16 is still running on CPU0 but is visible
> on CPU6 because the fd passed through with fork
>
> CPU0 (parent) close #16. Event not deallocated because
> still visible in child
>
> CPU0 (parent) kill child

OK so far..

> CPU6 (child) shutting down.
> last user of event #16
> perf_release() called on event
> which eventually calls event_sched_out()
> which calls pmu->del which removes event from swevent_htable
> *but only on CPU6*

So on fork() we'll also clone the counter; after which there's two. One
will run on each task.

Because of a context switch optimization they can actually flip around
(the below patch disables that).

So there's (possibly) two events being killed here:

1) the event that is attached to the child process, it will be detached
and freed. do_exit() -> perf_event_exit_task() ->
perf_event_exit_task_context() -> __perf_event_exit_task()

If this is the cloned event, it will put the original event and be
freed here.

If the child ran the cloned event; then:
2) the closing of the fd, coupled with the put of 1) will drop the
refcount of the original event to 0 and it too will be removed and
freed.

IF however the original and cloned events were flipped at the time; the
child exit will detach the original event, but since the parent will
still have a cloned event attached, the clone will keep the event alive.

In this case no events will be freed until the parent too exits; at
which point the cloned event will get detached and freed. That will put
the last reference on the actual event, and that too will go.


Now, seeing that you actually see an event getting freed, we'll have to
assume the former situation, where the original event is on the parent
process and the cloned event is on the child process.

> **** some sort of race happens with CPU0 (possibly with
> event_sched_in() and event->state==PERF_EVENT_STATE_INACTIVE)
> That has event #16 in the cpu0 swevent_htable but not
> freed the next time ctx_sched_out() happens ****

So on do_exit(), exit_files() happens first, it will drop the refcount
of the original event to 1.

After that, perf_event_exit_task() happens, it will (as per the
callchain above) end up in __perf_event_exit_task().

__perf_event_exit_task() will call perf_group_detach, however no groups
involved here afaik, so that's quickly done.

It will then call perf_remove_from_context() which will try to
deschedule (which is likely already done by
perf_event_exit_task_context() which de-schedules the entire context in
one go), and then remove the event from the context.

Since it is the cloned event; it will then call sync_child_event(),
whicih will push whatever counts it has gathered into the original
(parent) event, and detach itself from the parent.

This will have done put_event(parent_event), which will drop the
refcount of the original event to 0. put_event() will in turn call the
same things: perf_group_detach() -- no groups, done quickly.
perf_remove_from_context(), this will IPI from CPU6 to CPU0, and
deschedule the original event, calling ->del() on CPU0, and as per the
above continue doing list_del_event() detaching itself from the context.

After the IPI put_event() will continue with _free_event() and we'll
call ->destroy() and call_rcu and the event will be no more.

After all that, the child continues calling free_event() which will also
call ->desotry() (but on the child event) and do the same call_rcu()
also freeing this event.

Nothing left.

> CPU6 (idle) grace period expires, kfree happens
>
> the CPU0 hlist still has in the list the now freed (and poisoned)
> event which causes problems, especially as new events added to
> the list over-write bytes starting at 0x48 with pprev values.

Right, so clearly something's gone funny.. above you speculate on a race
against event_sched_in(), but both paths serialize on event->ctx->lock.

__perf_remove_from_context() takes ctx->lock, as do the sched_in/out
paths.

quite the puzzle this one
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/