[PATCHv2] perf: Prevent concurrent ring buffer access

From: Jiri Olsa
Date: Sun Sep 23 2018 - 12:23:11 EST


On Thu, Sep 13, 2018 at 11:37:54AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 13, 2018 at 09:46:07AM +0200, Jiri Olsa wrote:
> > On Thu, Sep 13, 2018 at 09:07:40AM +0200, Peter Zijlstra wrote:
> > > On Wed, Sep 12, 2018 at 09:33:17PM +0200, Jiri Olsa wrote:
> > > > Some of the scheduling tracepoints allow the perf_tp_event
> > > > code to write to the ring buffer of a different cpu than the
> > > > one the code is running on.
> > >
> > > ARGH.. that is indeed borken.
>
> > I was first thinking to just leave it on the current cpu,
> > but not sure current users would be ok with that ;-)
>
> > ---
> > diff --git a/kernel/events/core.c b/kernel/events/core.c
> > index abaed4f8bb7f..9b534a2ecf17 100644
> > --- a/kernel/events/core.c
> > +++ b/kernel/events/core.c
> > @@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
> > continue;
> > if (event->attr.config != entry->type)
> > continue;
> > + if (event->cpu != smp_processor_id())
> > + continue;
> > if (perf_tp_event_match(event, &data, regs))
> > perf_swevent_event(event, count, &data, regs);
> > }
>
> That might indeed be the best we can do.
>
> So the whole TP muck would be responsible for placing only matching
> events on the hlist, which is where our normal CPU filter is I think.
>
> The above then does the same for @task. Which without this would also be
> getting nr_cpus copies of the event I think.
>
> It does mean not getting any events if the @task only has a per-task
> buffer, but there's nothing to be done about that. And I'm not even sure
> we can create a useful warning for that :/

ok, sending the full patch (v2) with the above change

cc-ing Andrew Vagin, who added this feature,
because this patch changes the way it works

thanks,
jirka


---
Some of the scheduling tracepoints allow the perf_tp_event code
to write to the ring buffer of a different cpu than the one the
code is running on.

This results in corrupted ring buffer data, as demonstrated by
the following perf commands:

# perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
# Running 'sched/messaging' benchmark:
# 20 sender and receiver processes per group
# 10 groups == 400 processes run

Total time: 0.383 [sec]
[ perf record: Woken up 8 times to write data ]
0x42b890 [0]: failed to process type: -1765585640
[ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

# perf report --stdio
0x42b890 [0]: failed to process type: -1765585640

The reason for the corruption is that some of the scheduling
tracepoints have __perf_task defined and thus allow storing data
into another cpu's ring buffer (see the sketch after the list):

sched_waking
sched_wakeup
sched_wakeup_new
sched_stat_wait
sched_stat_sleep
sched_stat_iowait
sched_stat_blocked
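
For illustration, this is roughly how the __perf_task annotation is
wired into a tracepoint declaration; the excerpt below is a trimmed
sketch of the sched_stat_* event class (see include/trace/events/sched.h
for the real definition):

  DECLARE_EVENT_CLASS(sched_stat_template,

          TP_PROTO(struct task_struct *tsk, u64 delay),

          /*
           * __perf_task(tsk) asks perf to deliver the sample to the
           * events attached to 'tsk'; __perf_count(delay) supplies
           * the sample count.  For plain ftrace output both simply
           * pass their argument through.
           */
          TP_ARGS(__perf_task(tsk), __perf_count(delay)),
          ...
  );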

The perf_tp_event function first stores samples for the current
cpu's events defined for the tracepoint:

hlist_for_each_entry_rcu(event, head, hlist_entry)
perf_swevent_event(event, count, &data, regs);

It then iterates over the events of the 'task' and stores the
sample for any of the task's events that pass the tracepoint
checks:

ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
if (event->attr.type != PERF_TYPE_TRACEPOINT)
continue;
if (event->attr.config != entry->type)
continue;

perf_swevent_event(event, count, &data, regs);
}

The above code can race with the same code running on another cpu,
ending up with 2 cpus trying to store into the same ring buffer,
which is not handled at the moment.
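
For illustration, a simplified interleaving that produces the
corruption; it assumes an event whose ring buffer belongs to cpu 0,
reached from cpu 1 through the task path above (intermediate calls
omitted):

  cpu 0                               cpu 1
  -----                               -----
  tracepoint fires                    sched_wakeup fires, __perf_task()
  perf_tp_event()                     points at the monitored task
    current-cpu hlist path            perf_tp_event()
    perf_swevent_event(event)           task context path, event->cpu == 0
      write sample to event's rb        perf_swevent_event(event)
                                          write sample to the same rb

Both cpus move the head of the same ring buffer without any cross-cpu
serialization (the output path is only protected against nesting on a
single cpu), so the records overlap and the resulting perf.data can no
longer be parsed, giving the "failed to process type" error above.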

This patch prevents the race by allowing only events whose cpu
matches the current cpu (event->cpu == smp_processor_id()) to
receive the sample.

Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
---
kernel/events/core.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/kernel/events/core.c b/kernel/events/core.c
index c80549bf82c6..f269f666510c 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -8308,6 +8308,8 @@ void perf_tp_event(u16 event_type, u64 count, void *record, int entry_size,
goto unlock;

list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
+ if (event->cpu != smp_processor_id())
+ continue;
if (event->attr.type != PERF_TYPE_TRACEPOINT)
continue;
if (event->attr.config != entry->type)
--
2.17.1