Re: [RFC][PATCH] x86/mm: Sync all vmalloc mappings before text_poke()

From: Steven Rostedt
Date: Thu Apr 30 2020 - 12:11:42 EST


On Thu, 30 Apr 2020 16:11:21 +0200
Joerg Roedel <jroedel@xxxxxxx> wrote:

> Hi,
>
> On Wed, Apr 29, 2020 at 10:07:31AM -0400, Steven Rostedt wrote:
> > Talking with Mathieu about this on IRC, he pointed out that my code does
> > have a vzalloc() that is called:
> >
> > in trace_pid_write()
> >
> > pid_list->pids = vzalloc((pid_list->pid_max + 7) >> 3);
> >
> > This is done when -P1,2 is on the trace-cmd command line.
>
> Okay, tracked it down, some instrumentation in the page-fault and
> double-fault handler gave me the stack-traces. Here is what happens:
>
> As already pointed out, it all happens because of page-faults on the
> vzalloc'ed pid bitmap. It starts with this stack-trace:
>
> RIP: 0010:trace_event_ignore_this_pid+0x23/0x30

Interesting. Because that function is this:

bool trace_event_ignore_this_pid(struct trace_event_file *trace_file)
{
struct trace_array *tr = trace_file->tr;
struct trace_array_cpu *data;
struct trace_pid_list *no_pid_list;
struct trace_pid_list *pid_list;

pid_list = rcu_dereference_raw(tr->filtered_pids);
no_pid_list = rcu_dereference_raw(tr->filtered_no_pids);

if (!pid_list && !no_pid_list)
return false;

data = this_cpu_ptr(tr->array_buffer.data);

return data->ignore_pid;
}

Where it only sees if the pid masks exist. That is, it looks to see if
there's pointers to them, it doesn't actually touch the vmalloc'd area.
This check is to handle a race between allocating and deallocating the
buffers and setting the ignore_pid bit. The reading of these arrays is done
at sched_switch time, which sets or clears the ignore_pid field.

That said, since this only happens on buffer instances (it does not trigger
on the top level instance, which uses the same code for the pid masks)

Could this possibly be for the tr->array_buffer.data, which is allocated
with:

allocate_trace_buffer() {
[..]
buf->data = alloc_percpu(struct trace_array_cpu);

That is, the bug isn't the vmalloc being a problem, but perhaps the per_cpu
allocation. This would explain why this crashes with the buffer instance
and not with the top level instance. If it was related to the pid masks,
then it would trigger for either (because they act the same in allocating
at time of use). But when an instance is made, the tr->array_buffer.data is
created. Which for the top level happens at boot up and the pages would
have been synced long ago. But for a newly created instance, this happens
just before its used. This could possibly explain why it's not a problem
when doing it manually by hand, because the time between creating the
instance, and the time to start and stop the tracing, is long enough for
something to sync them page tables.

tl;dr; It's not an issue with the vmalloc, it's an issue with per_cpu
allocations!


-- Steve