Re: [RFC][PATCH 0/3] ftrace: Add dynamically allocated trampolines

From: Steven Rostedt
Date: Wed Jul 23 2014 - 11:48:40 EST


On Wed, 23 Jul 2014 14:08:05 +0200
Oleg Nesterov <oleg@xxxxxxxxxx> wrote:

> On 07/22, Steven Rostedt wrote:
> >
> > On Tue, 22 Jul 2014 18:47:07 +0200
> > Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> >
> > > On 07/03, Steven Rostedt wrote:
> > >
> > > > The way the function callback mechanism works in ftrace is that if there's
> > > > only one function callback registered, it will set the mcount/fentry
> > > > trampoline to call that function directly. But as soon as you register
> > > > another callback, the mcount trampoline calls a loop function that iterates
> > > > over all the registered callbacks (ftrace_ops) checking their hash tables
> > > > to see if the called function matches the ops before calling its callback.
> > > > This happens even if the two registered callbacks are not tracing
> > > > the same functions!
> > > >
> > > > This really sucks if you are tracing all functions and then add a kprobe
> > > > or perf event that traces a single function. That will cause every other
> > > > function being traced to go through the loop test.
> > >
> > > But isn't this even worse, or did I miss something? I mean, currently,
> > > even if you trace nothing and then add a KPROBE_FLAG_FTRACE kprobe,
> > > kprobe_ftrace_handler() is still called by ftrace_ops_list_func()?
> >
> > It shouldn't be. It should get called directly from the trampoline. The
> > allocated trampoline should never call the list op. Well, it might
> > during the conversion for safety, but after that, trampolines should
> > only call the registered ftrace_ops->func directly.
>
> I meant the current code (I am reading 3.16-rc2). Even if we have a single
> KPROBE_FLAG_FTRACE kprobe, kprobe_ftrace_handler() won't be called directly.
>

Oh, I thought we were still talking about the trampolines; those are not
in 3.16-rc2.

> Or I misunderstood your reply? Just in case, let me check...
>
> With this stupid patch
>
> --- a/kernel/trace/ftrace.c
> +++ b/kernel/trace/ftrace.c
> @@ -4464,6 +4464,7 @@ __ftrace_ops_list_func(unsigned long ip, unsigned long parent_ip,
>  				printk("op=%p %pS\n", op, op);
>  				goto out;
>  			}
> +			pr_crit("LIST_FUNC -> %pf()\n", op->func);
>  			op->func(ip, parent_ip, op, regs);
>  		}
>  	} while_for_each_ftrace_op(op);
>
> I do
> # cd /sys/kernel/debug/tracing/
> # echo "p:xx SyS_prctl+0x1c" >| kprobe_events
> # cat ../kprobes/list
> ffffffff81056c4c k SyS_prctl+0x1c [DISABLED][FTRACE]
> # echo 1 >| events/kprobes/xx/enable
> #
> # perl -e 'syscall 157,-1'
> # dmesg
> LIST_FUNC -> kprobe_ftrace_handler()
>
> so it is really called by the loop test code.
>
> And I guess that after your patches kprobe_ftrace_handler() should be called
> from the trampoline in this case.

No, it won't be. Not until we have Paul McKenney's task_rcu code, which
returns only after every task has either gone through userspace or
through a schedule. Hmm, maybe on !CONFIG_PREEMPT it would be OK. Oh, I
could make that work now on !CONFIG_PREEMPT. Maybe I'll do that too.

kprobe ftrace_ops are dynamically allocated, which sets the
FTRACE_OPS_FL_DYNAMIC flag. You'll see that flag checked in
update_ftrace_function(); if it is set, it forces the
ftrace_ops_list_func() to be used.
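
Roughly, that check looks like this (a simplified sketch of the
3.16-era logic; the real update_ftrace_function() also tests other
flags, such as FTRACE_OPS_FL_RECURSION_SAFE, and arch support):

    static void update_ftrace_function(void)
    {
            ftrace_func_t func;

            /*
             * With exactly one ops registered, and that ops not
             * dynamically allocated, the trampoline can call its
             * func directly. Otherwise, fall back to the loop.
             */
            if (ftrace_ops_list != &ftrace_list_end &&
                ftrace_ops_list->next == &ftrace_list_end &&
                !(ftrace_ops_list->flags & FTRACE_OPS_FL_DYNAMIC))
                    func = ftrace_ops_list->func;
            else
                    func = ftrace_ops_list_func;

            /* ... then patch the call sites to use func ... */
    }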

Why?

Because we cannot know whether a task has just jumped into the
trampoline carrying an ops that is about to be freed. Here's what I mean:


foo()
  [mcount called --> ftrace_caller trampoline]
  ftrace_caller
    load ftrace_ops into parameter
    <interrupt>
      preempt_schedule()
      [new task]
        kfree(kprobe ftrace_ops);
      [schedule back]
    call kprobe ftrace_ops func
      uses ops, but it was freed!
      [BOOM!]


A trampoline must not access anything that can be freed. In this case
the kprobe ftrace_ops can be freed, and that forces the list function
to be used. Why is the list function OK? Because it does:

	preempt_disable();
	loop over all registered ops
		ops->func();
	preempt_enable();

And after we disconnect an ops from the list, we call
schedule_on_each_cpu() before freeing it.

Why not use synchronize_sched()?

Because ftrace can be called outside of the areas that RCU is aware of,
including the RCU infrastructure itself. That means there's an extremely
rare case where synchronize_sched() can return before that loop over the
ops has finished.

Although it's extremely rare, and quite possibly impossible to hit, the
fact that it theoretically can happen means we must use the even slower
but guaranteed-safe schedule_on_each_cpu() to make sure nothing is still
inside that preempt-disabled section before we free the ops.
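
Conceptually, the removal path looks like this (a simplified sketch; in
the actual code, ftrace_shutdown() handles this when the
FTRACE_OPS_FL_DYNAMIC flag is set):

    /* Empty work function: the point is only that every CPU ran it */
    static void ftrace_sync(struct work_struct *work)
    {
    }

    /* Unlink the ops from the list so no new callers see it */
    unregister_ftrace_function(ops);

    /*
     * schedule_on_each_cpu() returns only after every CPU has
     * scheduled, so no CPU can still be inside the preempt-disabled
     * loop with a reference to the ops.
     */
    schedule_on_each_cpu(ftrace_sync);

    /* Now it is safe to free */
    kfree(ops);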

>
> > > ftrace_save_ops_tramp_hash():
> > >
> > > 	do_for_each_ftrace_rec(pg, rec) {
> > > 		if (ftrace_rec_count(rec) == 1 &&
> > > 		    ftrace_ops_test(ops, rec->ip, rec)) {
> > >
> > > 			/* This record had better have a trampoline */
> > > 			if (FTRACE_WARN_ON(!(rec->flags & FTRACE_FL_TRAMP_EN)))
> > > 				return -1;
> > >
> > > Yes, but I can't understand how this can work.
> >
> > I wanted to handle the back-to-one case after we get the up-to-one case
> > working. That is, I don't want to worry about it now ;-) As you can
> > see, this code has enough things to keep straight without adding
> > more complexity to the mix.
>
> Yes, I see... but note that this WARN_ON() looks wrong in any case. At
> least currently.

Yeah, in my linux-next repo it quite possibly can be. I'm looking into
fixing that now.

Thanks!

-- Steve
