Re: [RFC PATCH -tip 0/5] kprobes: Abolish jprobe APIs

From: Steven Rostedt
Date: Mon Oct 09 2017 - 12:20:45 EST


On Mon, 9 Oct 2017 09:33:50 -0600
Jonathan Corbet <corbet@xxxxxxx> wrote:

> On Fri, 6 Oct 2017 11:34:30 -0400
> Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> > On Fri, 6 Oct 2017 13:49:59 +0900
> > Masami Hiramatsu <mhiramat@xxxxxxxxxx> wrote:
> >
> > > Steve, could you write a documentation how to use ftrace callback?
> > > I think I should update the Documentation/kprobes.txt so that jprobe
> > > user can easily migrate on that.
> >
> > I decided to do this now. Here's a first draft. What do you think?
>
> An overall request: this document is already 99% in the RST format. It
> would be nice to just do it in RST to begin with and save somebody the
> effort of finishing the job later.

I guess that means I need to learn RST formatting ;-)

>
> > -- Steve
> >
> > Using ftrace to hook to functions
> > =================================
> >
> > Copyright 2017 VMware Inc.
> > Author: Steven Rostedt <srostedt@xxxxxxxxxxx>
> > License: The GNU Free Documentation License, Version 1.2
> > (dual licensed under the GPL v2)
> >
> > Written for: 4.14
>
> Experience says that strings like this go obsolete in a hurry, so they
> don't really reflect the state of the document. The git history tends to
> be rather more reliable.

I actually use that for my own reference. I look at that number and it
tells me (oh crap, I need to update this).

Note, when I do an update, I will add:

Updated for: v5.0

Again, it's more for me, as I don't do a git log on it to see when I
last updated the document. It's usually me referencing a document and
noticing the number that's very old.

>
> > Introduction
> > ------------
> >
> > The ftrace infrastructure was originially created to attach hooks to the
> > beginning of functions in order to record and trace the flow of the kernel.
> > But hooks to the start of a function can have other use cases. Either
> > for live kernel patching, or for security monitoring. This document describes
> > how to use ftrace to implement your own function hooks.
> >
> >
> > The ftrace context
> > ==================
> >
> > WARNING: The ability to add a callback to almost any function within the
> > kernel comes with risks. A callback can be called from any context
> > (normal, softirq, irq, and NMI). Callbacks can also be called just before
> > going to idle, during CPU bring up and takedown, or going to user space.
> > This requires extra care to what can be done inside a callback. A callback
> > can be called outside the protective scope of RCU.
> >
> > The ftrace infrastructure has some protections agains recursions and RCU
> > but one must still be very careful how they use the callbacks.
> >
> >
> > The ftrace_ops structure
> > ========================
> >
> > To register a function callback, a ftrace_ops is required.
>
> It would be nice to say which file should be #include'd to get these
> declarations.

Sure.

/me thinks... I think that would be

#include <linux/ftrace.h>

Will add (after testing).

>
> > This structure
> > is used to tell ftrace what function should be called as the callback
> > as well as what protections the callback will perform and not require
> > ftrace to handle.
> >
> > There are only two fields that are needed to be set when registering
> > an ftrace_ops with ftrace. The rest should be NULL.
> >
> > struct ftrace_ops ops = {
> > .func = my_callback_func,
> > .flags = MY_FTRACE_FLAGS
> > .private = any_private_data_structure,
> > };
> >
> > Both .flags and .private are optional. Only .func is required.
>
> So...only two fields are needed, but three are shown, and only one is
> required. Dumb people like me are easily confused by such things...

Ah, no that was a mistake. Thanks for catching it. It should have said,
only one field is required.

>
> > To enable tracing call:
> >
> > register_ftrace_function(&ops);
> >
> > To disable tracing call:
> >
> > unregister_ftrace_function(@ops);
>
> I would presume that the callback can be called immediately, perhaps even
> before register_ftrace_function() returns, right? It never hurts to warn
> readers that they need to be ready from the outset.

Yes. I'll add that. I'll also state that it should be safe to assume
that after unregister_ftrace_function() returns, the ops.func will no
longer be called. The unregister_ftrace_function() goes to great pains
to guarantee that. But I should also say that, in order to do that,
unregister_ftrace_function() can take a while to execute (it does a lot
of synchronization to make sure all callbacks have finished).

>
> >
> > The callback function
> > =====================
> >
> > The prototype of the callback function is as follows (as of v4.14):
> >
> > void callback_func(unsigned long ip, unsigned long parent_ip,
> > struct ftrace_ops *op, struct pt_regs *regs);
> >
> > @ip - This is the instruction pointer of the function that is being traced.
> > (where the fentry or mcount is within the function)
> >
> > @parent_ip - This is the instruction pointer of the function that called the
> > the function being traced (where the call of the function occurred).
> >
> > @op - This is a pointer to ftrace_ops that was used to register the callback.
> > This can be used to pass data to the callback via the private pointer.
> >
> > @regs - If the FTRACE_OPS_FL_SAVE_REGS or FTRACE_OPS_FL_SAVE_REGS_IF_SUPPORTED
> > flags are set in the ftrace_ops structure, then this will be pointing
> > to the pt_regs structure like it would be if an breakpoint was placed
> > at the start of the function where ftrace was tracing. Otherwise it
> > either contains garbage, or NULL.
> >
> >
> > The ftrace FLAGS
> > ================
> >
> > The ftrace_ops flags are all defined and documented in include/linux/ftrace.h.
> > Some of the flags are used for internal infrastructure of ftrace, but the
> > ones that users should be aware of are the following:
> >
> > (All of these are prefixed with FTRACE_OPS_FL_)
>
> I hate to say it, but you'll reduce the cognitive load on the reader if you
> just spell the flags out.

You mean for each one? OK. Personally I prefer not seeing the duplicate
prefix, but I trust your judgment on this one.

>
> > PER_CPU - When set, the callback can be enabled or disabled per cpu with the
> > following functions:
> >
> > void ftrace_function_local_enable(struct ftrace_ops *ops);
> > void ftrace_function_local_disable(struct ftrace_ops *ops);
> >
> > These two functions must be called with preemption disabled.
>
> s/per cpu/per-CPU/
>
> More importantly, though, what does this actually mean? Presumably that
> the callback is only called if the function is hit on an enabled CPU? It
> seems you can only enable/disable the local CPU? Is the default state
> enabled or disabled? Inquiring minds want to know.

Heh, I don't know the answer to that. I'll have to go look. This was
added for perf, as perf likes to have more fine grain control.


>
> > SAVE_REGS - If the callback requires reading or modifying the pt_regs
> > passed to the callback, then it must set this flag. Registering
> > a ftrace_ops with this flag set on an architecture that does not
> > support passing of pt_regs to the callback, will fail.
>
> Drop that last comma :)

OK.

>
> > SAVE_REGS_IF_SUPPORTED - Similar to SAVE_REGS but the registering of a
> > ftrace_ops on an architecture that does not support passing of regs
> > will not fail with this flag set. But the callback must check if
> > regs is NULL or not to determine if the architecture supports it.
> >
> > RECURSION_SAFE - By default, a wrapper is added around the callback to
> > make sure that recursion of the function does not occur. That is
> > if a function within the callback itself is also traced, ftrace
>
> s/within the/called by the/

I put in "within" because it is usually a function that is nested
within a function called by the callback. This bug has come up with
"gotchas", where some function that the callback calls has a path to a
function that was unexpectedly traced.

The issue hasn't been caused by a function being traced that was
directly called by the callback. It is usually some deeper nested
function.

I don't want to limit it to just checking functions that the callback
calls. Thoughts on how to stress this?


>
> > will prevent the callback from being called again. But this wrapper
> > adds some overhead, and if the callback is safe from recursion,
> > it can set this flag to disable the ftrace protection.
>
> What happens if you mark a function safe and it recurses anyway? Say, if
> somebody else has hooked a different function unknown to the current
> caller?

It crashes the box ;-)


>
> > IPMODIFY - Requires SAVE_REGS set. If the callback is to "hijack" the
> > traced function (have another function called instead of the traced
> > function), it requires setting this flag. This is what live kernel
> > patches uses. Without this flag the pt_regs->ip can not be modified.
> > Note, only one ftrace_ops with IPMODIFY set may be registered to
> > any given function at a time.
>
> I assume this requires being able to get the regs too?

Yes, this is why I stated "Requires SAVE_REGS" which would pass the
regs to the callback. Should I rewrite that somehow?

>
> > RCU - If this is set, then the callback will only be called by functions
> > where RCU is "watching". This is required if the callback function
> > performs any rcu_read_lock() operation.
>
> What does that mean, exactly? When might RCU not be watching?

Hmm, I thought I stated this someplace else, but I'll state it here
again too.

RCU is not watching when the CPU goes idle or goes offline, and when it
is coming back on line. Also in some cases, RCU is not watching
switching from kernel to user space and back again. If a function is
called that is traced while RCU is in the process of switching between
any of these transactions, it may not be watching (no RCU protection
with any of the RCU synchronize functions).

>
>
> > Filtering what functions to trace
>
> s/what/which/

OK

>
> > =================================
> >
> > If a callback is only to be called from specific functions, a filter must be
> > set up. The filters are added by name, or ip if it is known.
> >
> > int ftrace_set_filter(struct ftrace_ops *ops, unsigned char *buf,
> > int len, int reset);
> >
> > @ops - the ops to set the filter with
> > @buf - the string that holds the function filter text.
> > @len - the length of the string.
> > @reset - non zero to reset all filters before applying this filter.
> >
> > Filters denote which functions should be enabled when tracing is enabled.
> > If @buf is NULL and reset is set, all functions will be enabled for tracing.
>
> ...maybe a pointer to documentation on the full filter-string syntax?
>
> > The @buf can also be a glob expression to enable all functions that
> > match a specific pattern.
> >
> > To just trace the schedule function:
> >
> > ret = ftrace_set_filter(&ops, "schedule", strlen("schedule"), 0);
> >
> > To add more functions, call the ftrace_set_filter() more than once with the
> > @reset parameter set to zero. To remove the current filter and replace it
> > with new functions to trace, have @reset be non zero.
>
> non-zero

OK

>
> > Sometimes more than one function has the same name. To trace just a specific
> > function in this case, ftrace_set_filter_ip() can be used.
> >
> > ret = ftrace_set_filter_ip(&ops, ip, 0, 0);
> >
> > Although the ip must be the address where the call to fentry or mcount is
> > located in the function.
>
> How do I get that if I know I want to trace foo() in bar.c?

There's no easy answer here. The current users of this is live patching
and kprobes, where the address is given directly by the subsystem using
it, and it doesn't need to look up by name.

>
> > If a glob is used to set the filter, to remove unwanted matches the
> > ftrace_set_notrace() can also be used.
> >
> > int ftrace_set_notrace(struct ftrace_ops *ops, unsigned char *buf,
> > int len, int reset);
> >
> > This takes the same parameters as ftrace_set_filter() but will add the
> > functions it finds to not be traced. This doesn't remove them from the
> > filter itself, but keeps them from being traced. If @reset is set,
> > the filter is cleaded but the functions that match @buf will still not
>
> cleaded? :)

Hmm, I'll have to be more descriptive.

>
> > be traced (the callback will not be called on those functions).
>
> So how do you clead the "notrace" list?

With passing in reset non-zero. I'll add that.

Thanks for reviewing!

-- Steve