Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure

From: Frederic Weisbecker
Date: Mon Mar 16 2009 - 19:47:19 EST


On Mon, Mar 16, 2009 at 06:18:00PM -0400, Frank Ch. Eigler wrote:
> Hi -
>
>
> On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
>
> > [...]
> > > As far as I know, utrace supports multiple trace-engines on a process.
> > > Since ptrace is just an engine of utrace, you can add another engine on utrace.
> > >
> > > utrace-+-ptrace_engine---owner_process
> > >        |
> > >        +-systemtap_module
> > >        |
> > >        +-ftrace_plugin
>
> Right. In this way, utrace is simply a multiplexing intermediary.
>
>
> > > Here, Frank had posted an example of a utrace->ftrace engine.
> > > http://lkml.org/lkml/2009/1/27/294
> > >
> > > And here is his latest patch (which seems to support syscall tracing...):
> > > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> > >
> >
> > Reminder: we are looking at system-wide tracing here. Here are some
> > comments about the current utrace implementation.
> >
> > Looking at include/linux/utrace.h from the tree
> >
> > 17 * A tracing engine starts by calling utrace_attach_task() or
> > 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> > 19 * (&struct utrace_engine_ops), and some associated data. This produces a
> > 20 * &struct utrace_engine, which is the handle used for all other
> > 21 * operations. An attached engine has its ops vector, its data, and an
> > 22 * event mask controlled by utrace_set_events().
> >
> > So if the system has, say, 3000 threads, then we have 3000 struct
> > utrace_engine instances created? I wonder what effect this could have on
> > cachelines if this is used to trace hot paths like system call
> > entry/exit. Have you benchmarked this kind of scenario under tbench?
>
> It has not been a problem, since utrace_engines are designed to be
> lightweight. Starting or stopping a systemtap script of the form
>
> probe process.syscall {}
>
> appears to have no noticeable impact on a tbench suite.
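
For readers following along: the attach sequence described in the header
comment above looks roughly like the sketch below. The exact prototypes
changed between utrace revisions, so the signatures here are paraphrased
from the quoted tree, not authoritative.

	#include <linux/utrace.h>
	#include <linux/sched.h>

	/* One shared ops vector; each attached thread still gets its
	 * own struct utrace_engine handle. */
	static u32 my_report_syscall_entry(u32 action,
					   struct utrace_engine *engine,
					   struct pt_regs *regs)
	{
		/* ... record the event ... */
		return UTRACE_RESUME;	/* let the thread run on */
	}

	static const struct utrace_engine_ops my_ops = {
		.report_syscall_entry = my_report_syscall_entry,
	};

	static int attach_to(struct task_struct *task)
	{
		struct utrace_engine *engine;

		engine = utrace_attach_task(task, UTRACE_ATTACH_CREATE,
					    &my_ops, NULL);
		if (IS_ERR(engine))
			return PTR_ERR(engine);

		/* The thread only leaves its fast path once an event
		 * bit is actually set. */
		return utrace_set_events(task, engine,
					 UTRACE_EVENT(SYSCALL_ENTRY));
	}

The per-thread struct utrace_engine returned by utrace_attach_task() is
the allocation the 3000-thread question above is about.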
>
>
> > 24 * For each event bit that is set, that engine will get the
> > 25 * appropriate ops->report_*() callback when the event occurs. The
> > 26 * &struct utrace_engine_ops need not provide callbacks for an event
> > 27 * unless the engine sets one of the associated event bits.
> >
> > Looking at utrace_set_events(), we seem to be limited to 32 events on
> > 32-bit architectures because it uses a bitmask? Isn't that a bit small?
>
> There are only a few types of thread events that involve different
> classes of treatment, or different degrees of freedom in terms of
> interference with the uninstrumented fast path of the threads.
>
> For example, it does not make sense to have different flag bits for
> different system calls, since choosing to trace *any* system call
> involves taking the thread off of the fast path with the TIF_ flag.
> Once it's off the fast path, it doesn't matter whether the utrace core
> or some client performs syscall discrimination, so it is left to the
> client.
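
To make "left to the client" concrete: the engine's syscall-entry
callback can bail out on uninteresting calls itself. A sketch only, with
the callback signature paraphrased as above; syscall_get_nr() is the
standard asm/syscall.h accessor:

	#include <asm/syscall.h>
	#include <linux/unistd.h>

	static u32 my_report_syscall_entry(u32 action,
					   struct utrace_engine *engine,
					   struct pt_regs *regs)
	{
		/* utrace reported *a* syscall; the engine decides
		 * which ones it actually cares about. */
		if (syscall_get_nr(current, regs) != __NR_open)
			return UTRACE_RESUME;

		/* ... trace the open() call ... */
		return UTRACE_RESUME;
	}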
>
>
> > 682 /**
> > 683 * utrace_set_events_pid - choose which event reports a tracing engine gets
> > 684 * @pid: thread to affect
> > 685 * @engine: attached engine to affect
> > 686 * @eventmask: new event mask
> > 687 *
> > 688 * This is the same as utrace_set_events(), but takes a &struct pid
> > 689 * pointer rather than a &struct task_struct pointer. The caller must
> > 690 * hold a ref on @pid, but does not need to worry about the task
> > 691 * staying valid. If it's been reaped so that @pid points nowhere,
> > 692 * then this call returns -%ESRCH.
> >
> >
> > Comments like "but does not need to worry about the task staying valid"
> > do not make me feel safe and comfortable at all. Could you explain how
> > you can assume that dereferencing an "invalid" pointer will return
> > NULL?
>
> (We're doing a final round of "internal" (pre-LKML) reviews of the
> utrace implementation right now on utrace-devel@xxxxxxxxxx, where such
> comments get fastest attention from the experts.)
>
> For this particular issue, the utrace documentation file explains the
> liveness rules for the various pointers that can be fed to or received
> from utrace functions. This is not about "feeling" safe, it's about
> what the mechanism is deliberately designed to permit.
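
A usage sketch of the pid-based variant, to make the liveness rule
concrete. find_get_pid()/put_pid() are the standard pid helpers; the
utrace call is used as documented in the excerpt above:

	struct pid *pid = find_get_pid(tid);
	int ret;

	if (!pid)
		return -ESRCH;

	/* Safe even if the task is reaped meanwhile: we hold a ref on
	 * the struct pid, and utrace returns -ESRCH in that case. */
	ret = utrace_set_events_pid(pid, engine,
				    UTRACE_EVENT(SYSCALL_EXIT));
	put_pid(pid);
	return ret;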
>
>
> > About the utrace_attach_task() :
> >
> > 244 if (unlikely(target->flags & PF_KTHREAD))
> > 245 /*
> > 246 * Silly kernel, utrace is for users!
> > 247 */
> > 248 return ERR_PTR(-EPERM);
> >
> > So we cannot trace kernel threads?
>
> I'm not quite sure about all the reasons for this, but I believe that
> kernel threads don't tend to engage in job control / signal /
> system-call activities the same way as normal user threads do.
>


Some of them do use syscalls, but that doesn't involve a user/kernel switch,
so it's not traceable by hooking syscall entry/exit or by using tracehooks.
It would require specific hooks on the sys_* functions for that.

So this check is right (setting a flag in each kernel thread's thread_info
would be somewhat costly, so it's better avoided, as it is here).
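
For what it's worth, specific hooks on the sys_* functions are already
possible with kprobes, which catch in-kernel callers too, since no
user/kernel transition (and hence no TIF flag check) is involved. An
untested sketch, using sys_open as an arbitrary example (the exact
symbol name can vary between kernel versions):

	#include <linux/module.h>
	#include <linux/kprobes.h>
	#include <linux/sched.h>

	static int sys_open_pre(struct kprobe *p, struct pt_regs *regs)
	{
		pr_debug("sys_open hit by %s (pid %d)\n",
			 current->comm, task_pid_nr(current));
		return 0;
	}

	static struct kprobe kp = {
		.symbol_name	= "sys_open",
		.pre_handler	= sys_open_pre,
	};

	static int __init hook_init(void)
	{
		return register_kprobe(&kp);
	}

	static void __exit hook_exit(void)
	{
		unregister_kprobe(&kp);
	}

	module_init(hook_init);
	module_exit(hook_exit);
	MODULE_LICENSE("GPL");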

Frederic.


> > 118 /*
> > 119 * Called without locks, when we might be the first utrace engine to attach.
> > 120 * If this is a newborn thread and we are not the creator, we have to wait
> > 121 * for it. The creator gets the first chance to attach. The PF_STARTING
> > 122 * flag is cleared after its report_clone hook has had a chance to run.
> > 123 */
> > 124 static inline int utrace_attach_delay(struct task_struct *target)
> > 125 {
> > 126 if ((target->flags & PF_STARTING) && target->real_parent != current)
> > 127 do {
> > 128 schedule_timeout_interruptible(1);
> > 129 if (signal_pending(current))
> > 130 return -ERESTARTNOINTR;
> > 131 } while (target->flags & PF_STARTING);
> > 132
> > 133 return 0;
> > 134 }
> >
> > Why do we absolutely have to poll until the thread has started?
>
> (I don't know off the top of my head - Roland?)
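
For comparison, a non-polling version would sleep on a wait queue until
PF_STARTING is cleared. Purely hypothetical: it assumes a wakeup is
added at the point where the clone path clears PF_STARTING, which the
current code does not have (possibly the very reason it polls):

	static DECLARE_WAIT_QUEUE_HEAD(utrace_start_wait);

	static inline int utrace_attach_delay(struct task_struct *target)
	{
		if (!(target->flags & PF_STARTING) ||
		    target->real_parent == current)
			return 0;

		/* Needs a wake_up(&utrace_start_wait) where PF_STARTING
		 * is cleared after report_clone has run. */
		if (wait_event_interruptible(utrace_start_wait,
				!(target->flags & PF_STARTING)))
			return -ERESTARTNOINTR;

		return 0;
	}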
>
>
> > utrace_add_engine()
> > set_notify_resume(target);
> >
> > ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> > that it is set asynchronously with respect to the execution of the
> > target thread (as I do with my TIF_KERNEL_TRACE thread flag).
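
For reference, that helper is essentially the following
(include/linux/tracehook.h of this era, quoted from memory, so treat it
as approximate):

	static inline void set_notify_resume(struct task_struct *task)
	{
		if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
			kick_process(task);
	}

Note that kick_process() only IPIs a task currently running on a CPU;
for a task blocked in a syscall it does nothing, which is exactly the
window the next paragraph asks about.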
> >
> > However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> > entry_64.S
> >
> > int_signal:
> > and
> > retint_signal:
> >
> > code paths. However, if there is no syscall tracing to do upon syscall
> > entry, the thread flags are not re-read at syscall exit, so you will
> > miss the syscall exit of your target thread if that thread was blocked
> > while you set its TIF_NOTIFY_RESUME. Or is it handled in some subtle
> > way I did not figure out? BTW, re-reading the TIF flags from the
> > thread_info at syscall exit on the fast path is out of the question
> > because it considerably degrades kernel performance. entry_*.S is a
> > very, very critical path.
>
> (I don't know off the top of my head - Roland?)
>
>
> - FChE
