Re: [RFC][PATCH 1/2] tracing/ftrace: syscall tracing infrastructure
From: Frank Ch. Eigler
Date: Mon Mar 16 2009 - 18:18:54 EST
Hi -
On Mon, Mar 16, 2009 at 05:45:26PM -0400, Mathieu Desnoyers wrote:
> [...]
> > As far as I know, utrace supports multiple trace-engines on a process.
> > Since ptrace is just one engine of utrace, you can add another engine on top of utrace.
> >
> > utrace-+-ptrace_engine---owner_process
> > |
> > +-systemtap_module
> > |
> > +-ftrace_plugin
Right. In this way, utrace is simply a multiplexing intermediary.
> > Here, Frank posted an example of a utrace->ftrace engine:
> > http://lkml.org/lkml/2009/1/27/294
> >
> > And here is his latest patch (which seems to support syscall tracing...):
> > http://git.kernel.org/?p=linux/kernel/git/frob/linux-2.6-utrace.git;a=blob;f=kernel/trace/trace_process.c;h=619815f6c2543d0d82824139773deb4ca460a280;hb=ab20efa8d8b5ded96e8f8c3663dda3b4cb532124
> >
>
> Reminder: we are looking at system-wide tracing here. Here are some
> comments about the current utrace implementation.
>
> Looking at include/linux/utrace.h from the tree
>
> 17 * A tracing engine starts by calling utrace_attach_task() or
> 18 * utrace_attach_pid() on the chosen thread, passing in a set of hooks
> 19 * (&struct utrace_engine_ops), and some associated data. This produces a
> 20 * &struct utrace_engine, which is the handle used for all other
> 21 * operations. An attached engine has its ops vector, its data, and an
> 22 * event mask controlled by utrace_set_events().
>
> So if the system has, say, 3000 threads, then we end up with 3000 struct
> utrace_engine instances created? I wonder what effect this could have on
> cachelines if this is used to trace hot paths like system call
> entry/exit. Have you benchmarked this kind of scenario under tbench?
It has not been a problem, since utrace_engines are designed to be
lightweight. Starting or stopping a systemtap script of the form
probe process.syscall {}
appears to have no noticeable impact on a tbench suite.
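For concreteness, attaching such an engine looks roughly like the sketch
below. It follows the API text quoted above (utrace_attach_task(),
utrace_set_events(), an ops vector); the exact callback signature and the
UTRACE_ATTACH_CREATE / UTRACE_EVENT() names are from my recollection of the
utrace tree of that era and may differ in detail, so treat it as an
illustration rather than a verbatim excerpt:

#include <linux/utrace.h>
#include <linux/sched.h>
#include <linux/err.h>

/* Report syscall entries for one traced task; a real engine would
 * hand the event to its consumer (e.g. an ftrace plugin) here. */
static u32 demo_report_syscall_entry(u32 action,
				     struct utrace_engine *engine,
				     struct task_struct *task,
				     struct pt_regs *regs)
{
	return UTRACE_RESUME;	/* let the thread continue normally */
}

static const struct utrace_engine_ops demo_ops = {
	.report_syscall_entry = demo_report_syscall_entry,
};

static int demo_attach(struct task_struct *target)
{
	struct utrace_engine *engine;

	/* One lightweight engine per attached thread. */
	engine = utrace_attach_task(target, UTRACE_ATTACH_CREATE,
				    &demo_ops, NULL);
	if (IS_ERR(engine))
		return PTR_ERR(engine);

	/* Only events whose bits are set produce callbacks. */
	return utrace_set_events(target, engine,
				 UTRACE_EVENT(SYSCALL_ENTRY));
}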
> 24 * For each event bit that is set, that engine will get the
> 25 * appropriate ops->report_*() callback when the event occurs. The
> 26 * &struct utrace_engine_ops need not provide callbacks for an event
> 27 * unless the engine sets one of the associated event bits.
>
> Looking at utrace_set_events(), we seem to be limited to 32 events on
> 32-bit architectures because it uses a bitmask? Isn't that a bit small?
There are only a few types of thread events that involve different
classes of treatment, or different degrees of freedom in terms of
interference with the uninstrumented fast path of the threads.
For example, it does not make sense to have different flag bits for
different system calls, since choosing to trace *any* system call
involves taking the thread off the fast path with the TIF_ flag.
Once it's off the fast path, it doesn't matter whether the utrace core
or some client performs syscall discrimination, so it is left to the
client.
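As an illustration, a client that only cares about a handful of syscalls
would filter inside its own callback, along these lines (a sketch:
syscall_get_nr() is the generic asm/syscall.h accessor, __NR_open is just an
arbitrary example, and the callback signature again follows my recollection
of the utrace tree of that era):

#include <asm/syscall.h>
#include <asm/unistd.h>

/* Client-side syscall discrimination: the utrace core already took the
 * thread off the fast path, so the engine decides what is interesting. */
static u32 demo_filtered_syscall_entry(u32 action,
				       struct utrace_engine *engine,
				       struct task_struct *task,
				       struct pt_regs *regs)
{
	int nr = syscall_get_nr(task, regs);

	if (nr == __NR_open) {
		/* record/emit the event for this one syscall */
	}
	return UTRACE_RESUME;	/* everything else passes through */
}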
> 682 /**
> 683 * utrace_set_events_pid - choose which event reports a tracing engine gets
> 684 * @pid: thread to affect
> 685 * @engine: attached engine to affect
> 686 * @eventmask: new event mask
> 687 *
> 688 * This is the same as utrace_set_events(), but takes a &struct pid
> 689 * pointer rather than a &struct task_struct pointer. The caller must
> 690 * hold a ref on @pid, but does not need to worry about the task
> 691 * staying valid. If it's been reaped so that @pid points nowhere,
> 692 * then this call returns -%ESRCH.
>
>
> Comments like "but does not need to worry about the task staying valid"
> do not make me feel safe and comfortable at all. Could you explain
> how you can assume that dereferencing an "invalid" pointer will return
> NULL?
(We're doing a final round of "internal" (pre-LKML) reviews of the
utrace implementation right now on utrace-devel@xxxxxxxxxx, where such
comments get the fastest attention from the experts.)
For this particular issue, the utrace documentation file explains the
liveness rules for the various pointers that can be fed to or received
from utrace functions. This is not about "feeling" safe, it's about
what the mechanism is deliberately designed to permit.
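For example, a caller using the pid-based variant holds only a struct pid
reference and simply treats -ESRCH as "already reaped". A minimal sketch,
assuming the signature implied by the kerneldoc quoted above
(enable_syscall_reports() is just an illustrative name):

#include <linux/pid.h>
#include <linux/utrace.h>

static int enable_syscall_reports(struct task_struct *task,
				  struct utrace_engine *engine)
{
	/* Hold a reference on the struct pid, not on the task itself. */
	struct pid *pid = get_task_pid(task, PIDTYPE_PID);
	int err;

	err = utrace_set_events_pid(pid, engine,
				    UTRACE_EVENT(SYSCALL_ENTRY) |
				    UTRACE_EVENT(SYSCALL_EXIT));
	/* -ESRCH just means the task has been reaped: nothing left to
	 * trace, and no use-after-free to worry about. */
	put_pid(pid);
	return err;
}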
> About the utrace_attach_task() :
>
> 244 if (unlikely(target->flags & PF_KTHREAD))
> 245 /*
> 246 * Silly kernel, utrace is for users!
> 247 */
> 248 return ERR_PTR(-EPERM);
>
> So we cannot trace kernel threads?
I'm not quite sure about all the reasons for this, but I believe that
kernel threads don't tend to engage in job control / signal /
system-call activities the same way as normal user threads do.
> 118 /*
> 119 * Called without locks, when we might be the first utrace engine to attach.
> 120 * If this is a newborn thread and we are not the creator, we have to wait
> 121 * for it. The creator gets the first chance to attach. The PF_STARTING
> 122 * flag is cleared after its report_clone hook has had a chance to run.
> 123 */
> 124 static inline int utrace_attach_delay(struct task_struct *target)
> 125 {
> 126 if ((target->flags & PF_STARTING) && target->real_parent != current)
> 127 do {
> 128 schedule_timeout_interruptible(1);
> 129 if (signal_pending(current))
> 130 return -ERESTARTNOINTR;
> 131 } while (target->flags & PF_STARTING);
> 132
> 133 return 0;
> 134 }
>
> Why do we absolutely have to poll until the thread has started?
(I don't know off the top of my head - Roland?)
> utrace_add_engine()
> set_notify_resume(target);
>
> ok, so this is where the TIF_NOTIFY_RESUME thread flag is set. I notice
> that it is set asynchronously with the execution of the target thread
> (as I do with my TIF_KERNEL_TRACE thread flag).
>
> However, on x86_64, _TIF_DO_NOTIFY_MASK is only tested in
> entry_64.S
>
> int_signal:
> and
> retint_signal:
>
> code paths. However, if there is no syscall tracing to do upon syscall
> entry, the thread flags are not re-read at syscall exit, so you will
> miss the syscall-exit event from your target thread if this thread
> was blocked while you set its TIF_NOTIFY_RESUME. Or is it handled in
> some subtle way I did not figure out? BTW, re-reading the TIF flags from
> the thread_info at syscall exit on the fast path is out of the question
> because it considerably degrades kernel performance. entry_*.S is
> a very, very critical path.
(I don't know off the top of my head - Roland?)
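(For reference, the helper in question is roughly the following, as it looked
in include/linux/tracehook.h around 2.6.29: it only sets the flag and kicks
the task, so whether the syscall-exit path re-reads the flags is exactly the
open question above.)

/* Roughly the 2.6.29-era implementation; shown only to illustrate the
 * asynchronous flag setting being discussed, not quoted verbatim. */
static inline void set_notify_resume(struct task_struct *task)
{
	if (!test_and_set_tsk_thread_flag(task, TIF_NOTIFY_RESUME))
		kick_process(task);
}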
- FChE
--