Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler

From: Frederic Weisbecker
Date: Fri Mar 30 2012 - 08:07:07 EST

On Thu, Mar 29, 2012 at 03:40:17PM -0700, David Sharp wrote:
> On Thu, Mar 29, 2012 at 1:06 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> > I had a long discussion with Frederic over IRC earlier today.  We came
> > up with the following strawman:
> >
> > 1. A system call thunk (which could be enabled/disabled by patching the
> > syscall table.)  This provides an entry and exit hook, and also sets a
> > per-thread flag to capture userspace traffic.
> Our goal is for syscall traces to be as fast as regular tracepoints.
> IIRC, what we've found is that much of the extra overhead of syscall
> tracepoints compared to regular tracepoints comes from the syscall
> tracing code path being bundled with checks for ptrace and other
> stuff (Vaibhav did all this characterization, he can jump in with
> details if wanted). How much work would this "thunk" have to do that
> is not either recording the trace or calling the syscall?
> >
> > 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> > capture traffic to userspace.  This captures the *full* set of system
> > call arguments, including things addressed via pointers.  Furthermore,
> > it captures the exact versions fed to or returned from the kernel, and
> > deals with data-dependent collection like ioctl().
> Do I understand correctly that you are thinking of copying the contents
> of those buffers into the ring buffer? This sounds useful. However I
> think it should be optional and the number of bytes copied should be
> limited (tunable). On highly utilized systems, we don't always have a
> lot of memory to dedicate to the ring buffer, so filling it with the
> contents of, eg, the payload of "read" or "write" would not be
> acceptable under those circumstances. And since events in the ring
> buffer can't cross page boundaries, at some threshold this will cause
> an unacceptable level of unutilized space in the ring buffer.
> (For context, this is coming from the folks that added "tiny" versions
> of syscall tracepoints that only put 16 bits of arg0 into the ring
> buffer so we can get longer trace durations.)
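
To put a rough number on the page-boundary point quoted above, here is a
minimal sketch. It assumes fixed-size events and a caller-supplied usable
payload per buffer page; wasted_bytes_per_page() and its parameters are
hypothetical names, not kernel API, and the real ring buffer also spends
space on page and event headers that this ignores.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes left unused at the end of one buffer page when events of a fixed
 * size may not straddle a page boundary.
 *
 * page_payload: usable bytes per page (assumed value, supplied by caller)
 * event_size:   size of each event in bytes (assumed constant)
 */
static size_t wasted_bytes_per_page(size_t page_payload, size_t event_size)
{
	/* An event that is empty or larger than the page never fits. */
	if (event_size == 0 || event_size > page_payload)
		return page_payload;

	/* Whole events are packed back to back; the remainder is lost. */
	return page_payload % event_size;
}
```

With an illustrative payload of 4096 bytes, 1500-byte events leave 1096
bytes unused per page, and an event of 2049 bytes leaves 2047, close to
half the page.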

BTW, since tracing overhead (in terms of volume and throughput) is
important for you guys, have you considered adding some option to ftrace
to ignore the "common" fields on the trace record:

field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int common_padding; offset:8; size:4; signed:1;

I think you talked about that on the last kernel summit. This would be
interesting for everyone.

You can recover the pid from sched switch events. The rest is probably
useless most of the time.
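
As a rough illustration of what those fields cost, here is a sketch of a
userspace mirror of the common header listed above. The struct name and
the helper common_header_bytes() are hypothetical; only the field names,
offsets and sizes come from the format listing, and the static assertions
check that the compiler's layout matches it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mirror of the "common" fields from the format listing. */
struct common_hdr {
	unsigned short common_type;          /* offset 0, size 2 */
	unsigned char  common_flags;         /* offset 2, size 1 */
	unsigned char  common_preempt_count; /* offset 3, size 1 */
	int            common_pid;           /* offset 4, size 4 */
	int            common_padding;       /* offset 8, size 4 */
};

/* Fail the build if the layout differs from the listed offsets. */
_Static_assert(offsetof(struct common_hdr, common_flags) == 2, "flags");
_Static_assert(offsetof(struct common_hdr, common_preempt_count) == 3, "preempt");
_Static_assert(offsetof(struct common_hdr, common_pid) == 4, "pid");
_Static_assert(offsetof(struct common_hdr, common_padding) == 8, "padding");
_Static_assert(sizeof(struct common_hdr) == 12, "header size");

/* Total bytes spent on common fields across n_events trace records. */
static size_t common_header_bytes(size_t n_events)
{
	return n_events * sizeof(struct common_hdr);
}
```

At 12 bytes per record, one million events carry 12,000,000 bytes of
common fields before any event payload is counted.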