Re: [PATCH 4/6] trace: trace syscall in its handler not from ptrace handler

From: David Sharp
Date: Thu Mar 29 2012 - 18:40:47 EST


On Thu, Mar 29, 2012 at 1:06 PM, H. Peter Anvin <hpa@xxxxxxxxx> wrote:
> I had a long discussion with Frederic over IRC earlier today.  We came
> up with the following strawman:
>
> 1. A system call thunk (which could be enabled/disabled by patching the
> syscall table.)  This provides an entry and exit hook, and also sets a
> per-thread flag to capture userspace traffic.

Our goal is for syscall traces to be as fast as regular tracepoints.
IIRC, what we've found is that much of the extra overhead of syscall
tracepoints compared to regular tracepoints comes from the syscall
tracing code path being bundled with checks for ptrace and other
stuff (Vaibhav did all this characterization; he can jump in with
details if wanted). How much work would this "thunk" have to do that
is not either recording the trace or calling the syscall?

>
> 2. Instrumenting get_user/put_user/copy_from_user/copy_to_user to
> capture traffic to userspace.  This captures the *full* set of system
> call arguments, including things addressed via pointers.  Furthermore,
> it captures the exact versions fed to or returned from the kernel, and
> deals with data-dependent collection like ioctl().

Do I understand correctly that you are thinking of copying the contents
of those buffers into the ring buffer? This sounds useful. However, I
think it should be optional, and the number of bytes copied should be
limited (tunable). On highly utilized systems, we don't always have a
lot of memory to dedicate to the ring buffer, so filling it with, e.g.,
the payload of "read" or "write" would not be acceptable under those
circumstances. And since events in the ring buffer can't cross page
boundaries, at some threshold this will leave an unacceptable amount
of unused space in the ring buffer.
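
As a rough illustration of the page-boundary point, the sketch below
(Python, back-of-the-envelope only) computes how many bytes of a page
go unused when it is filled with fixed-size events that may not
straddle pages, and shows the kind of cap I have in mind. It assumes
4096-byte pages and ignores per-page and per-event headers;
wasted_bytes_per_page and truncate_payload are hypothetical names.

```python
PAGE_SIZE = 4096  # assumed page size; real buffer pages also carry headers


def wasted_bytes_per_page(event_size, page_size=PAGE_SIZE):
    """Bytes left unused when a page holds only whole events of one size."""
    if event_size <= 0 or event_size > page_size:
        raise ValueError("event must be between 1 byte and one page")
    return page_size % event_size


def truncate_payload(data, max_bytes):
    """Cap a captured payload at a tunable number of bytes."""
    return data[:max_bytes]


# Waste for a few event sizes, as (size, wasted bytes, wasted fraction).
report = [
    (size, wasted_bytes_per_page(size), wasted_bytes_per_page(size) / PAGE_SIZE)
    for size in (64, 1500, 2049)
]

capped = truncate_payload(b"x" * 10000, 64)
```

Under these assumptions an event just over half a page wastes nearly
half of every page, while small capped events pack with little or no
loss.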

(For context, this is coming from the folks that added "tiny" versions
of syscall tracepoints that only put 16 bits of arg0 into the ring
buffer so we can get longer trace durations.)

>
> This has to be done with extreme care to avoid introducing overhead in
> the no-tracing case, however, as these functions are extraordinarily
> performance sensitive.  This probably will require careful patching in
> the first enable/last disable case.
>
> 3. There will need to be userspace tools written to decode the resulting
> trace buffer.  This is pretty much needed anyway, but once you throw in
> complex data structures it becomes even more so.  A trace will basically
> consist of:
>
> SYSCALL_ENTRY <syscall number> <arg1..6>
> COPY_FROM_USER <address> <data>
>  ...
> COPY_TO_USER <address> <data>
>  ...
> SYSCALL_EXIT <return value>
>
> Outputting this in human-readable format requires some reasonably
> sophisticated logic, but the *HUGE* advantage is that not only is all
> the information there, it is *correct by construction*.
>
>        -hpa
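
For the decoding side, a first step might look like the Python sketch
below: it folds the flat record stream described above back into one
entry per system call. The tuple encoding of the records, the
group_by_syscall name, and the single-thread assumption are all mine;
a real trace would need per-thread demultiplexing first.

```python
def group_by_syscall(records):
    """Group a flat, single-thread record stream into one dict per syscall.

    Each record is a tuple whose first element is the record type:
      ("SYSCALL_ENTRY", nr, args)
      ("COPY_FROM_USER", address, data)
      ("COPY_TO_USER", address, data)
      ("SYSCALL_EXIT", return_value)
    Records seen outside an ENTRY/EXIT pair are ignored.
    """
    calls = []
    current = None
    for rec in records:
        kind = rec[0]
        if kind == "SYSCALL_ENTRY":
            current = {"nr": rec[1], "args": rec[2], "copies": [], "ret": None}
        elif kind in ("COPY_FROM_USER", "COPY_TO_USER"):
            if current is not None:
                current["copies"].append((kind, rec[1], rec[2]))
        elif kind == "SYSCALL_EXIT":
            if current is not None:
                current["ret"] = rec[1]
                calls.append(current)
                current = None
    return calls


# Example stream: one call that reads a buffer from userspace.
stream = [
    ("SYSCALL_ENTRY", 1, (3, 0x7F00, 5)),
    ("COPY_FROM_USER", 0x7F00, b"hello"),
    ("SYSCALL_EXIT", 5),
]
decoded = group_by_syscall(stream)
```

Pretty-printing each grouped call per syscall number would then be a
separate layer on top of this.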