Re: [RFC] Full syscall argument decode in "perf trace"

From: Denys Vlasenko
Date: Mon Sep 30 2013 - 07:34:23 EST


On Thu, Sep 26, 2013 at 9:41 AM, Denys Vlasenko
<vda.linux@xxxxxxxxxxxxxx> wrote:
> On Wed, Sep 18, 2013 at 4:33 PM, Arnaldo Carvalho de Melo
> <acme@xxxxxxxxxx> wrote:
>>> The problem: ~100 more tracepoints need to be added merely to get
>>> to the point where strace already is, wrt quality of syscall decoding.
>>> strace has nearly 300 separate custom syscall formatting functions,
>>> some of them quite complex.
>>>
>>> If we need to add syscall stopping feature (which, as I said above,
>>> will be necessary anyway IMO), then syscall decoding can be as good
>>> as strace *already*. Then, gradually more tracepoints are added
>>> to make it faster.
>>>
>>> I am thinking about going into this direction.
>>>
>>> Therefore my question should be restated as:
>>>
>>> Would perf developers accept the "syscall pausing" feature,
>>> or it won't be accepted?
>>
>> Do you have some patch for us to try?
>
> I have a patch which is a bit strace specific: it sidesteps
> the question of the synchronization between traced process
> and its tracer by using ptrace's existing method of reporting stops.
>
> This works for strace, and is very easy to implement.
> Naturally, other tracers (e.g. "perf trace") wouldn't
> want to start using ptrace! Synchronization needs
> to be done in some other way, not as a ptrace stop.
>
> For one, the stopping flag needs to be a counter, so that
> more than one tracer can use this feature concurrently.
>
> But anyway, I am attaching it.
>
> It adds a new flag, attr.sysexit_stop, which makes process stop
> at next syscall exit when this tracepoint overflows.

Here is the next iteration of the work in progress.

I added syscall masks.
This required propagating a pointer to the struct pt_regs that
holds the userspace registers from the sys_{enter,exit}
tracepoints down to the overflow handling functions, in order to
get the syscall number.
(Yes, I discovered that the pt_regs which was already there wasn't
the *userspace* one).
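
To illustrate the idea of a syscall mask, here is a minimal
userspace sketch, not code from the patch. It assumes the mask is
a plain bitmap with one bit per syscall number. All names here
(struct syscall_mask, mask_set, mask_test, SKETCH_NR_SYSCALLS) are
hypothetical; the real layout in the attached diff may differ.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical upper bound on syscall numbers; not taken from the patch. */
#define SKETCH_NR_SYSCALLS 512
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define MASK_WORDS ((SKETCH_NR_SYSCALLS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* Assumed representation: one bit per syscall number. */
struct syscall_mask {
	unsigned long bits[MASK_WORDS];
};

/* Start with no syscall selected. */
static void mask_clear_all(struct syscall_mask *m)
{
	memset(m->bits, 0, sizeof(m->bits));
}

/* Select syscall number nr (e.g. "stop on exit from this one"). */
static void mask_set(struct syscall_mask *m, unsigned int nr)
{
	assert(nr < SKETCH_NR_SYSCALLS);
	m->bits[nr / BITS_PER_WORD] |= 1UL << (nr % BITS_PER_WORD);
}

/*
 * Would be consulted by the overflow handler once it has read the
 * syscall number from the userspace pt_regs. Out-of-range numbers
 * (including negative ones) are treated as "not selected".
 */
static bool mask_test(const struct syscall_mask *m, long nr)
{
	if (nr < 0 || nr >= SKETCH_NR_SYSCALLS)
		return false;
	return (m->bits[nr / BITS_PER_WORD] >> (nr % BITS_PER_WORD)) & 1UL;
}
```

A tracer would clear the mask, set the bits for the syscalls it
cares about, and the handler would call mask_test() with the
syscall number before deciding whether to stop the task.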

The patch is tested: I have a modified version of strace
which decodes all syscalls properly, never stops on syscall
entry, and also skips the stop on a selected few syscall exits.

As I see it, the next thing to tackle is the stopping method.
(The current patch still uses my old ptrace-specific hack).

How about the following: add a per-task "pause counter".
If it is <= 0, the task is not paused. If it is > 0, the task is paused.

When an attached perf fd causes the task to pause, the counter
is incremented, a marker is written into the perf buffer,
and the task goes to sleep.

When the tracer process sees the marker, it commands the traced
process to "unpause", which decrements the counter.

Why this way?
* this allows the traced process to be paused by several tracers
at once.
* this does not need heavy-weight notifications to be sent
to tracers (unlike my current hack, which invokes the
waitpid notification machinery, the source of much of strace's
slowness).
* it might work even if the counter increment is reordered
relative to the perf marker write. If the tracer sees the marker
first, it can "unpause": decrement the counter, which goes to -1.
The task is not paused (the rule is "<= 0", not "= 0").
Then the kernel increments the counter, it's 0 now,
and the task is still not paused. (I'm not sure whether
such a property is useful, but if it is, we have it - good :)
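
The counter rule can be modelled in a few lines of userspace C.
This is only a single-threaded sketch of the proposed semantics,
not kernel code: struct task_sketch and the three helpers are
hypothetical names, and a real implementation would presumably
need an atomic counter plus an actual sleep/wakeup mechanism.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one new per-task field. */
struct task_sketch {
	int pause_count;
};

/* Kernel side: an attached perf fd asks the task to pause. */
static void task_request_pause(struct task_sketch *t)
{
	t->pause_count++;
}

/* Tracer side: the tracer saw the marker and lets the task go. */
static void task_unpause(struct task_sketch *t)
{
	t->pause_count--;
}

/* The rule from the proposal: paused only while the counter is > 0. */
static bool task_is_paused(const struct task_sketch *t)
{
	return t->pause_count > 0;
}

/* Walks through the reordered case described in the last bullet. */
static void demo_reordered_unpause(void)
{
	struct task_sketch t = { 0 };

	task_unpause(&t);		/* tracer acts first: counter is -1 */
	assert(!task_is_paused(&t));
	task_request_pause(&t);		/* late increment: counter is 0 */
	assert(!task_is_paused(&t));
}
```

With two tracers, two task_request_pause() calls leave the counter
at 2, and the task stays paused until both tracers have called
task_unpause().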

The downside is that we'd need one new field in struct task_struct.

Does this look sensible to you?

Attachment: perf_trace_stop_RFC_v2.diff
Description: Binary data