Re: [PATCH] perf record: Add snapshot mode support for perf's regular events

From: Yunlong Song
Date: Wed Nov 25 2015 - 07:45:06 EST

Next message: Wanpeng Li: "Re: [PATCH] KVM: x86: don't expose syscall/sysret to intel 32-bit guest"
Previous message: Wenwei Tao: "[PATCH 3/3] nvme: change the interface between nvme and lightnvm"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH] perf record: Add snapshot mode support for perf's regular events"
Next in thread: David Ahern: "Re: [PATCH] perf record: Add snapshot mode support for perf's regular events"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 2015/11/24 22:30, Arnaldo Carvalho de Melo wrote:

> Looks like an interesting feature, if for no other reason, to match what
> we have for aux area tracing.
>
> But perhaps having an event in the stream itself that would signal when
> to dump the snapshot would be a better approach?
>
> One that perhaps you create with 'perf probe' or a tracepoint, both
> could use the in-kernel filtering capabilities to specify when to
> trigger that snapshot?
>
> If this snapshotting could happen in the kernel, that would be even
> better, i.e. no need for two buffers (kernel rb + userspace rb) for
> achieving this? Anyway, see below some comments.

Hi, Arnaldo,

Thanks for your suggestions and comments. As for the suggestion of snapshot
mode implemented in kernel ring buffer rather than user space ring buffer, I
have thought about that before I made this patch, the problems are:

1. Memory Handle

If snapshot mode implemented in kernel space ring buffer, what's proper size of
that ring buffer? If the size is small (such as 512k), it may lose the older
part of target info which is really related with the reason causing that snapshot.
For example, a temporary process launches and runs for a while, and it cannot
terminate properly in the end. The real reason is that something is wrong during
the launching stage, however, we do not know this prior knowledge and has to
record all the information from the beginning to the end of this process's liftime.
By analyzing the log of perf.data, we can finally figure out that the problem exists
in the launching stage. Here if kernel space ring buffer is not big enough, we will
lose the older part of the logs.

In this case, we may have to set the size of kernel space ring buffer big enough, but
can we kmalloc or vmalloc such big size easily in kernel space in any case? There
may be some limitations to do this. Since allocating memory in the kernel is not as
simple as allocating memory in user space. From this point of view, I prefer to use
user space ring buffer to store the records. It's more flexible and feasible to handle
memory.

2. Overwrite Overhead

Without overwrite feature, snapshot mode implemented in kernel space ring buffer
will also lose target info. For example, the mmap page of kernel space ring buffer
is not full at time A. The target info is filled to this mmap page and then it is
full at time B, then it wakes up perf to mmap_read all its records. In snapshot mode,
all the records are totally dropped without writing to perf.data. Later at time C,
a signal triggers the perf to dump, but the older part of target info was lost at
time B. This is because kernel space ring buffer does not overwrite itself, and dump
all its content once a page is full.

So, we need "overwrite" support, but it needs further fixes and the corresponding patch
is not applied to perf now probably due to its executive overhead (in the critical path,
not very sure yet).

Thus instead of incurring any probable overhead in the kernel space and thus may cause
any problems, I prefer to implement the "overwrite" feature in the user space ring
buffer.

3. Metadata Handle

Even if problem 2 above is solved, i.e., we finally fix the overwrite feature of kernel space
ring buffer. What about metadata handling (mmap_event, comm_event, fork_event, exit_event,
etc.)? When it's time to overwrite this info during the wraparound procedure, this metadata
events have to be handled separately to wake up perf (trigger the poll) to write to perf.data.
This may increase complexity in the kernel space. However, as for the user space ring buffer
version, it directly writes to perf.data once it notices the metadata event, no action of waking
up perf by poll at all. Since the user space ring buffer design has to parse each event to
figure out its size to skip when overwriting, it can easily figure out the event type by the
way (which event is perf_record_sample, and which event is metadata).

However, for this problem, there is a under-discussing solution. As what Wang Nan replies, "For
such tracking events (PERF_RECORD_FORK...), we have dummy event so it is possible for us to receive
tracking events from a separated channel, therefore we don't have to parse every events to pick
those events out."

4. Trigger Semantic

As what you mentioned, snapshot mode implemented in kernel space ring buffer "can" use the in-kernel
filtering capabilities to specify when to trigger that snapshot. In my personal opinion, I regard
the in-kernel filtering capabilities as something all about controlling, or say, programming with
the tracepoints/trace events, which does not relate to the semantic of snapshot. You can use the
in-kernel filtering capabilities (such as eBPF) to enable/disable tracing, filter tracing, aggregate
tracing, but all this tracing info should be written to perf.data normally. However, we can decide
when to really do this writing action via snapshot mode. Snapshot mode is just a semantic in the
user space from a high level view. Without snapshot mode, those "in-kernel filtering capabilities"
can still work for the regular perf. In the production case, apps can send SIGUSR2 signal to perf to
easily record the most useful info which is strongly related with the apps performance problem, no
redundant info any more (only collect the trace at the time any problem happens).

> How about: snapshot_ring_buffer? If that is deemed too big, perhaps
> snapshot_rb?
>
> Please continue following the existing convention and name this:
>
> static int record__write_rb()
>
> As the buffer you're writing is a ring buffer, thus needs checking about
> wrap around, like you do here.
>
> This really looks hackish, using a global to change the behaviour of
> some existing function, and a volatile at that, ouch, please change the
> prototypes in some way to avoid globals like this.

Thanks for your careful review and helpful comments, my patch stands for certain kind of design for
snapshot mode. And other people may have suggestions and different designs, let's talk and compare
different designs first. If the design of this patch is taken finally, I will rewrite it then.

--
Thanks,
Yunlong Song

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Wanpeng Li: "Re: [PATCH] KVM: x86: don't expose syscall/sysret to intel 32-bit guest"
Previous message: Wenwei Tao: "[PATCH 3/3] nvme: change the interface between nvme and lightnvm"
In reply to: Arnaldo Carvalho de Melo: "Re: [PATCH] perf record: Add snapshot mode support for perf's regular events"
Next in thread: David Ahern: "Re: [PATCH] perf record: Add snapshot mode support for perf's regular events"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]