Re: [RFC PATCH 0/4] perf: Correlating user process data to samples

From: Mathieu Desnoyers
Date: Fri Apr 12 2024 - 14:33:15 EST


On 2024-04-12 12:28, Beau Belgrave wrote:
> On Thu, Apr 11, 2024 at 09:52:22PM -0700, Ian Rogers wrote:
>> On Thu, Apr 11, 2024 at 5:17 PM Beau Belgrave <beaub@xxxxxxxxxxxxxxxxxxx> wrote:
>>>
>>> In the Open Telemetry profiling SIG [1], we are trying to find a way to
>>> grab a tracing association quickly on a per-sample basis. The team at
>>> Elastic has a bespoke way to do this [2], however, I'd like to see a
>>> more general way to achieve this. The folks I've been talking with seem
>>> open to the idea of just having a TLS value for this we could capture
>>
>> Presumably TLS == Thread Local Storage.
>>
>
> Yes, the initial idea is to use thread local storage (TLS). It seems to
> be the fastest option to save a per-thread value that changes at a fast
> rate.
>
>>> upon each sample. We could then just state, Open Telemetry SDKs should
>>> have a TLS value for span correlation. However, we need a way to sample
>>> the TLS or other value(s) when a sampling event is generated. This is
>>> supported today on Windows via EventActivityIdControl() [3]. Since
>>> Open Telemetry works on both Windows and Linux, ideally we can do
>>> something as efficient for Linux based workloads.
>>>
>>> This series is to explore how it would be best possible to collect
>>> supporting data from a user process when a profile sample is collected.
>>> Having a value stored in TLS makes a lot of sense for this however
>>> there are other ways to explore. Whatever is chosen, kernel samples
>>> taken in process context should be able to get this supporting data.
>>> In these patches on X64 the fsbase and gsbase are used for this.
>>>
>>> An option to explore suggested by Mathieu Desnoyers is to utilize rseq
>>> for processes to register a value location that can be included when
>>> profiling if desired. This would allow a tighter contract between user
>>> processes and a profiler. It would allow better labeling/categorizing
>>> the correlation values.
>>
>> It is hard to understand this idea. Are you saying stash a cookie in
>> TLS for samples to capture to indicate an activity? Restartable
>> sequences are about preemption on a CPU not of a thread, so at least
>> my intuition is that they feel different. You could stash information
>> like this today by changing the thread name which generates comm
>> events. I've wondered about having similar information in some form of
>> reserved for profiling stack slot, for example, to stash a pointer to
>> the name of a function being interpreted. Snapshotting all of a stack
>> is bad performance wise and for security. A stack slot would be able
>> to deal with nesting.
>>
>
> You are getting the idea. A slot or tag for a thread would be great! I'm
> not a fan of overriding the thread comm name (as that already has a
> use). TLS would be fine, if we could also pass an offset + size + type.
>
> Maybe a stack slot that just points to parts of TLS? That way you could
> have a set of slots that don't require much memory and selectively copy
> them out of TLS (or where ever those slots point to in user memory).
>
> When I was talking to Mathieu about this, it seems that rseq already had
> a place to potentially put these slots. I'm unsure though how the per
> thread aspects would work.
>
> Mathieu, can you post your ideas here about that?

Sure. I'll try to summarize my thoughts here. By all means, let me
know if I'm missing important pieces of the puzzle.

First of all, here is my understanding of what information we want to
share between userspace and kernel.

A 128-bit activity ID identifies "uniquely" (as far as a 128-bit random
UUID allows) a portion of the dependency chain involved in doing some
work (e.g. answering an HTTP request) across one or many participating hosts.

Activity IDs have a parent/child relationship: a parent activity ID can
create child activity IDs.

For instance, if one host has the service "dispatch", another host
has a "web server", and a third host has a SQL database, we should
be able to follow the chain of activities needed to answer a web
query by following those activity IDs, linking them together
through parent/child relationships. This usually requires the
communication protocols to convey those activity IDs across hosts.

The reason why this information must be provided from userspace is
because it's userspace that knows where to find those activity IDs
within its application-layer communication protocols.

With tracing, a full trace of activity ID span begin/end events from
all hosts allows reconstructing the activity ID parent/child
relationships, so we typically only need to emit span begin/end
events with parent/child info to a tracer.

Using activity IDs from a kernel profiler is trickier, because
we do not have access to the complete span begin/end trace to
reconstruct the activity ID parent/child relationship. This is
where I suspect we'd want to introduce a notion of "activity ID
stack", so a profiler could reconstruct the currently active
stack of activity IDs for the current thread by walking that
stack.
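
To make the "activity ID stack" idea more concrete, here is a minimal
layout sketch. All names and the fixed-depth layout are hypothetical;
a real ABI would have to settle variable depth, overflow handling and
extensibility:

#include <linux/types.h>

struct activity_id {
        __u64 lo;       /* low 64 bits of the 128-bit ID */
        __u64 hi;       /* high 64 bits of the 128-bit ID */
};

#define ACTIVITY_ID_STACK_MAX   16      /* arbitrary for this sketch */

struct activity_id_stack {
        __u32 depth;            /* number of valid entries */
        __u32 reserved;         /* padding / future use */
        struct activity_id ids[ACTIVITY_ID_STACK_MAX];
        /*
         * ids[0] is the outermost parent, ids[depth - 1] is the
         * currently active span for this thread.
         */
};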

This profiling could be triggered either from an interrupt
(sampling use-case), which would then walk the user-space data
on return to userspace as noted by Peter, or from a system call
return to userspace. The latter option would make it possible to
correlate system calls with their associated activity ID stacks.

The basic scenario is simple enough: a thread pushes a new
current activity ID (starts a span), possibly nests other
spans, and ends them. It all happens neatly within a single
thread.
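
As a sketch of that basic flow, reusing the hypothetical layout above,
span begin/end in the SDK would reduce to a push/pop on the thread's
stack:

static inline void activity_push(struct activity_id_stack *s,
                                 struct activity_id id)
{
        if (s->depth < ACTIVITY_ID_STACK_MAX)
                s->ids[s->depth] = id;
        s->depth++;     /* overflow entries are counted, not stored */
}

static inline void activity_pop(struct activity_id_stack *s)
{
        if (s->depth)
                s->depth--;
}

/*
 * Span begin: generate a child of ids[depth - 1] and push it.
 * Span end: pop. A profiler sampling in between recovers the whole
 * ancestor chain by walking ids[0 .. depth - 1].
 */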

More advanced scenarios require more thoughts:

- non-blocking communication, where a thread can hop between
different requests. Technically, it should be able to swap
its current activity ID stack as it swaps handled requests
(a sketch of such a swap follows this list).

- green threads (userspace scheduler): the userspace scheduler
should be able to swap the activity ID stack of the current
thread when swapping between user level threads.

- task "posting" (e.g. work queues types of work dispatch):
the activity ID stacks should probably be handed over with
the work item, and set as current activity ID stack by the
worker thread.

- exceptions should be able to restore the activity ID stack
from a previously saved state.

- Interpreters, JITs. Not sure about the constraints there; we
may need support from the runtimes.
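
For the hand-over cases above (event loops, green threads, work
queues), one argument for publishing a pointer to the stack rather
than the stack itself is that swapping becomes a single store. A
sketch, again with made-up names and reusing the hypothetical types
above; the real interface would have to define ordering/visibility
rules for a kernel reader sampling concurrently:

/* Per-thread slot pointing at the currently active stack. */
static __thread struct activity_id_stack *current_activity_stack;

static inline struct activity_id_stack *
activity_stack_swap(struct activity_id_stack *next)
{
        struct activity_id_stack *prev = current_activity_stack;

        /* Single store on the hot path, no system call. */
        __atomic_store_n(&current_activity_stack, next, __ATOMIC_RELEASE);
        return prev;    /* saved so it can be restored later */
}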

Those activity IDs are frequently updated, so going through a
system call each time would be a non-starter. This is where
thinking in terms of sharing a per-thread data structure
(populated by user-space, read by the kernel) becomes relevant.

A few words about how the rseq(2) system call could help: the
main building block of rseq is a per-thread "struct rseq" ABI,
registered by libc on thread creation. It is guaranteed to be
accessible at a fixed offset from the thread pointer in userspace,
and can be accessed from the kernel through the task struct "rseq"
pointer (in contexts that can handle page faults).
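
For reference, with a glibc that registers rseq itself (2.35 and
later export __rseq_offset and __rseq_size in <sys/rseq.h>),
userspace can locate its own struct rseq roughly like this. A
minimal sketch, assuming a toolchain that provides
__builtin_thread_pointer():

#include <stdio.h>
#include <sys/rseq.h>

static struct rseq *thread_rseq(void)
{
        if (!__rseq_size)       /* rseq not registered by libc */
                return NULL;
        return (struct rseq *)((char *)__builtin_thread_pointer() +
                               __rseq_offset);
}

int main(void)
{
        struct rseq *rs = thread_rseq();

        if (rs)
                printf("running on cpu %u\n", rs->cpu_id);
        return 0;
}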

Since Linux v6.3 the rseq structure has been extensible, and we're
working with GNU libc to add support for this extension scheme [1].
So even though the primary use-case for rseq was to enable per-cpu
data structures in user-space, it can be used for other purposes
where shared per-thread data is needed between kernel and userspace.

We could envision adding a new field to struct rseq which would contain
the top-level pointer to the current "activity ID stack". The layout
of this stack would have to be defined as a kernel ABI. We'd want
to support push/pop of activity IDs on that stack, moving all or
portions of the activity ID stack somewhere else, and saving/restoring
the stack from a saved state to accommodate the "advanced" scenarios
described above (and probably other scenarios I'm missing).
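
Purely as an illustration of that direction (none of this exists in
struct rseq today): the new field could be a user pointer that the
SDK publishes once per thread, and swaps when the active stack
changes. Field name and type below are made up, and the struct
activity_id_stack layout is the one sketched earlier:

#include <stdint.h>

/*
 * Imagined extension field; its offset and semantics would have to
 * be defined by the kernel ABI.
 */
struct rseq_activity_ext {
        __u64 activity_id_stack_ptr;    /* user VA of the current
                                           struct activity_id_stack,
                                           0 if none */
};

/* Userspace side: publish this thread's stack once it is initialized. */
static void publish_activity_stack(struct rseq_activity_ext *ext,
                                   struct activity_id_stack *stack)
{
        __atomic_store_n(&ext->activity_id_stack_ptr,
                         (__u64)(uintptr_t)stack, __ATOMIC_RELEASE);
}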

rseq(2) could also play a role in letting the kernel expose a seed to
be used for generation of random activity IDs through yet another new
struct rseq field if this happens to be relevant. It could either be
a seed, or just a generation counter to be used to check whether the
seed needs to be regenerated after sleep/hibernate/fork/clone [2].
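
A sketch of the generation-counter variant (again with a hypothetical
rseq field and a hypothetical SDK hook): the SDK seeds its ID
generator once at startup and only re-seeds when it observes the
kernel-maintained counter change, e.g. after fork/clone or resume
from hibernation:

static __thread __u32 seen_generation;

static void maybe_reseed(const volatile __u32 *rseq_seed_generation)
{
        __u32 gen = *rseq_seed_generation;      /* imagined new field */

        if (gen != seen_generation) {
                sdk_reseed_activity_id_prng();  /* hypothetical SDK hook */
                seen_generation = gen;
        }
}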

I'm hoping all this makes some sense, or at least highlights holes
in my understanding. Feedback is welcome!

Thanks,

Mathieu

[1] https://sourceware.org/pipermail/libc-alpha/2024-March/155587.html
[2] https://sourceware.org/pipermail/libc-alpha/2024-March/155577.html

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com