Re: [PATCH] perf cs-etm: stamp pid/tid/EL on each buffered packet to fix cross-pid attribution

From: James Clark

Date: Tue May 26 2026 - 07:21:39 EST

On 15/05/2026 3:11 am, Amir Ayupov wrote:

In a system-wide `perf record -e cs_etm/.../u` capture on aarch64,
synthesized samples emitted by `perf script --itrace=il64` are
sometimes attributed to the WRONG sample.pid/tid (and to the wrong
EL/cpumode) for the chunk of branches that straddle a context-switch
boundary on a CPU. A branch actually retired by process A is emitted
with sample.pid set to the thread that next ran on the same CPU.

Mechanism:
1. ETM emits CONTEXTIDR/EL packets in-stream when the kernel updates
CONTEXTIDR_EL1 on context switch / EL change. OpenCSD turns these
into OCSD_GEN_TRC_ELEM_PE_CONTEXT elements interleaved with
OCSD_GEN_TRC_ELEM_INSTR_RANGE elements for retired branch ranges.
2. cs_etm_decoder__buffer_range() queues each INSTR_RANGE into
packet_queue->packet_buffer[]; packets carry start/end addrs,
instr_count, last-instruction info, etc., but NO owner identity.
3. PE_CONTEXT goes through cs_etm_decoder__set_tid() ->
cs_etm__set_thread(), which immediately mutates tidq->thread and
tidq->el. Queued packets are not drained first; reset_timestamp()
is called so the next TIMESTAMP triggers OCSD_RESP_WAIT and a
drain.
4. By drain time in cs_etm__process_traceid_queue() ->
cs_etm__sample(), sample.pid/tid is read from the now-mutated
tidq->thread and sample.cpumode from the now-mutated tidq->el.
Pre-context INSTR_RANGEs get the post-context owner.
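
For illustration only, the mechanism above can be reduced to a toy model.
This is a minimal sketch, not the perf code: every toy_* name is
hypothetical, and the single cur_tid field merely stands in for
tidq->thread. The tids are the ones from the example further down.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PACKETS 8

/* Hypothetical stand-in for a buffered INSTR_RANGE: addresses only, no owner. */
struct toy_packet {
	unsigned long start_addr;
	unsigned long end_addr;
};

/* Hypothetical stand-in for a traceid queue with one mutable "current" owner. */
struct toy_queue {
	struct toy_packet buf[TOY_MAX_PACKETS];
	size_t count;
	int cur_tid;	/* plays the role of tidq->thread */
};

/* Step 2: queue a range; nothing about the owner is recorded. */
void toy_buffer_range(struct toy_queue *q, unsigned long start, unsigned long end)
{
	assert(q->count < TOY_MAX_PACKETS);
	q->buf[q->count].start_addr = start;
	q->buf[q->count].end_addr = end;
	q->count++;
}

/* Step 3: a context element mutates the owner without draining the queue. */
void toy_set_thread(struct toy_queue *q, int tid)
{
	q->cur_tid = tid;
}

/* Step 4: at drain time the owner is read from the (already mutated) queue. */
int toy_sample_tid(const struct toy_queue *q, size_t idx)
{
	assert(idx < q->count);
	return q->cur_tid;
}
```

Buffering a range while cur_tid is 2741587, then calling
toy_set_thread(&q, 2638146), makes toy_sample_tid(&q, 0) return 2638146:
the pre-context range is reported with the post-context owner.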

The same race affects branch samples via tidq->prev_packet_thread /
tidq->prev_packet_el, captured at packet-swap time from
tidq->thread / tidq->el (which may already have flipped).

This is independent of PERF_RECORD_SWITCH_CPU_WIDE, which is
deliberately not used to assign sample identity in this path. The
bug applies to any cs_etm capture with in-stream CONTEXTIDR
(PIDFMT_CTXTID or PIDFMT_CTXTID2).

Effect on downstream tools: branches that should belong to the
previous thread on the CPU get attributed to the next thread. When
the two threads share a binary, leaked branches' VAs land in the
wrong thread's mappings; samples whose IPs land in r-x mappings
silently pollute that binary's profile, while samples landing in
R-only/RW mappings show up as out-of-range / non-text samples.
Either way, AutoFDO/BOLT profiles built from `perf script --itrace`
output of system-wide cs_etm captures contain misattributed samples.

Concrete example from `perf script --itrace=il64` of the same
captured branch (same timestamp, same IP, same from/to addrs) before
and after this fix:

before: launcher_multia 2638146/2638146 705897.219172: \
fffcda6b124c 0xfffcda641958/0xfffcda6b123c
after: ws-tcf-sr-io13 2736581/2741587 705897.219172: \
fffcda6b124c 0xfffcda641958/0xfffcda6b123c

The branch was retired by ws-tcf-sr-io13 (tid 2741587) but, before
the fix, was attributed to launcher_multia (the next thread to run on
that CPU after the context switch). After the fix, it is correctly
attributed to ws-tcf-sr-io13.

Why not "drain on PE_CONTEXT then switch" (deferred-set_thread):
tidq->thread has two consumers: sample emission needs the OUTGOING
identity for queued packets, but cs_etm__mem_access() needs the
CURRENT thread's maps to fetch instruction bytes for OpenCSD. The
two needs are temporally inverted; a single tidq->thread cannot
serve both. Keeping tidq->thread current and stamping owner identity
per packet is the only design that decouples them cleanly.

Fix: capture the owning pid/tid/EL on each buffered packet at
cs_etm_decoder__buffer_packet() time (before any subsequent
PE_CONTEXT can mutate tidq->thread / tidq->el), and read them at
sample emission time.

- struct cs_etm_packet gains pid_t pid, pid_t tid, int el (storing
an ocsd_ex_level value; typed as int so the struct does not
depend on OpenCSD headers, which are only included inside
HAVE_CSTRACE_SUPPORT).
- cs_etm__etmq_get_pid_tid_el() (formerly cs_etm__etmq_get_pid_tid)
returns all three.
- cs_etm__synth_instruction_sample() reads sample.pid / sample.tid
from tidq->packet->{pid,tid} and derives sample.cpumode from
tidq->packet->el.
- cs_etm__synth_branch_sample() reads sample.pid / sample.tid /
cpumode from tidq->prev_packet->{pid,tid,el}.
- The separate prev_packet_thread / prev_packet_el bookkeeping in
cs_etm__packet_swap() / cs_etm__init_traceid_queue() /
cs_etm__free_traceid_queues() is removed; the per-packet stamp
on prev_packet now carries that information.
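
As a rough sketch of the per-packet stamping idea described in the list
above (not the patch itself): the sketch_* names and the flat cur_pid /
cur_tid / cur_el fields are hypothetical simplifications of tidq->thread
and tidq->el, and el is kept as a plain int as the first bullet describes.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

#define SKETCH_MAX_PACKETS 8

/* Hypothetical packet that carries its owner alongside the address range. */
struct sketch_packet {
	unsigned long start_addr;
	unsigned long end_addr;
	pid_t pid;
	pid_t tid;
	int el;		/* exception level stored as a plain int */
};

/* Hypothetical queue; cur_* always describe the thread running right now. */
struct sketch_queue {
	struct sketch_packet buf[SKETCH_MAX_PACKETS];
	size_t count;
	pid_t cur_pid;
	pid_t cur_tid;
	int cur_el;
};

/* Stamp the owner at buffering time, before any later context change. */
void sketch_buffer_packet(struct sketch_queue *q, unsigned long start,
			  unsigned long end)
{
	struct sketch_packet *p;

	assert(q->count < SKETCH_MAX_PACKETS);
	p = &q->buf[q->count++];
	p->start_addr = start;
	p->end_addr = end;
	p->pid = q->cur_pid;
	p->tid = q->cur_tid;
	p->el = q->cur_el;
}

/* A context element may update the current owner immediately. */
void sketch_set_context(struct sketch_queue *q, pid_t pid, pid_t tid, int el)
{
	q->cur_pid = pid;
	q->cur_tid = tid;
	q->cur_el = el;
}

/* Sample emission reads the stamp on the packet, not the queue state. */
pid_t sketch_sample_tid(const struct sketch_queue *q, size_t idx)
{
	assert(idx < q->count);
	return q->buf[idx].tid;
}
```

In this model a packet buffered before sketch_set_context() keeps the tid
it was stamped with, while the queue's current owner stays free to change
for whatever else needs the live thread.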

Cost: 12 bytes added to struct cs_etm_packet (~12-16 KB per
packet_queue with CS_ETM_PACKET_MAX_BUFFER=1024), 16 bytes saved per
cs_etm_traceid_queue (one struct thread * + one ocsd_ex_level).
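
The per-queue figure can be reproduced with simple arithmetic. The sketch
below assumes 4-byte pid_t and int (typical on Linux, but an assumption
here), and that the upper bound comes from the 12 added bytes being
rounded up to 16 by struct alignment; the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>

/* Buffer depth as quoted above for CS_ETM_PACKET_MAX_BUFFER. */
#define SKETCH_PACKET_MAX_BUFFER 1024

/* Hypothetical grouping of the three added fields, only to show their size. */
struct owner_stamp {
	pid_t pid;
	pid_t tid;
	int el;
};

/* Extra bytes per packet queue for a given per-packet growth in bytes. */
size_t added_bytes_per_queue(size_t growth_per_packet)
{
	assert(growth_per_packet <= SIZE_MAX / SKETCH_PACKET_MAX_BUFFER);
	return growth_per_packet * SKETCH_PACKET_MAX_BUFFER;
}
```

added_bytes_per_queue(12) is 12288 bytes (12 KB) and
added_bytes_per_queue(16) is 16384 bytes (16 KB), matching the range
quoted above.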

A residual gap: cs_etm__copy_insn() reads sample.insn bytes via
cs_etm__mem_access(), which still uses tidq->thread (the current
thread), so the inline insn bytes for an outgoing-thread sample may
be looked up against the wrong address space. Fixing this requires
threading the packet's owner pid through cs_etm__mem_access and is
left for a follow-up. sample.ip / sample.pid attribution, which is what
AutoFDO/BOLT consume, is correct.

Hi Amir,

Can you test the patch here to see if it fixes your issue [1]?

We thought it didn't make sense to store the thread on every packet when there is only one active thread for the decoder and one for sample generation. We also fixed the other issue mentioned above about cs_etm__copy_insn() not working.

Thanks
James

[1]: https://lore.kernel.org/linux-perf-users/20260526-james-cs-context-tracking-fix-v1-0-ebd602e18287@xxxxxxxxxx/T/#t