Re: [PATCH 1/3] perf/x86: Add perf text poke event
From: Peter Zijlstra
Date: Thu Dec 19 2019 - 08:09:30 EST
On Wed, Dec 18, 2019 at 04:26:16PM +0200, Adrian Hunter wrote:
> Record changes to kernel text (i.e. self-modifying code) in order to
> support tracers like Intel PT decoding through jump labels.
I don't get the obsession with just jump-labels, we need a solution all
modifying code. The fentry site usage is increasing, and optprobes are
also fairly popular with a bunch of people.
> A copy of the running kernel code is needed as a reference point
> (e.g. from /proc/kcore). The text poke event records the old bytes
> and the new bytes so that the event can be processed forwards or backwards.
>
> In the case of Intel PT tracing, the executable code must be walked to
> reconstruct the control flow. For x86 a jump label text poke begins:
> - write INT3 byte
> - IPI-SYNC
> - write instruction tail
> At this point the actual control flow will be through the INT3 and handler
> and not hit the old or new instruction. Intel PT outputs FUP/TIP packets
> for the INT3, so the flow can still be decoded. Subsequently:
> - emit RECORD_TEXT_POKE with the new instruction
> - IPI-SYNC
> - write first byte
> - IPI-SYNC
> So before the text poke event timestamp, the decoder will see either the
> old instruction flow or FUP/TIP of INT3. After the text poke event
> timestamp, the decoder will see either the new instruction flow or FUP/TIP
> of INT3. Thus decoders can use the timestamp as the point at which to
> modify the executable code.
I feel a much better justification for the design can be found in the
discussion we've had around ARM-CS.
Basically SMP instruction coherency mandates something like this, it is
just a happy accident x86 already had all the bits in place.
How is something like:
"Record (single instruction) changes to the kernel text (i.e.
self-modifying code) in order to support tracers like Intel PT and
ARM CoreSight.
A copy of the running kernel code is needed as a reference point (e.g.
from /proc/kcore). The text poke event records the old bytes and the
new bytes so that the event can be processed forwards or backwards.
The basic problem is recording the modified instruction in an
unambiguous manner given SMP instruction cache (in)coherence. That is,
when modifying an instruction concurrently any solution with one or
multiple timestamps is not sufficient:
CPU0 CPU1
0
1 write insn A
2 execute insn A
3 sync-I$
4
Due to I$, CPU1 might execute either the old or new A. No matter where
we record tracepoints on CPU0, one simply cannot tell what CPU1 will
have observed, except that at 0 it must be the old one and at 4 it
must be the new one.
To solve this, take inspiration from x86 text poking, which has to
solve this exact problem due to variable length instruction encoding
and I-fetch windows.
1) overwrite the instruction with a breakpoint and sync I$
This guarantees that that code flow will never hit the target
instruction anymore, on any CPU (or rather, it will cause an
exception).
2) issue the TEXT_POKE event
3) overwrite the breakpoint with the new instruction and sync I$
Now we know that any execution after the TEXT_POKE event will either
observe the breakpoint (and hit the exception) or the new instruction.
So by guarding the TEXT_POKE event with an exception on either side;
we can now tell, without doubt, which instruction another CPU will
have observed."
?
> Signed-off-by: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> ---
> arch/x86/kernel/alternative.c | 37 +++++++++++++-
I'm thinking it might make sense to do this x86 part in a separate
patch, and just present the generic thing first:
> include/linux/perf_event.h | 6 +++
> include/uapi/linux/perf_event.h | 19 ++++++-
> kernel/events/core.c | 87 ++++++++++++++++++++++++++++++++-
> 4 files changed, 146 insertions(+), 3 deletions(-)
>
> @@ -1006,6 +1007,22 @@ enum perf_event_type {
> */
> PERF_RECORD_BPF_EVENT = 18,
>
> + /*
> + * Records changes to kernel text i.e. self-modified code.
> + * 'len' is the number of old bytes, which is the same as the number
> + * of new bytes. 'bytes' contains the old bytes followed immediately
> + * by the new bytes.
> + *
> + * struct {
> + * struct perf_event_header header;
> + * u64 addr;
> + * u16 len;
> + * u8 bytes[];
Would it make sense to have something like:
* u16 old_len;
* u16 new_len;
* u8 old_bytes[old_len];
* u8 new_bytes[new_len];
That would allow using this for (short) trampolines (ftrace, optprobes
etc..). {old_len=0, new_len>0} would indicate a new trampoline, while
{old_len>0, new_len=0} would indicate the dissapearance of a trampoline.
> + * struct sample_id sample_id;
> + * };
> + */
> + PERF_RECORD_TEXT_POKE = 19,
> +
> PERF_RECORD_MAX, /* non-ABI */
> };
Then the x86 patch, hooking up the event, can also cover kprobes and
stuff.