RE: [PATCH 1/3] perf/x86: Add perf text poke event

From: Hunter, Adrian
Date: Thu Dec 19 2019 - 11:50:44 EST


> -----Original Message-----
> From: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Sent: Thursday, December 19, 2019 3:09 PM
> To: Hunter, Adrian <adrian.hunter@xxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>; Borislav Petkov <bp@xxxxxxxxx>; H .
> Peter Anvin <hpa@xxxxxxxxx>; x86@xxxxxxxxxx; Mark Rutland
> <mark.rutland@xxxxxxx>; Alexander Shishkin
> <alexander.shishkin@xxxxxxxxxxxxxxx>; Mathieu Poirier
> <mathieu.poirier@xxxxxxxxxx>; Leo Yan <leo.yan@xxxxxxxxxx>; Arnaldo
> Carvalho de Melo <acme@xxxxxxxxxx>; Jiri Olsa <jolsa@xxxxxxxxxx>; linux-
> kernel@xxxxxxxxxxxxxxx
> Subject: Re: [PATCH 1/3] perf/x86: Add perf text poke event
>
> On Wed, Dec 18, 2019 at 04:26:16PM +0200, Adrian Hunter wrote:
> > Record changes to kernel text (i.e. self-modifying code) in order to
> > support tracers like Intel PT decoding through jump labels.
>
> I don't get the obsession with just jump-labels, we need a solution all
> modifying code. The fentry site usage is increasing, and optprobes are also
> fairly popular with a bunch of people.

Yes we need a solution for all modifying code. Jump labels are just a good
place to start because they are the biggest problem by far.

>
> > A copy of the running kernel code is needed as a reference point (e.g.
> > from /proc/kcore). The text poke event records the old bytes and the
> > new bytes so that the event can be processed forwards or backwards.
> >
> > In the case of Intel PT tracing, the executable code must be walked to
> > reconstruct the control flow. For x86 a jump label text poke begins:
> > - write INT3 byte
> > - IPI-SYNC
> > - write instruction tail
> > At this point the actual control flow will be through the INT3 and
> > handler and not hit the old or new instruction. Intel PT outputs
> > FUP/TIP packets for the INT3, so the flow can still be decoded.
> Subsequently:
> > - emit RECORD_TEXT_POKE with the new instruction
> > - IPI-SYNC
> > - write first byte
> > - IPI-SYNC
> > So before the text poke event timestamp, the decoder will see either
> > the old instruction flow or FUP/TIP of INT3. After the text poke event
> > timestamp, the decoder will see either the new instruction flow or
> > FUP/TIP of INT3. Thus decoders can use the timestamp as the point at
> > which to modify the executable code.
>
> I feel a much better justification for the design can be found in the discussion
> we've had around ARM-CS.
>
> Basically SMP instruction coherency mandates something like this, it is just a
> happy accident x86 already had all the bits in place.
>
> How is something like:
>
> "Record (single instruction) changes to the kernel text (i.e.
> self-modifying code) in order to support tracers like Intel PT and ARM
> CoreSight.
>
> A copy of the running kernel code is needed as a reference point (e.g.
> from /proc/kcore). The text poke event records the old bytes and the new
> bytes so that the event can be processed forwards or backwards.
>
> The basic problem is recording the modified instruction in an unambiguous
> manner given SMP instruction cache (in)coherence. That is, when modifying
> an instruction concurrently any solution with one or multiple timestamps is
> not sufficient:
>
> CPU0 CPU1
> 0
> 1 write insn A
> 2 execute insn A
> 3 sync-I$
> 4
>
> Due to I$, CPU1 might execute either the old or new A. No matter where we
> record tracepoints on CPU0, one simply cannot tell what CPU1 will have
> observed, except that at 0 it must be the old one and at 4 it must be the new
> one.
>
> To solve this, take inspiration from x86 text poking, which has to solve this
> exact problem due to variable length instruction encoding and I-fetch
> windows.
>
> 1) overwrite the instruction with a breakpoint and sync I$
>
> This guarantees that that code flow will never hit the target instruction
> anymore, on any CPU (or rather, it will cause an exception).
>
> 2) issue the TEXT_POKE event
>
> 3) overwrite the breakpoint with the new instruction and sync I$
>
> Now we know that any execution after the TEXT_POKE event will either
> observe the breakpoint (and hit the exception) or the new instruction.
>
> So by guarding the TEXT_POKE event with an exception on either side; we
> can now tell, without doubt, which instruction another CPU will have
> observed."
>
> ?

Ok

>
> > Signed-off-by: Adrian Hunter <adrian.hunter@xxxxxxxxx>
> > ---
> > arch/x86/kernel/alternative.c | 37 +++++++++++++-
>
> I'm thinking it might make sense to do this x86 part in a separate patch, and
> just present the generic thing first:
>
> > include/linux/perf_event.h | 6 +++
> > include/uapi/linux/perf_event.h | 19 ++++++-
> > kernel/events/core.c | 87 ++++++++++++++++++++++++++++++++-
> > 4 files changed, 146 insertions(+), 3 deletions(-)
> >
>
> > @@ -1006,6 +1007,22 @@ enum perf_event_type {
> > */
> > PERF_RECORD_BPF_EVENT = 18,
> >
> > + /*
> > + * Records changes to kernel text i.e. self-modified code.
> > + * 'len' is the number of old bytes, which is the same as the number
> > + * of new bytes. 'bytes' contains the old bytes followed immediately
> > + * by the new bytes.
> > + *
> > + * struct {
> > + * struct perf_event_header header;
> > + * u64 addr;
> > + * u16 len;
> > + * u8 bytes[];
>
> Would it make sense to have something like:
>
> * u16 old_len;
> * u16 new_len;
> * u8 old_bytes[old_len];
> * u8 new_bytes[new_len];
>
> That would allow using this for (short) trampolines (ftrace, optprobes etc..).
> {old_len=0, new_len>0} would indicate a new trampoline, while {old_len>0,
> new_len=0} would indicate the dissapearance of a trampoline.

Yes that makes sense.

>
> > + * struct sample_id sample_id;
> > + * };
> > + */
> > + PERF_RECORD_TEXT_POKE = 19,
> > +
> > PERF_RECORD_MAX, /* non-ABI */
> > };
>
> Then the x86 patch, hooking up the event, can also cover kprobes and stuff.

Ok