Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event

From: Peter Zijlstra
Date: Mon Nov 11 2019 - 11:05:20 EST


On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:

> Backing up though, I think I'm missing details about what this thread is
> trying to achieve. You're adding perf events so that coresight trace can
> take into account modifications of the kernel text, right?

Yes, because ARM-CS / Intel-PT need to know the exact text at any one
time in order to correctly decode their traces.

> If so:
>
> * Does this need to take into account more than just jump_label()?

jump_label seems to be the initial target Adrian set, but yes, it needs
to cover 'everything'.

That is, all single instruction patching crud around:

- optimized kprobes
(regular kprobes are exempt because they rely on exceptions)
- jump_labels
- static_call (still pending but still)
- ftrace fentry

We also need a solution for whole new chunks of text:

- modules
- ftrace trampolines
- optprobe trampolines
- JIT stuff

but that is so far not included; I had some ideas about a /dev/mem based
interface that would report new ranges and wait for acks (from open
file-desc) before freeing them.

> * What consistency guarantees are needed by userspace? It's not clear to
> me how it correlates the new event with execution on other CPUs. Is this
> using timestamps or something else?

Something else :-)

Both CS/PT basically have a bit-stream encoding of taken / not-taken
decisions. To decode they read the text until a conditional
branch-point, then consume one bit from the stream and continue.

(note how unconditional branches -- jump_labels -- are expected to be
correct in this scheme)

This means they need an exact representation of the text to be able to
accurately decode.

This means timestamps or PRE/POST modify events are not acceptible.
Because until the I-FLUSH has completed we do not know which CPU has
which version of the instruction.

Instead we rely on exceptions; exceptions are differently encoded in the
CS/PT data streams.

The scheme used is:

- overwrite target instruction with an exception (INT3 on x86, BRK on arm)
- sync (IPI broadcast CPUID or I-FLUSH completion)

at this point we know the instruction _will_ trap and CS/PT can observe
this alternate flow. That is, the exception handler will emulate the
instruction.

- emit the TEXT_POKE event with both the old and new instruction
included

- overwrite the target instruction with the new instruction
- sync

at this point the new instruction should be valid.

Using this scheme we can at all times follow the instruction flow.
Either it is an exception and the exception encoding helps us navigate,
or, on either size, we'll know the old/new instruction.

> * What about module loading?

I mentioned that above; that is still pending.

> Finally, the whole point of the '_nosync()' stuff is that we don't have
> to synchronise. Therefore, there is a window where you really can't tell
> whether the instruction has been updated from the perspective of another
> CPU.

Right, and much of that is preserved with the above scheme, nowhere do
you have to wait/disturb other CPUs, except for the I-FLUSH completion.

Does this help?