Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event

From: Will Deacon
Date: Mon Nov 11 2019 - 12:29:44 EST


On Mon, Nov 11, 2019 at 05:05:05PM +0100, Peter Zijlstra wrote:
> On Mon, Nov 11, 2019 at 03:39:25PM +0000, Will Deacon wrote:
>
> > Backing up though, I think I'm missing details about what this thread is
> > trying to achieve. You're adding perf events so that coresight trace can
> > take into account modifications of the kernel text, right?
>
> Yes, because ARM-CS / Intel-PT need to know the exact text at any one
> time in order to correctly decode their traces.
>
> > If so:
> >
> > * Does this need to take into account more than just jump_label()?
>
> jump_label seems to be the initial target Adrian set, but yes, it needs
> to cover 'everything'.

Including alternatives, which are what get me worried since the potential
for recursion is pretty high there (on arm64, at least).

> That is, all single instruction patching crud around:
>
> - optimized kprobes
> (regular kprobes are exempt because they rely on exceptions)
> - jump_labels
> - static_call (still pending but still)
> - ftrace fentry
>
> We also need a solution for whole new chunks of text:
>
> - modules
> - ftrace trampolines
> - optprobe trampolines
> - JIT stuff
>
> but that is so far not included; I had some ideas about a /dev/mem based
> interface that would report new ranges and wait for acks (from open
> file-desc) before freeing them.

I think that it would be nice not to end up with two very different
interfaces for this. But I see this is still just an RFC, so maybe the
full picture will emerge once we solve more of these use-cases.

> > * What consistency guarantees are needed by userspace? It's not clear to
> > me how it correlates the new event with execution on other CPUs. Is this
> > using timestamps or something else?
>
> Something else :-)
>
> Both CS/PT basically have a bit-stream encoding of taken / not-taken
> decisions. To decode they read the text until a conditional
> branch-point, then consume one bit from the stream and continue.
>
> (note how unconditional branches -- jump_labels -- are expected to be
> correct in this scheme)
>
> This means they need an exact representation of the text to be able to
> accurately decode.
>
> This means timestamps or PRE/POST modify events are not acceptible.
> Because until the I-FLUSH has completed we do not know which CPU has
> which version of the instruction.
>
> Instead we rely on exceptions; exceptions are differently encoded in the
> CS/PT data streams.
>
> The scheme used is:
>
> - overwrite target instruction with an exception (INT3 on x86, BRK on arm)
> - sync (IPI broadcast CPUID or I-FLUSH completion)

Hmm. Wouldn't this sync also need to drain the trace buffers for all other
CPUs so that we make sure that the upcoming TEXT_POKE event occurs after
all prior trace data, which could've been from before the breakpoint was
installed?

> at this point we know the instruction _will_ trap and CS/PT can observe
> this alternate flow. That is, the exception handler will emulate the
> instruction.
>
> - emit the TEXT_POKE event with both the old and new instruction
> included
>
> - overwrite the target instruction with the new instruction
> - sync
>
> at this point the new instruction should be valid.
>
> Using this scheme we can at all times follow the instruction flow.
> Either it is an exception and the exception encoding helps us navigate,
> or, on either size, we'll know the old/new instruction.
>
> > * What about module loading?
>
> I mentioned that above; that is still pending.
>
> > Finally, the whole point of the '_nosync()' stuff is that we don't have
> > to synchronise. Therefore, there is a window where you really can't tell
> > whether the instruction has been updated from the perspective of another
> > CPU.
>
> Right, and much of that is preserved with the above scheme, nowhere do
> you have to wait/disturb other CPUs, except for the I-FLUSH completion.
>
> Does this help?

Yes, thank you for explaining it!

Will