Re: [PATCH RFC 1/6] perf/x86: Add perf text poke event

From: Peter Zijlstra
Date: Fri Nov 01 2019 - 06:05:04 EST


On Thu, Oct 31, 2019 at 03:31:36PM +0800, Leo Yan wrote:

> Before moving forward, I'd like to step back to describe clearly what
> the current problem on Arm64 is, and to check one question about jump
> labels:
>
> I checked the kernel code: both kprobes and ftrace use
> stop_machine() to alter instructions,

That's not correct for Aargh64, see aarch64_insn_patch_text_nosync(),
which is used in both ftrace and jump_label.

> since all CPUs run into stop_machine()'s synchronization, there is no
> race condition between the instruction transition and CPUs executing
> the altered instruction; thus it's safe for kprobes and ftrace to use
> the perf event PERF_TEXT_POKE_UPDATE to notify of the instruction
> transition, which lets the decoder read out the 'correct' instruction.

Agreed, IFF patching happens using stop_machine(), things are easy. ARM
is (so far) exclusively using stop_machine() based text_poking, although
the last time I spoke to Will about this, he said the _nosync stuff is
possible on 32bit too, just nobody has bothered implementing it.

> But jump label doesn't use stop_machine(), so the perf event
> PERF_TEXT_POKE_UPDATE would introduce a race condition as below
> (take the transition from nop to branch as an example):
>
>   CPU0                                        CPU1
>   NOP instruction
>   `-> static_key_enable()
>        `-> aarch64_insn_patch_text_nosync()
>             `-> perf event PERF_TEXT_POKE_UPDATE
>                                               -> execute NOP
>                                                  instruction
>             `-> aarch64_insn_write()
>             `-> __flush_icache_range()
>
> Since the x86 platform has INT3 as an intermediate state, it can
> avoid the race condition between CPU0 (which does the transition) and
> the other CPUs (which may execute the nop/branch).

Ah, you found the _nosync thing in jump_label, here's the one in ftrace:

arch/arm64/kernel/ftrace.c: if (aarch64_insn_patch_text_nosync((void *)pc, new))

And yes, this is racy.

> > The thing is, as I argued, the instruction state between PRE and POST is
> > ambiguous. This makes it impossible to decode the branch decision
> > stream.
> >
> > Suppose CPU0 emits the PRE event at T1 and the POST event at T5, but we
> > have CPU1 covering the instruction at T3.
> >
> > How do you decide where CPU1 goes and what the next conditional branch
> > is?
>
> Sorry, I hadn't thought that through.
>
> I agree that T3 is an ambiguous state in the flow below:
>
>   CPU0                                         CPU1
>   perf event PERF_TEXT_POKE_UPDATE_PRE  -> T1
>
>                                              Int3 / NOP    -> T3
>
>                                              Int3 / branch -> T3'
>
>   perf event PERF_TEXT_POKE_UPDATE_POST -> T5
>
> Unless the trace carries extra info and the decoder can use the
> old/new instruction combination for analysis, the PRE/POST event pair
> isn't enough to resolve this issue (if the trace decoder can do that,
> the kernel change will be much simpler).
>
> Below are two potential options we could use on the Arm64 platform:
>
> - Change jump label to use stop_machine(); this might introduce a
> performance issue if jump labels are altered frequently.
>
> To mitigate the impact, we could use stop_machine() only when we
> detect that the perf events are enabled, and otherwise fall back to
> the old code path.
>
> - We can use a breakpoint to emulate a flow similar to x86's int3,
> which avoids the race between the CPU altering the instruction and
> the other CPUs running into the altered instruction.
>
> @Will, @Mark, could you help review this? I'd appreciate any comments
> and suggestions. And please let me know if you'd like to consolidate
> this with related work on your side (or if you know of any ongoing
> discussion or someone already working on this).

Given people are building larger Aargh64 machines (I've heard of 100+
CPUs already), I'm thinking the second option, using a breakpoint, is
the most performant.

But yes, as you mention earlier, we can make this optional on the
TEXT_POKE_UPDATE event being in use.

I'm thinking something along the lines of:

static uintptr_t nosync_addr;
static u32 nosync_insn;

int __kprobes aarch64_insn_patch_text_nosync(void *addr, u32 insn)
{
	const u32 brk = /* some_breakpoint_insn */;
	uintptr_t tp = (uintptr_t)addr;
	int ret;

	lockdep_assert_held(&text_mutex);

	/* A64 instructions must be word aligned */
	if (tp & 0x3)
		return -EINVAL;

	if (perf_text_poke_update_enabled()) {

		nosync_insn = insn;
		smp_store_release(&nosync_addr, tp);

		ret = aarch64_insn_write(addr, brk);
		if (ret == 0)
			__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);

		perf_event_text_poke(....);
	}

	ret = aarch64_insn_write(addr, insn);
	if (ret == 0)
		__flush_icache_range(tp, tp + AARCH64_INSN_SIZE);

	return ret;
}

And have the 'brk' handler do:

void aarch64_insn_break_handler(struct pt_regs *regs)
{
	unsigned long addr = smp_load_acquire(&nosync_addr);
	u32 insn = nosync_insn;

	if (regs->pc != addr)
		return;

	/* emulate @insn */
}

I understood from Will the whole nosync scheme only works for a limited
set of instructions, but you only have to implement emulation for the
actual instructions used of course.

(which is what we do on x86)

Does this sound workable?