Re: [PATCH 2/4] arm64: implement support for static call trampolines

From: Peter Zijlstra
Date: Mon Oct 25 2021 - 09:57:11 EST


On Mon, Oct 25, 2021 at 02:21:00PM +0200, Frederic Weisbecker wrote:

> +#define __ARCH_DEFINE_STATIC_CALL_TRAMP(name, insn)			\
> +	asm("	.pushsection	.static_call.text, \"ax\"	\n"	\
> +	    "	.align		4				\n"	\
> +	    "	.globl		" STATIC_CALL_TRAMP_STR(name) "	\n"	\
> +	    "0:	.quad		0x0				\n"	\
> +	    STATIC_CALL_TRAMP_STR(name) ":			\n"	\
> +	    "	hint		34	/* BTI C */		\n"	\
> +	    "	" insn "					\n"	\
> +	    "	ldr		x16, 0b				\n"	\
> +	    "	cbz		x16, 1f				\n"	\
> +	    "	br		x16				\n"	\
> +	    "1:	ret						\n"	\
> +	    "	.popsection					\n")

> +void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
> +{
> +	/*
> +	 * -0x8	<literal>
> +	 *  0x0	bti c		<--- trampoline entry point
> +	 *  0x4	<branch or nop>
> +	 *  0x8	ldr x16, <literal>
> +	 *  0xc	cbz x16, 20
> +	 * 0x10	br x16
> +	 * 0x14	ret
> +	 */
> +	struct {
> +		u64	literal;
> +		__le32	insn[2];
> +	} insns;
> +	u32 insn;
> +	int ret;
> +
> +	insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_BTIC);
> +	insns.literal = (u64)func;
> +	insns.insn[0] = cpu_to_le32(insn);
> +
> +	if (!func) {
> +		insn = aarch64_insn_gen_branch_reg(AARCH64_INSN_REG_LR,
> +						   AARCH64_INSN_BRANCH_RETURN);
> +	} else {
> +		insn = aarch64_insn_gen_branch_imm((u64)tramp + 4, (u64)func,
> +						   AARCH64_INSN_BRANCH_NOLINK);
> +
> +		/*
> +		 * Use a NOP if the branch target is out of range, and rely on
> +		 * the indirect call instead.
> +		 */
> +		if (insn == AARCH64_BREAK_FAULT)
> +			insn = aarch64_insn_gen_hint(AARCH64_INSN_HINT_NOP);
> +	}
> +	insns.insn[1] = cpu_to_le32(insn);
> +
> +	ret = __aarch64_insn_write(tramp - 8, &insns, sizeof(insns));

OK, that's pretty magical...

So you're writing the literal and the two instructions with two u64
stores, relying on alignment to guarantee both land in a single page
and that copy_to_kernel_nofault() selects u64 writes.
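
(IIRC the nofault path chunks the copy by size; paraphrasing
mm/maccess.c from memory, modulo details:

	#define copy_to_kernel_nofault_loop(dst, src, len, type, err_label)	\
		while (len >= sizeof(type)) {					\
			__put_kernel_nofault(dst, src, type, err_label);	\
			dst += sizeof(type);					\
			src += sizeof(type);					\
			len -= sizeof(type);					\
		}

run for u64 first, then u32/u16/u8 for the tail; so a 16-byte store of
a 16-byte aligned struct comes out as exactly two u64 puts.)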

By unconditionally writing the literal, you avoid there ever being a
stale value, which in turn avoids a race where you switch from 'B @func'
relative addressing to 'NOP; do-literal-thing' and a concurrently
executing CPU sees the two updates in the inverted order.
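
That is, if the literal were ever left stale, I'd expect an
interleaving like:

	CPU0			CPU1

	[S] insn[1] = NOP	[I] NOP
				[L] x16 = literal (stale)
	[S] literal = func
				br x16	// old target, or NULL

and the unconditional literal store is what closes that window.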

Ooohh, but what if you go from !func to NOP?

assuming:

	.literal = 0
	BTI C
	RET

Then

	CPU0			CPU1

	[S] literal = func	[I] NOP
	[S] insn[1] = NOP	[L] x16 = literal (NULL)
				br x16
				*BANG*

Is that possible? (total lack of memory ordering etc.)

On IRC you just alluded to the fact that this relies on it all being in
a single cacheline (i-fetch windows don't need to be cacheline sized,
but provided they're at least 16 bytes, this should still work given the
alignment).
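
Spelling the arithmetic out as I understand it: ".align 4" is 2^4 = 16
bytes, so the patched region is:

	tramp - 8:	<literal>	// 16-byte aligned
	tramp + 0:	bti c		// entry point
	tramp + 4:	<branch or nop>

i.e. the 16 bytes covered by the two u64 stores form one naturally
aligned 16-byte block, which any cacheline or fetch window of at least
16 bytes covers in one go.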

But are I$ and D$ coherent? One access goes through I-fetch, the other
is a regular D-side load.

However, Will has previously expressed reluctance to rely on such
things.

> +	if (!WARN_ON(ret))
> +		caches_clean_inval_pou((u64)tramp - 8,
> +				       (u64)tramp - 8 + sizeof(insns));
> +}