Re: clang asm-goto support (Was Re: [PATCH v2] x86/retpoline: Add clang support)
From: Peter Zijlstra
Date: Wed Feb 14 2018 - 05:35:50 EST
On Wed, Feb 14, 2018 at 09:52:59AM +0000, Yatsina, Marina wrote:
> Hi Peter,
>
> When I started the original thread last year I was in favor of adding
> "asm goto" and didn't understand why it wasn't done by that time. The
> feedback I got is that this feature (optimizing tracepoints) is very
> useful and that we do want it in llvm, but perhaps there's a cleaner
> way of implementing than "asm goto". An alternative suggestion arose
> as well.
So it's far more than just tracepoints. We use it all over the kernel to
do runtime branch patching.
One example is avoiding the scheduler preemption callbacks if we know
there are no users. This shaves a few % off a context switch
micro-bench.
But it is really _all_ over the place.
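For a feel of what that looks like at a call site, here is a minimal sketch
using the generic static-key API from <linux/jump_label.h>; the key and
function names are invented for illustration, not the actual scheduler code:

DEFINE_STATIC_KEY_FALSE(my_feature_key);	/* hypothetical key */

static void my_hot_path(void)			/* hypothetical caller */
{
	/*
	 * Compiles down to one of the arch_static_branch*() stubs quoted
	 * further down: a NOP while the key is disabled, patched into a
	 * jump once someone calls static_branch_enable(&my_feature_key).
	 */
	if (static_branch_unlikely(&my_feature_key))
		do_the_optional_work();		/* hypothetical */
}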
> I'm sure you can provide a lot of background for the decisions of why
> "asm goto" was chosen and which other alternatives were considered, as
> you were the one to implement this.
I have very few memories from back then, but it was mostly us asking
for label addresses in asm and them giving us asm-goto.
Using asm we can build our own primitives, and I realize the llvm
community doesn't like asm much, but then again, we treat C like a
glorified assembler and don't like our compilers being too smart :-)
> Anyway, I think we should consider the alternatives and not take "asm
> goto" as a given.
Far too late for that; 7+ years ago, when we did this, was the time to
talk about alternatives. Now we have this code base.
So we have the two jump_label things:
static __always_inline bool arch_static_branch(struct static_key *key, bool branch)
{
	asm_volatile_goto("1:"
		".byte " __stringify(STATIC_KEY_INIT_NOP) "\n\t"
		".pushsection __jump_table, \"aw\" \n\t"
		_ASM_ALIGN "\n\t"
		_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
		".popsection \n\t"
		: : "i" (key), "i" (branch) : : l_yes);

	return false;
l_yes:
	return true;
}

static __always_inline bool arch_static_branch_jump(struct static_key *key, bool branch)
{
	asm_volatile_goto("1:"
		".byte 0xe9\n\t .long %l[l_yes] - 2f\n\t"
		"2:\n\t"
		".pushsection __jump_table, \"aw\" \n\t"
		_ASM_ALIGN "\n\t"
		_ASM_PTR "1b, %l[l_yes], %c0 + %c1 \n\t"
		".popsection \n\t"
		: : "i" (key), "i" (branch) : : l_yes);

	return false;
l_yes:
	return true;
}
Where we emit either a 5-byte jump or a 5-byte NOP and write a special
section with meta-data for the branch point.
You could possibly capture all that with a built-in, but it would have to
exactly match our meta-data section, and then we'd still be up some creek
without a paddle when we need to change it.
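For reference, the triplet that _ASM_PTR emits into __jump_table corresponds,
roughly and for the x86-64 layout of that era, to:

typedef u64 jump_label_t;

struct jump_entry {
	jump_label_t code;	/* address of the NOP/JMP site ("1b") */
	jump_label_t target;	/* branch target (%l[l_yes]) */
	jump_label_t key;	/* &key, branch sense in the low bit (%c0 + %c1) */
};

The jump-label patching code walks these entries to flip each site between
NOP and JMP at runtime, so a built-in would have to emit exactly this layout
and follow it whenever we change it.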
But we also have:
static __always_inline __pure bool _static_cpu_has(u16 bit)
{
	asm_volatile_goto("1: jmp 6f\n"
		 "2:\n"
		 ".skip -(((5f-4f) - (2b-1b)) > 0) * "
			 "((5f-4f) - (2b-1b)),0x90\n"
		 "3:\n"
		 ".section .altinstructions,\"a\"\n"
		 " .long 1b - .\n"		/* src offset */
		 " .long 4f - .\n"		/* repl offset */
		 " .word %P[always]\n"		/* always replace */
		 " .byte 3b - 1b\n"		/* src len */
		 " .byte 5f - 4f\n"		/* repl len */
		 " .byte 3b - 2b\n"		/* pad len */
		 ".previous\n"
		 ".section .altinstr_replacement,\"ax\"\n"
		 "4: jmp %l[t_no]\n"
		 "5:\n"
		 ".previous\n"
		 ".section .altinstructions,\"a\"\n"
		 " .long 1b - .\n"		/* src offset */
		 " .long 0\n"			/* no replacement */
		 " .word %P[feature]\n"		/* feature bit */
		 " .byte 3b - 1b\n"		/* src len */
		 " .byte 0\n"			/* repl len */
		 " .byte 0\n"			/* pad len */
		 ".previous\n"
		 ".section .altinstr_aux,\"ax\"\n"
		 "6:\n"
		 " testb %[bitnum],%[cap_byte]\n"
		 " jnz %l[t_yes]\n"
		 " jmp %l[t_no]\n"
		 ".previous\n"
		 : : [feature]  "i" (bit),
		     [always]   "i" (X86_FEATURE_ALWAYS),
		     [bitnum]   "i" (1 << (bit & 7)),
		     [cap_byte] "m" (((const char *)boot_cpu_data.x86_capability)[bit >> 3])
		 : : t_yes, t_no);
t_yes:
	return true;
t_no:
	return false;
}
Which does something similar, but with a completely different meta-data
section and a different pre-patch fallback path.
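Call sites hide behind static_cpu_has(); a hedged sketch of typical usage,
with the caller and helpers invented for illustration:

static void flush_example(void)			/* hypothetical */
{
	/*
	 * Before alternatives patching runs, this goes through the
	 * .altinstr_aux testb on boot_cpu_data; afterwards the inline
	 * site is either a NOP (feature present) or a jmp to the
	 * feature-absent path.
	 */
	if (static_cpu_has(X86_FEATURE_CLFLUSHOPT))
		do_clflushopt_loop();		/* hypothetical */
	else
		do_clflush_loop();		/* hypothetical */
}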
But we also do things like:
#define __GEN_RMWcc(fullop, var, cc, clobbers, ...)			\
do {									\
	asm_volatile_goto (fullop "; j" #cc " %l[cc_label]"		\
			: : [counter] "m" (var), ## __VA_ARGS__		\
			: clobbers : cc_label);				\
	return 0;							\
cc_label:								\
	return 1;							\
} while (0)

#define GEN_UNARY_RMWcc(op, var, arg0, cc)				\
	__GEN_RMWcc(op " " arg0, var, cc, __CLOBBERS_MEM)

static __always_inline bool atomic_dec_and_test(atomic_t *v)
{
	GEN_UNARY_RMWcc(LOCK_PREFIX "decl", v->counter, "%0", e);
}
This is in order to not generate crap asm with SETcc + TEST. Of course, the
last one is superseded by asm-cc-output, which you _also_ don't support.
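For comparison, the asm-cc-output (flag output) form looks roughly like this;
a simplified sketch rather than the exact CC_SET()/CC_OUT() macros the kernel
uses:

static __always_inline bool sketch_atomic_dec_and_test(atomic_t *v)
{
	bool zero;

	/*
	 * "=@cce" (GCC 6+ flag-output constraint) makes ZF itself the
	 * output, so no SETcc + TEST pair is needed.
	 */
	asm volatile (LOCK_PREFIX "decl %[counter]"
		      : [counter] "+m" (v->counter), "=@cce" (zero)
		      : : "memory");
	return zero;
}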
And I know you're going to tell me you guys would prefer it if we
switched to intrinsics for atomics, but then I'd have to tell you that
the C11 memory model doesn't match the Linux Kernel memory model [*],
another result of being late to the game. Also, we still support
compilers from before that.
So no, you're not going to give us something different.
[*] http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0124r4.html
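To make the mismatch slightly more concrete, here is a rough sketch of how the
usual kernel primitives line up with C11 atomics; close, but not an exact
equivalence (the [*] paper goes through the details):

	/* Kernel style (LKMM), writer then reader: */
	WRITE_ONCE(x, 1);
	smp_store_release(&ready, 1);

	if (smp_load_acquire(&ready))
		r = READ_ONCE(x);

	/* Nearest C11 spelling, writer then reader: */
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	atomic_store_explicit(&ready, 1, memory_order_release);

	if (atomic_load_explicit(&ready, memory_order_acquire))
		r = atomic_load_explicit(&x, memory_order_relaxed);

And things like smp_mb__before_atomic() or the address-dependency ordering
behind rcu_dereference() have at best awkward C11 counterparts
(memory_order_consume being the well-known troubled case).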
So ideally, the compilers would actually commit to also implementing the
linux-kernel memory model; otherwise we'll be fighting the compiler
(like we have been for a while now) for even longer.
Esp. with LTO we run a real risk of the compiler doing BAD things to our
code.