Re: [PATCH] x86: use clang builtins to read and write eflags

From: Andy Lutomirski
Date: Mon Sep 14 2020 - 18:02:57 EST


On Mon, Sep 14, 2020 at 2:05 PM Nick Desaulniers
<ndesaulniers@xxxxxxxxxx> wrote:
>
> (Bill, please use `./scripts/get_maintainer.pl <patchfile>` to get the
> appropriate list of folks and mailing lists to CC)
>
> On Thu, Sep 3, 2020 at 8:06 AM Andy Lutomirski <luto@xxxxxxxxxx> wrote:
> >
> > So with my selftests hat on, the inline asm utterly sucks. Trying to
> > use pushfq / popfq violates the redzone and requires a gross hack to
> > work around. It also messes up the DWARF annotations. The kernel
> > gets a free pass on both of these because we don't use a redzone and
> > we don't use DWARF.
>
> Sorry, I don't fully understand:
> 1. What's the redzone?

Userspace ABI. x86_64 userspace is allowed to use 128 bytes below RSP.

> 2. How does pushfq/popfq violate it?

It clobbers a stack slot owned by the compiler.

> 3. What is the "gross hack"/"workaround?" (So that we might consider
> removing it if these builtins help).

Look in tools/testing/selftests/x86/helpers.h. I have a patch to
switch to the builtins.

> 4. The kernel does use DWARF, based on configs, right?

Indeed, but user code in the kernel tree (e.g. selftests) does.

> #ifdef CONFIG_X86_64
> #define __read_eflags __builtin_ia32_readeflags_u64
> #define __write_eflags __builtin_i32_writeeflags_u64
> #else
> #define __read_eflags __builtin_ia32_readeflags_u32
> #define __write_eflags __builtin_i32_writeeflags_u32
> #endif

Looks reasonable to me.

>
> Which would be concise. For smap_save() we can use clac() and stac()
> which might be nicer than `asm goto` (kudos for using `asm goto`, but
> plz no more). :^P Also, we can probably cleanup the note about GCC <
> 4.9 now. :)

I find it a bit implausible that popfq is faster than a branch, so the
smap_restore() code seems suboptimal. For smap_save(), I'm not sure
what's ideal, but the current code seems fine other than the bogus
constraint.

> > > Clang chooses the most general constraint when multiple constraints
> > > are specified. So it will use the stack when retrieving the flags.
> >
> > I haven't looked at clang's generated code to see if it's sensible
> > from an optimization perspective, but surely the kernel code right now
> > is genuinely wrong. GCC is free to omit the frame pointer and
> > generate a stack-relative access for the pop, in which case it will
> > malfunction.
>
> Sorry, this is another case I don't precisely follow, would you mind
> explaining it further to me please?

The compiler is permitted (and expected!) to instantiate an m
constraint as something like offset(%rsp). For example, this:

unsigned long bogus_read_flags(void)
{
unsigned long ret;
asm volatile ("pushfq\n\tpopq %0" : "=m" (ret));
return ret;
}

compiles to:

pushfq
popq -8(%rsp)
movq -8(%rsp), %rax
ret

which is straight-up wrong. Changing it to "=rm" gives:

pushfq
popq %rax
ret

which is correct, but this only works because gcc politely chose the r
option instead of the m option. clang chooses poorly and gives:

read_flags:
pushfq
popq -8(%rsp)
movq -8(%rsp), %rax
retq

which is wrong. I filed a clang bug:

https://bugs.llvm.org/show_bug.cgi?id=47530

but the kernel is buggy here -- clang is within its rights to generate
the bogus sequence above. Bill's email was IMO rather obfuscated, and
I assume this is what he meant.

Of course, clang unhelpfully generates poor code for the builtin, too:

nice_read_eflags:
pushq %rbp
movq %rsp, %rbp
pushfq
popq %rax
popq %rbp
retq

I filed a bug for that, too:

https://bugs.llvm.org/show_bug.cgi?id=47531

So we at least need to fix the bogus "=rm", and we should seriously
consider using the builtin.