Re: [PATCH] arm64: jump_label: use constraint "S" instead of "i"

From: Fangrui Song
Date: Thu Feb 01 2024 - 15:56:45 EST


On 2024-02-01, Dave Martin wrote:
On Thu, Feb 01, 2024 at 05:07:59PM +0100, Ard Biesheuvel wrote:
On Thu, 1 Feb 2024 at 12:50, Dave Martin <Dave.Martin@xxxxxxx> wrote:
>
> On Thu, Feb 01, 2024 at 01:11:20AM -0800, Fangrui Song wrote:
...
> >
> > Hmm. Clang's constraint "S" implementation doesn't support a constant
> > offset, so
> > `static_key_false(&nf_hooks_needed[pf][hook])` in include/linux/netfilter.h:nf_hook
> > would not compile with Clang <= 18.
> >
> > I have a patch https://github.com/llvm/llvm-project/pull/80255 , but
> > even if it is accepted and cherry-picked into the 18.x release branch,
> > if we still support older Clang, we cannot use "S" unconditionally.
> >
> >
> > So we probably need the following to prepare for -fPIE support in the
> > future:
> >
> > diff --git a/arch/arm64/include/asm/jump_label.h b/arch/arm64/include/asm/jump_label.h
> > index 48ddc0f45d22..b8af2f8b0c99 100644
> > --- a/arch/arm64/include/asm/jump_label.h
> > +++ b/arch/arm64/include/asm/jump_label.h
> > @@ -15,6 +15,16 @@
> > #define JUMP_LABEL_NOP_SIZE AARCH64_INSN_SIZE
> > +/*
> > + * Prefer "S" to support PIC. However, use "i" for Clang 18 and earlier as "S"
> > + * on a symbol with a constant offset is not supported.
> > + */
> > +#if defined(CONFIG_CC_IS_CLANG) && __clang_major__ <= 18
>
> Is a GCC version check needed? Or is the minimum GCC version specified
> for building the kernel new enough?
>
> > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "i"
> > +#else
> > +#define JUMP_LABEL_STATIC_KEY_CONSTRAINT "S"
> > +#endif
> > +

Can we use "Si" instead?

I thought the point was to avoid "S" on compilers that would choke on
it? If so, those compilers would surely choke on "Si" too, no?

"Si" is an invalid constraint. Unfortunately, GCC recognizes "S" as a
valid constraint and ignores trailing characters (asm_operand_ok). That
is, GCC would accept "Siiiii" as well...

The GCC support for "S" is great. The initial aarch64 port in 2012 supports "S".

> > static __always_inline bool arch_static_branch(struct static_key * const key,
> > const bool branch)
> > {
> > @@ -23,9 +33,9 @@ static __always_inline bool arch_static_branch(struct static_key * const key,
> > " .pushsection __jump_table, \"aw\" \n\t"
> > " .align 3 \n\t"
> > " .long 1b - ., %l[l_yes] - . \n\t"
> > - " .quad %c0 - . \n\t"
> > + " .quad %0 + %1 - . \n\t"
> > " .popsection \n\t"
> > - : : "i"(&((char *)key)[branch]) : : l_yes);
> > + : : JUMP_LABEL_STATIC_KEY_CONSTRAINT(key), "i"(branch) : : l_yes);
>
> While this looks reasonable, I'm still not clear now why the asm must do
> the addition.
>
> Can we roll the branch argument into the macro?
>
> "S"(symbol + constant) seems to do the right thing for AArch64, at least
> in GCC.
>

'symbol' is a struct static_key pointer, so adding '1' to it will not
produce the needed result.

I mean loosely things that resolve to something of the form
"symbol + constant" in the compiler output.

"S"(&((char *)key)[branch]) does indeed seem to do the right thing,
at least with GCC.

I probably didn't help by bikeshedding the way that expression was
written, apologies for that. It's orthogonal to what this patch is
doing.

Yes, "S"(&((char *)key)[branch]) would do the right thing.
I have compared assembly output. It's a matter of "s" vs "s + 0" and "s+1" vs "s + 1".


...
> > > > >If there's another reason why "S" is advantageous though, I'm happy to
> > > > >be corrected.
> > > >
> > > > I remember that Ard has an RFC
> > > > https://lore.kernel.org/linux-arm-kernel/20220427171241.2426592-1-ardb@xxxxxxxxxx/
> > > > "[RFC PATCH 0/2] arm64: use PIE code generation for KASLR kernel"
> > > > and see some recent PIE codegen patches.
> > > >
> > > > > Building the KASLR kernel without -fpie but linking it with -pie works
> > > > > in practice, but it is not something that is explicitly supported by the
> > > > > toolchains - it happens to work because the default 'small' code model
> > > > > used by both GCC and Clang relies mostly on ADRP+ADD/LDR to generate
> > > > > symbol references.
>
> Does this mean that doing boot-time relocation on a fully statically
> linked kernel doesn't bring much benefit?

Not sure I follow you here. The boot-time relocation is necessary in
any case, to fix up statically initialized pointer variables all over
the kernel.

> It would tend to be more
> painful to do the relocation work, and mean that the kernel text has to
> be temporarily writeable, but static linking should have the best
> runtime performance.
>

Not sure what you mean by 'static linking' here.

The only thing PIE linking does in the case of the kernel is
a) complain if static relocations are used that cannot be fixed up at
runtime (e.g., movk/movz sequences)
b) emit dynamic relocations that the early startup code can use to fix
up all statically initialized pointers

From a codegen point-of-view, there is no difference because we don't
use -fpic code. The problem with -fpic codegen is that it makes some
assumptions that only hold for shared ELF objects, e.g., avoiding
dynamic relocations in .text, using GOT entries even for variables
defined in the same compilation so that they can be preempted, keeping
all runtime relocatable quantities together so the CoW footprint of
the shared library is minimized.

None of this matters for the kernel, and as it turns out, the non-PIC
small C model produces code that the PIE linker is happy with.

TL;DR our code (including inline and out-of-line asm) is already PIC,
and this property is relied upon by KASLR.

> Maybe it doesn't make as much difference as I thought (certainly ADRP
> based addressing should make a fair amount of the PIC/PIE overhead go
> away).
>

The PIC/PIE overhead is in the indirections via the GOT. However,
recent linkers will actually perform some relaxations too: if a symbol
is defined in the same executable and is not preemptible, a GOT load

ADRP
LDR

can be transparently converted to

ADRP
ADD

referring to the symbol directly. This is also where hidden.h comes
in: making all symbol declarations and definitions 'hidden' will allow
the compiler to assume that a GOT entry is never needed.

Is there a reason why we don't just build the whole kernel with
-fvisibility=hidden today?

This topic, loosely related to this patch, is about switching to PIC
compiles. I am not familiar with the Linux kernel, so I'll mostly leave
the discussion to you and Ard :)

That said, I have done a lot of Clang work in visibility/symbol
preemptibility and linkers, so I am happy to add what I know.

-fvisibility=hidden only applies to definitions, not non-definition
declarations.

I've mentioned this at
https://lore.kernel.org/all/20220429070318.iwj3j5lpfkw4t7g2@xxxxxxxxxx/

`#pragma GCC visibility push(hidden)` is very similar to -fvisibility=hidden -fdirect-access-external-data with Clang.
...
The kernel uses neither TLS nor -fno-plt, so -fvisibility=hidden
-fdirect-access-external-data can replace `#pragma GCC visibility
push(hidden)`.

So building with -fPIC is currently not need in practice, and creates
some complications, which is why we have been avoiding it. But PIE
linking non-PIC objects is not guaranteed to remain supported going
forward, so it would be better to have a plan B, i.e., being able to
turn on -fpic without massive regressions due to GOT overhead, or
codegen problems with our asm hacks.

Summarising all of this is it right that:

1) ld -pie is how we get the reloc info into the kernel for KASLR
today.

2) We use gcc -fno-pic today, but this might break with ld -pie in the
future, although it works for now.

3) gcc -fno-pic and gcc -fPIC (or -fPIE) generate almost the same code,
assuming we tweak symbol visibility and use a memory model that
ADR+ADD/LDR can span. So, moving to -fPIE is likely to be do-able.


My point is that an alternative option would be to move to ld -no-pie.
We would need another way to get relocs into the kernel, such as an
intermediate step with ld --emit-relocs. I have definitely seen this
done before somewhere, but it would be extra work and not necessarily
worth it, based on what you say about code gen.

This may all have been discussed to death already -- I didn't mean to
hijack this patch, so I'll stop digging here, but if I've misunderstood,
please shout!

Cheers
---Dave