Re: [PATCH 2/4] arm64: implement support for static call trampolines

From: Mark Rutland
Date: Mon Sep 27 2021 - 04:59:22 EST


On Sat, Sep 25, 2021 at 05:46:23PM +0000, David Laight wrote:
> From: Mark Rutland
> > Sent: 21 September 2021 17:28
> >
> > On Tue, Sep 21, 2021 at 05:55:11PM +0200, Ard Biesheuvel wrote:
> > > On Tue, 21 Sept 2021 at 17:33, Mark Rutland <mark.rutland@xxxxxxx> wrote:
> > > >
> > > > On Tue, Sep 21, 2021 at 04:44:56PM +0200, Ard Biesheuvel wrote:
> > > > > On Tue, 21 Sept 2021 at 09:10, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> > > ...
> ...
> > > >
> > > > I think so, yes. We can do sligntly better with an inline literal pool
> > > > and a PC-relative LDR to fold the ADRP+LDR, e.g.
> > > >
> > > > .align 3
> > > > tramp:
> > > > BTI C
> > > > {B <func> | RET | NOP}
> > > > LDR X16, 1f
> > > > BR X16
> > > > 1: .quad <literal>
> > > >
> > > > Since that's in the .text, it's RO for regular accesses anyway.
> > > >
> > >
> > > I tried to keep the literal in .rodata to avoid inadvertent gadgets
> > > and/or anticipate exec-only mappings of .text, but that may be a bit
> > > overzealous.
> >
> > I think that in practice the risk of gadgetisation is minimal, and
> > having it inline means we only need to record a single address per
> > trampoline, so there's less risk that we get the patching wrong.
>
> But doesn't that mean that it is almost certainly a data cache miss?
> You really want an instruction that reads the constant from the I-cache.
> Or at least be able to 'bunch together' the constants so they
> stand a chance of sharing a D-cache line.

The idea is that in the common case we don't even use the literal, and
the `B <func>` goes to the target.

The literal is there as a fallback for when the target is a sufficiently
long distance away (more than +/-128MiB from the `BR X16`). By default
we try to keep modules within 128MiB of the kernel image, and this
should only happen in uncommon configs (e.g. my debug kernel configs
when the kernel can be 100s of MiBs).

With that in mind, I'd strongly prefer to optimize for simplicity rather
than making the uncommon case faster.

Thanks,
Mark.