Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting

From: Ard Biesheuvel
Date: Wed Jun 17 2020 - 04:17:50 EST


On Tue, 16 Jun 2020 at 23:40, Will Deacon <will@xxxxxxxxxx> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.
>

Also, 'some module' vs 'some other module' doesn't really say
anything. Please explain which modules and their sizes.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > ---
> > arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> > 1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > return ret;
> > }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > + || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>

Would it help to check the section index here as well? Call/jump
instructions within a section are never sent through a PLT entry.

> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.
>

ADRP PLTs cannot be deduplicated, as they incorporate a relative jump
back to the caller.

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > + int i = 0, j = numrels - 1;
> > + Elf64_Rela t;
> > +
> > + while (i < j) {
> > + while (rela_needs_dedup(rela + i) && i < j)
> > + i++;
> > + while (!rela_needs_dedup(rela + j) && i < j)
> > + j--;
> > + if (i < j) {
> > + t = *(rela + j);
> > + *(rela + j) = *(rela + i);
> > + *(rela + i) = t;
> > + }
> > + }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant. Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?
>

Agreed.

Also, what's wrong with [] array indexing?