Re: [PATCH v1] arm64/module: Optimize module load time by optimizing PLT counting

From: Saravana Kannan
Date: Tue Jun 16 2020 - 23:38:42 EST


On Tue, Jun 16, 2020 at 2:40 PM Will Deacon <will@xxxxxxxxxx> wrote:
>
> On Fri, Jun 05, 2020 at 03:22:57PM -0700, Saravana Kannan wrote:
> > When loading a module, module_frob_arch_sections() tries to figure out
> > the number of PLTs that'll be needed to handle all the RELAs. While
> > doing this, it tries to dedupe PLT allocations for multiple
> > R_AARCH64_CALL26 relocations to the same symbol. It does the same for
> > R_AARCH64_JUMP26 relocations too.
> >
> > To make checks for duplicates easier/faster, it sorts the relocation
> > list by type, symbol and addend. That way, to check for a duplicate
> > relocation, it just needs to compare with the previous entry.
> >
> > However, sorting the entire relocation array is unnecessary and
> > expensive (O(n log n)) because there are a lot of other relocation types
> > that don't need deduping or can't be deduped.
> >
> > So this commit partitions the array into entries that need deduping and
> > those that don't. And then sorts just the part that needs deduping. And
> > when CONFIG_RANDOMIZE_BASE is disabled, the sorting is skipped entirely
> > because PLTs are not allocated for R_AARCH64_CALL26 and R_AARCH64_JUMP26
> > if it's disabled.
> >
> > This gives significant reduction in module load time for modules with
> > large number of relocations with no measurable impact on modules with a
> > small number of relocations. In my test setup with CONFIG_RANDOMIZE_BASE
> > enabled, the load time for one module went down from 268ms to 100ms.
> > Another module went down from 143ms to 83ms.
>
> Whilst I can see that's a significant relative saving, what proportion of
> actual boot time are we talking about here? It would be interesting to
> know if there are bigger potential savings elsewhere.

100ms if pretty big in terms of boot time from a phone OEM
perspective. It adds up. And for these two modules I profiled, it adds
up to 228 ms. That's quite a lot.

Also, if you look at just the kernel init time and all the module load
time, it comes to around 2.2 seconds in the hardware I'm testing on.
That's a 10% reduction in the "kernel init" time. Also, this 2.2
seconds is without async probing. If we do that, this becomes and even
larger % of the kernel init time.

> > This commit also disables the sorting if CONFIG_RANDOMIZE_BASE is
> > disabled because it looks like PLTs are not allocated for
> > R_AARCH64_CALL26 and R_AARCH64_JUMP26 if it's disabled.
> >
> > Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
> > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > ---
> > arch/arm64/kernel/module-plts.c | 37 ++++++++++++++++++++++++++++++++-
> > 1 file changed, 36 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/kernel/module-plts.c b/arch/arm64/kernel/module-plts.c
> > index 65b08a74aec6..bf5118b3b828 100644
> > --- a/arch/arm64/kernel/module-plts.c
> > +++ b/arch/arm64/kernel/module-plts.c
> > @@ -253,6 +253,36 @@ static unsigned int count_plts(Elf64_Sym *syms, Elf64_Rela *rela, int num,
> > return ret;
> > }
> >
> > +static bool rela_needs_dedup(Elf64_Rela *rela)
> > +{
> > + return ELF64_R_TYPE(rela->r_info) == R_AARCH64_JUMP26
> > + || ELF64_R_TYPE(rela->r_info) == R_AARCH64_CALL26;
> > +}
>
> Does this handle A53 erratum 843419 correctly? I'm worried that we skip
> the ADRP PLTs there.

So this isn't really skipping any PLTs or RELAs. It's just picking the
RELAs that can benefit from deduping using sort. Even if I delete the
entire sort() call, all the modules load and the device seems to be
working fine (Video streaming over Wifi works). The catch there is
that the module takes up more memory (because the PLTs are duplicated
often) and you might potentially lose some caching benefits for PLTs
that have a significant amount of repetition (say, printk or something
like that).

You'll see that the duplicate_rel() call isn't even called in any code
related to that erratum. So, I think this is safe.

> > +
> > +/* Group the CALL26/JUMP26 relas toward the beginning of the array. */
> > +static int partition_dedup_relas(Elf64_Rela *rela, int numrels)
> > +{
> > + int i = 0, j = numrels - 1;
> > + Elf64_Rela t;
> > +
> > + while (i < j) {
> > + while (rela_needs_dedup(rela + i) && i < j)
> > + i++;
> > + while (!rela_needs_dedup(rela + j) && i < j)
> > + j--;
> > + if (i < j) {
> > + t = *(rela + j);
> > + *(rela + j) = *(rela + i);
> > + *(rela + i) = t;
> > + }
> > + }
>
> This is very hard to read and I think some of the 'i < j' comparisons are
> redundant.

I have no strong opinion on how this function is written. It's a
trivial partitioning code. I'll be happy to copy paste any clean up
anyone wants to suggest. I think all the i < j checks are necessary
for my current implementation.

> Would it make more sense to assign a temporary rather than
> post-inc/decrement and recheck?

I'm not sure I get what you are trying to say.

-Saravana