Re: [PATCH] random: defer use of bootloader randomness to random_init()
From: Ard Biesheuvel
Date: Tue Jun 07 2022 - 10:19:46 EST
On Tue, 7 Jun 2022 at 14:22, Jason A. Donenfeld <Jason@xxxxxxxxx> wrote:
>
> Hi Ard,
>
> On Tue, Jun 07, 2022 at 02:15:27PM +0200, Ard Biesheuvel wrote:
> > Note that jump labels use asm() blocks, which are opaque to the
> > compiler, and so it is not guaranteed that codegen will be better than
>
> I actually spent a lot of time looking at the codegen on a few
> platforms.
>
I did a quick experiment on ThunderX2 with the following program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/random.h>

static unsigned char buf[16];

int main(void)
{
	for (int i = 0; i < 1000000; i++) {
		/* Cast to ssize_t so a -1 error return is actually caught. */
		if (getrandom(buf, sizeof(buf),
			      GRND_RANDOM | GRND_NONBLOCK) < (ssize_t)sizeof(buf)) {
			fprintf(stderr, "getrandom() error!\n");
			exit(-1);
		}
	}
	return 0;
}
I ran it both with and without your revert of f5bda35fba615ac applied on
top of v5.19-rc1. The results are below (Cortex-A57 @ 2 GHz):
############## WITH STATIC BRANCH
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
906.01 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
40 page-faults # 0.044 K/sec
1,812,010,034 cycles # 2.000 GHz
1,944,549,733 instructions # 1.07 insn per cycle
<not supported> branches
2,014,408 branch-misses
0.906695576 seconds time elapsed
0.140334000 seconds user
0.765871000 seconds sys
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
918.37 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
37 page-faults # 0.040 K/sec
1,836,733,451 cycles # 2.000 GHz
1,944,572,442 instructions # 1.06 insn per cycle
<not supported> branches
3,012,020 branch-misses
0.919027080 seconds time elapsed
0.159721000 seconds user
0.758718000 seconds sys
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
902.06 msec task-clock # 0.999 CPUs utilized
1 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
39 page-faults # 0.043 K/sec
1,804,103,600 cycles # 2.000 GHz
1,944,563,889 instructions # 1.08 insn per cycle
<not supported> branches
1,956,027 branch-misses
0.902793520 seconds time elapsed
0.172464000 seconds user
0.729822000 seconds sys
############## WITHOUT STATIC BRANCH
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
924.79 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
39 page-faults # 0.042 K/sec
1,849,568,681 cycles # 2.000 GHz
1,950,584,115 instructions # 1.05 insn per cycle
<not supported> branches
4,012,227 branch-misses
0.925500832 seconds time elapsed
0.204227000 seconds user
0.720739000 seconds sys
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
933.06 msec task-clock # 0.999 CPUs utilized
2 context-switches # 0.002 K/sec
0 cpu-migrations # 0.000 K/sec
39 page-faults # 0.042 K/sec
1,866,097,571 cycles # 2.000 GHz
1,950,574,944 instructions # 1.05 insn per cycle
<not supported> branches
3,984,008 branch-misses
0.933798032 seconds time elapsed
0.323067000 seconds user
0.610271000 seconds sys
ard@dogfood:~$ perf_5.10 stat ./rnd
Performance counter stats for './rnd':
913.16 msec task-clock # 0.999 CPUs utilized
1 context-switches # 0.001 K/sec
0 cpu-migrations # 0.000 K/sec
37 page-faults # 0.041 K/sec
1,826,308,530 cycles # 2.000 GHz
1,950,559,902 instructions # 1.07 insn per cycle
<not supported> branches
2,231,050 branch-misses
0.913874672 seconds time elapsed
0.164228000 seconds user
0.749157000 seconds sys
So that's a 0.3% reduction (in terms of actual instructions executed)
in a synthetic benchmark that does nothing but call getrandom() in a
tight loop.
In other words, I think the static branch solves a problem that does not exist.
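
To make concrete what is being compared, the jump-label pattern versus a
plain flag test looks roughly like this (a minimal sketch with made-up
example_* names, not the actual random.c code):

#include <linux/jump_label.h>

/* Jump-label version: the key starts out disabled (false). */
static DEFINE_STATIC_KEY_FALSE(example_ready);

static bool example_is_ready(void)
{
	/*
	 * No memory load and compare here: the check is a single
	 * jump/NOP instruction that gets patched when the key flips.
	 */
	return static_branch_likely(&example_ready);
}

static void example_mark_ready(void)
{
	/* Rewrites the instruction at every call site of the key. */
	static_branch_enable(&example_ready);
}

/* Plain version: an ordinary flag tested with a load and a branch. */
static bool example_ready_flag;

static bool example_is_ready_plain(void)
{
	return example_ready_flag;
}

The whole question is whether avoiding that one load-and-branch is
measurable on a syscall path that already does far more work than that.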
> > > > - Why do we need to enable this static key so early?
> > >
> > > We don't need to enable it especially early. I've now sent three
> > > different approaches for deferring it until later and you suggested one.
> > > The first of mine is kind of ugly (checking static_key_initialized and
> > > such at different points). Your suggested one after that did the same
> > > but deferred into crng_reseed(), which I'm not a fan of. My second one
> > > is this patch, which is flawed for the reason you pointed out. But
> > > perhaps my third one is the right amount of simple and okay? That's the
> > > one I linked up top, [1]. Let me know what you think of that.
> > >
> > > My motivation for not wanting to defer it is that if the arch solution
> > > winds up being easy and straight forward (as it was for arm64), then it
> > > would be nice to not need to clutter up random.c as a result.
> >
> > If clutter is a concern, how about getting rid of the
> > execute_in_process_context() dance, and just use a late_initcall()
> > instead?
>
> As I already explained in [1], this does not work. If the order is
> (A)(B), then all this will happen *after* the late init call.
>
> [1] https://lore.kernel.org/lkml/Yp8oOH+9V336LrLk@xxxxxxxxx/
>
Yeah, I guess finding the right spot is tricky. All the more reason to
just drop it altogether.
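
For readers without the earlier context, the two deferral mechanisms
weighed above look roughly like this (a sketch with made-up example_*
names; only the execute_in_process_context() and late_initcall() usage
reflects the real APIs):

#include <linux/init.h>
#include <linux/workqueue.h>

/* The deferred work, e.g. enabling the static key. */
static void example_enable(struct work_struct *work)
{
	/* ... */
}

static struct execute_work example_ew;

/* The "dance" Ard suggests dropping: run the callback immediately if
 * we are already in process context, otherwise punt it to a workqueue. */
static void example_defer(void)
{
	execute_in_process_context(example_enable, &example_ew);
}

/* The suggested alternative: a fixed, late point during boot. */
static int __init example_late(void)
{
	example_enable(NULL);
	return 0;
}
late_initcall(example_late);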