Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()
From: Andy Lutomirski
Date: Fri Sep 20 2019 - 15:22:34 EST
On Fri, Sep 20, 2019 at 11:12 AM Willy Tarreau <w@xxxxxx> wrote:
>
> Hi Andy,
>
> On Fri, Sep 20, 2019 at 10:52:30AM -0700, Andy Lutomirski wrote:
> > 2. Fix what is arguably a straight up kernel bug, not even an ABI
> > issue: when a user program is blocking in getrandom(..., 0), the
> > kernel happily sits there doing absolutely nothing and deadlocks the
> > system as a result. This IMO isn't an ABI issue -- it's an
> > implementation problem. How about we make getrandom() (probably
> > actually wait_for_random_bytes()) do something useful to try to seed
> > the RNG if the system is otherwise not doing IO.
>
> I thought about it as well with my old MSDOS reflexes, but here I
> doubt we can do a lot. It seems fishy to me to start to fiddle with
> various drivers from within a getrandom() syscall, we could sometimes
> even end up waiting even longer because one device is already locked,
> and when we have access there there's not much we can do without
> risking to cause some harm. On desktop systems you have a bit more
> choice than on headless systems (blink keyboard leds and time the
> interrupts, run some disk accesses when there's still a disk, get a
> copy of the last buffer of the audio input and/or output, turn on
> the microphone and/or webcam, and collect some data). Many of them
> cannot always be used. We could do some more portable stuff like scan
> and hash the totality of the RAM. But that's all quite bad and
> unreliable and at this point it's better to tell userland "here's
> what I could get for you, if you want better, do it yourself" and the
> userland can then ask the user "dear user, I really need valid entropy
> this time to generate your GPG key, please type frantically on this
> keyboard". And it will be more reliable this way in my opinion.
Perhaps userland could register a helper that takes over and does
something better? But I think the kernel really should do something
vaguely reasonable all by itself. If nothing else, we want the ext4
patch that provoked this whole discussion to be applied, which means
that we need to unbreak userspace somehow, and returning garbage to it
is not a good choice.
Here are some possible approaches that come to mind:

    int count;

    while (crng isn't inited) {
        msleep(1);
    }

and modify add_timer_randomness() to at least credit a tiny bit to
crng_init_cnt.
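
For comparison, the same "sleep a millisecond and check again" idea can be sketched from the userspace side. This is only an illustration and not the in-kernel change proposed above: it assumes Linux with glibc's getrandom() wrapper, and getrandom_poll() and its max_tries parameter are hypothetical names invented for this sketch. With GRND_NONBLOCK, getrandom() fails with EAGAIN while the RNG is not yet initialized, so the caller can bound the wait instead of blocking forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/random.h>
#include <sys/types.h>
#include <time.h>

/*
 * Hypothetical helper, not part of any kernel or libc API.
 *
 * Polls getrandom(GRND_NONBLOCK) about once per millisecond until the
 * kernel RNG is initialized, giving up after max_tries EAGAIN results.
 * Returns 0 once buf holds len random bytes, -1 otherwise
 * (errno is EAGAIN if we gave up waiting).
 */
static int getrandom_poll(void *buf, size_t len, unsigned int max_tries)
{
	const struct timespec one_ms = { .tv_sec = 0, .tv_nsec = 1000000 };
	unsigned char *p = buf;
	unsigned int tries = 0;
	size_t filled = 0;

	while (filled < len) {
		ssize_t n = getrandom(p + filled, len - filled, GRND_NONBLOCK);

		if (n > 0) {
			/* Short reads are possible; keep going. */
			filled += (size_t)n;
			continue;
		}
		if (n < 0 && errno == EINTR)
			continue;	/* interrupted by a signal, retry */
		if (n < 0 && errno == EAGAIN) {
			/* RNG not initialized yet: bounded wait. */
			if (++tries >= max_tries)
				return -1;
			nanosleep(&one_ms, NULL);
			continue;
		}
		return -1;		/* any other result is treated as fatal */
	}
	return 0;
}
```

A caller that needs a key would invoke getrandom_poll(key, sizeof(key), 1000) and, on failure, decide for itself whether to keep waiting or to ask the user for input.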
Or we do something like intentionally triggering readahead on some
offset on the root block device. We should definitely not trigger
*blocking* IO.
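
As a rough userspace analogue of "trigger readahead without blocking", posix_fadvise() with POSIX_FADV_WILLNEED asks the kernel to start reading a range in the background. This is a sketch under assumptions: hint_readahead() is a hypothetical helper, and an in-kernel implementation would use internal readahead paths rather than this syscall.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper, for illustration only.
 *
 * Hints that [offset, offset + len) of the file at path will be needed
 * soon, so the kernel may schedule readahead for it. No data is read
 * into userspace. Returns 0 on success or an errno-style error number.
 */
static int hint_readahead(const char *path, off_t offset, off_t len)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return errno;

	/* posix_fadvise() returns the error number directly, not via errno. */
	err = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
	close(fd);
	return err;
}
```

Whether the resulting IO produces interrupts that the RNG can credit depends on the device and on whether the range is already cached, so this shows only the non-blocking request itself.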
Also, I wonder if the real problem preventing the RNG from starting up
is that the crng_init_cnt threshold is too high. We have a rather
baroque accounting system, and it seems like we can accumulate and
credit entropy for a very long time indeed without actually
considering ourselves done.
--Andy