Re: [PATCH RFC v4 1/1] random: WARN on large getrandom() waits and introduce getrandom2()

From: Andy Lutomirski
Date: Fri Sep 20 2019 - 16:51:54 EST


On Fri, Sep 20, 2019 at 12:51 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> > And the only real question is how to map existing users to these
> > semantics. I see two sensible choices:
> >
> > 1. 0 means "secure, blocking". I think this is not what we'd do if we
> > could go back in time and change the ABI from day 1, but I think it's
> > actually good enough. As long as this mode won't deadlock, it's not
> > *that* bad if programs are using it when they wanted "insecure".
>
> It's exactly that "as long as it won't deadlock" that is our current problem.
>
> It *does* deadlock.
>
> So it can't mean "blocking" in any long-term meaning.
>
> It can mean "blocks for up to 15 seconds" or something like that. I'd
> honestly prefer a smaller number, but I think 15 seconds is an
> acceptable "your user space is buggy, but we won't make you think the
> machine hung".

To be clear, when I say "blocking", I mean "blocks until we're ready,
but we make sure we're ready in a moderately timely manner".

Rather than answering everything point by point, here's an updated
mini-proposal and some thoughts. There are two families of security
people that I think we care about. One is the FIPS or CC or PCI
crowd, and they might, quite reasonably, demand actual hardware RNGs.
We should make the hwrng API stop sucking and they should be happy.
(This means expose an hwrng device node per physical device, IMO.)
The other is the one who wants getrandom(), etc. to be convincingly
secure and is willing to do some actual analysis. And I think we can
make them quite happy like this:

In the kernel, we have two types of requests for random numbers: a
request for "secure" bytes and a request for "insecure" bytes.
Requests for "secure" bytes can block or return -EAGAIN. Requests for
"insecure" bytes succeed without waiting. In addition, we have a
jitter entropy mechanism (maybe the one mjg59 referenced, maybe
Alexander's -- doesn't really matter) and we *guarantee* that jitter
entropy, by itself, is enough to get the "secure" generator working
after, say, 5s of effort. By this, I mean that, on an idle system, it
finishes in 5s and, on a fully loaded system, it's allowed to take a
little while longer but not too much longer.

In other words, I want GRND_SECURE_BLOCKING and /dev/random reads to
genuinely always work and to genuinely never take much longer than 5s.
I don't want a special case where they fail.
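
As a rough illustration of the two request types, here is a small
userspace model. It is not kernel code: every name in it
(rng_model, rng_model_request, RNG_MODEL_SEED_BOUND_MS) is made up for
the sketch, and the 5000 ms constant only encodes the "say, 5s" figure
for an idle system.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of the two in-kernel request types. */
enum rng_request { RNG_REQ_INSECURE, RNG_REQ_SECURE };

/* Assumed bound: jitter entropy alone seeds the generator within 5s
 * on an idle system. */
#define RNG_MODEL_SEED_BOUND_MS 5000u

struct rng_model {
	unsigned ready_at_ms;	/* time at which the secure generator is seeded */
};

/*
 * Returns how many milliseconds the caller waits (0 means no wait),
 * or -EAGAIN when a secure request that may not block arrives before
 * the generator is ready.
 */
static long rng_model_request(const struct rng_model *m,
			      enum rng_request req, bool may_block,
			      unsigned now_ms)
{
	/* The guarantee being modeled: readiness is bounded. */
	assert(m->ready_at_ms <= RNG_MODEL_SEED_BOUND_MS);

	/* "Insecure" requests succeed without waiting. */
	if (req == RNG_REQ_INSECURE || now_ms >= m->ready_at_ms)
		return 0;

	/* "Secure" requests either fail fast or block until ready. */
	if (!may_block)
		return -EAGAIN;

	return (long)(m->ready_at_ms - now_ms);
}
```

With a model that becomes ready at 3000 ms, a blocking secure request
made at 1000 ms waits 2000 ms, the non-blocking variant gets -EAGAIN,
and an insecure request returns immediately.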

The exposed user APIs are, subject to bikeshedding that can happen
later over the actual values, etc:

GRND_SECURE_BLOCKING: returns "secure" output and blocks until it's
ready. This never fails, but it also never blocks forever.

GRND_SECURE_NONBLOCKING: same but returns -EAGAIN instead of blocking.

GRND_INSECURE: returns "insecure" output immediately. I think we do
need this -- the "secure" mode may take a little while at early boot,
and libraries that initialize themselves with some randomness really
do want a way to get some numbers without any delay whatsoever.

0: either the same as GRND_SECURE_BLOCKING plus a warning or the
"accelerated" version. The "accelerated" version means wait up to 2s
for secure numbers and, if there still aren't any, fall back to
"insecure".

GRND_RANDOM: either the same as 0 or the same as GRND_SECURE_BLOCKING
but with a warning. I don't particularly care either way.
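
The flag list above could be modeled roughly as follows. This is a
sketch only: the MODEL_* flag values are placeholders (the real values
are exactly the part left to bikeshedding), model_getrandom() is a
made-up helper rather than the syscall, flags == 0 is shown in its
"accelerated" variant, and GRND_RANDOM is left out because it would
just alias one of the other cases. Treating GRND_INSECURE output as
seeded once the generator is ready is also an assumption of the model.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Placeholder flag values, chosen only for this sketch. */
#define MODEL_GRND_SECURE_BLOCKING	0x10u
#define MODEL_GRND_SECURE_NONBLOCKING	0x20u
#define MODEL_GRND_INSECURE		0x40u

/* "Accelerated" mode waits at most 2s before falling back. */
#define MODEL_ACCEL_WAIT_MS		2000u

struct model_result {
	int err;		/* 0, -EAGAIN or -EINVAL */
	bool seeded;		/* bytes came from a seeded generator */
	unsigned waited_ms;	/* how long the caller was blocked */
};

/* ms_until_ready: time left until the secure generator is seeded. */
static struct model_result model_getrandom(unsigned flags,
					   unsigned ms_until_ready)
{
	struct model_result r = { 0, true, 0 };

	switch (flags) {
	case MODEL_GRND_SECURE_BLOCKING:
		/* Never fails; blocks for a bounded time. */
		r.waited_ms = ms_until_ready;
		break;
	case MODEL_GRND_SECURE_NONBLOCKING:
		if (ms_until_ready > 0) {
			r.err = -EAGAIN;
			r.seeded = false;
		}
		break;
	case MODEL_GRND_INSECURE:
		/* Returns immediately, seeded or not. */
		r.seeded = (ms_until_ready == 0);
		break;
	case 0:
		/* Accelerated: wait up to 2s, then fall back to insecure. */
		if (ms_until_ready <= MODEL_ACCEL_WAIT_MS) {
			r.waited_ms = ms_until_ready;
		} else {
			r.waited_ms = MODEL_ACCEL_WAIT_MS;
			r.seeded = false;
		}
		break;
	default:
		r.err = -EINVAL;
		r.seeded = false;
		break;
	}
	return r;
}
```

For example, model_getrandom(0, 3000) reports success after a 2000 ms
wait with seeded == false, which is the fallback case that makes the
accelerated mode well-defined but weaker than GRND_SECURE_BLOCKING.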

I'm okay with a well-defined semantic like I proposed for an
accelerated mode. I don't really want to try to define what a
secure-but-not-as-secure mode means as a separate complication that
the underlying RNG needs to support forever. I don't think the
security folks would like that either.

How does this sound?