Re: [PATCH 1/2] x86/random: Retry on RDSEED failure

From: H. Peter Anvin
Date: Thu Feb 01 2024 - 14:03:19 EST


On February 1, 2024 10:46:06 AM PST, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>On 2/1/24 10:09, Jason A. Donenfeld wrote:
>> Question ii) Just how DoS-able is RDRAND? From host to guest, where
>> the host controls scheduling, that seems easier, but how much so, and
>> what's the granularity of these operations, and could retries still
>> help, or not at all? What about from guest to guest, where the
>> scheduling is out of control; in that case is there a value of N for
>> which N retries makes it actually impossible to DoS? What about from
>> userspace to kernelspace; good value of N?
>
>So far, in practice, I haven't seen a single failure of RDRAND. It's
>been limited to RDSEED. In a perfect world, I'd change the architecture
>docs to say, "RDRAND only fails when the hardware breaks" and leave
>RDSEED defined to be the one that fails easily.
>
>Dealing with a fragile RDSEED seems like a much easier problem than
>dealing with a fragile RDRAND since RDSEED is used _much_ more sparingly
>in the kernel today.
>
>But I'm not sure if the hardware implementations fit into this perfect
>world I've conjured up. We're going to wrangle up the folks at Intel
>who can hopefully tell me if I'm totally deluded.
>
>Has anyone seen RDRAND failures in practice? Or just RDSEED?
>
>> Question iii) How likely is Intel to actually fix this in a
>> satisfactory way (see "specifying this is an interesting question" in
>> [1])? And if they would, what would the timeline even be?
>
>If the fix is pure documentation, it's on the order of months. I'm
>holding out hope that some kind of anti-DoS claims like you mentioned:
>
>> Specifying this is an interesting question. What exactly might our
>> requirements be for a "non-broken" RDRAND? It seems like we have two
>> basic ones:
>>
>> - One VMX (or host) context can't DoS another one.
>> - Ring 3 can't DoS ring 0.
>
>are still possible on existing hardware, at least for RDRAND.

The real question is: what do we actually need?

During startup, we could afford a *lot* of looping to collect enough entropy before giving up. After that, even if RDSEED fails 99% of the time, it will still produce far more entropy than a typical external randomness source. We don't want to loop that long, obviously (*), but instead try periodically and let the entropy accumulate.

(*) We *could* of course choose to loop aggressively in task context if the task would otherwise block on /dev/random.
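
To make the two regimes concrete, here is a rough userspace sketch, not kernel code and not a proposed patch. Everything in it is hypothetical: fake_rdseed_step() is a stand-in that succeeds once every `period` calls (period = 100 models the "fails 99% of the time" case), and collect_boot(), periodic_tick() and struct pool are illustrative names only. The XOR "mixing" is a toy and says nothing about how the real entropy pool credits input.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RDSEED: succeeds once every `period` calls. */
struct fake_rdseed {
	unsigned int period;	/* 100 => 99% of attempts fail */
	unsigned int calls;	/* total attempts so far */
};

static bool fake_rdseed_step(struct fake_rdseed *src, uint64_t *out)
{
	src->calls++;
	if (src->calls % src->period != 0)
		return false;		/* models CF=0: no data this time */
	*out = 0x9e3779b97f4a7c15ULL * src->calls;	/* dummy value */
	return true;
}

/*
 * Startup regime: we can afford a lot of looping. Try up to max_tries
 * times per word and give up on the first word that never arrives.
 * Returns the number of words actually collected.
 */
static size_t collect_boot(struct fake_rdseed *src, uint64_t *buf,
			   size_t nwords, unsigned int max_tries)
{
	size_t got = 0;

	while (got < nwords) {
		unsigned int tries;

		for (tries = 0; tries < max_tries; tries++)
			if (fake_rdseed_step(src, &buf[got]))
				break;
		if (tries == max_tries)
			break;		/* give up instead of spinning forever */
		got++;
	}
	return got;
}

/* Toy accumulator; assumes 64 bits credited per successful read. */
struct pool {
	uint64_t mix;
	unsigned int bits;
};

/*
 * Steady-state regime: one attempt per periodic tick, no looping.
 * A failure is simply skipped; entropy accumulates across ticks.
 */
static void periodic_tick(struct fake_rdseed *src, struct pool *p)
{
	uint64_t v;

	if (!fake_rdseed_step(src, &v))
		return;
	p->mix ^= v;
	p->bits += 64;
}
```

With period = 100, collect_boot() asking for 4 words with max_tries = 1000 gets all 4 after 400 attempts, while 1000 calls to periodic_tick() still credit 10 words (640 bits) without ever looping. Whether a real implementation would want a retry bound, a back-off, or something else entirely is exactly the open question above.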