Re: [RFC] Randomness on confidential computing platforms
From: H. Peter Anvin
Date: Mon Jan 29 2024 - 17:13:37 EST
On January 29, 2024 1:17:07 PM PST, "H. Peter Anvin" <hpa@xxxxxxxxx> wrote:
>On January 29, 2024 1:04:23 PM PST, Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>>On 1/29/24 12:26, Kirill A. Shutemov wrote:
>>>>> Do we care?
>>>> I want to make sure I understand the scenario:
>>>>
>>>> 1. We're running in a guest under TDX (or SEV-SNP)
>>>> 2. The VMM (or somebody) is attacking the guest by eating all the
>>>> hardware entropy and RDRAND is effectively busted
>>>> 3. Assuming kernel-based panic_on_warn and WARN_ON() rdrand_long()
>>>> failure, that rdrand_long() never gets called.
>>> Never gets called during attack. It can be used before and after.
>>>
>>>> 4. Userspace is using RDRAND output in some critical place like key
>>>> generation and is not checking it for failure, nor mixing it with
>>>> entropy from any other source
>>>> 5. Userspace uses the failed RDRAND output to generate a key
>>>> 6. Someone exploits the horrible key
>>>>
>>>> Is that it?
>>> Yes.
>>
>>Is there something that fundamentally makes this a VMM vs. TDX guest
>>problem? If a malicious VMM can exhaust RDRAND, why can't malicious
>>userspace do the same?
>>
>>Let's assume buggy userspace exists. Is that userspace *uniquely*
>>exposed to a naughty VMM or is that VMM just added to the list of things
>>that can attack buggy userspace?
>
>The concern, I believe, is that a TDX guest is vulnerable as a *victim*, especially if the OS is being malicious.
>
>However, as you say a malicious user space including a conventional VM could try to use it to attack another. The only thing we can do in the kernel about that is to be resilient.
>
>Note that there is an option to the kernel to suspend boot until enough entropy has been gathered that predicting the output of the entropy pool in the kernel ought to be equivalent to breaking AES (in which case we have far worse problems.) To harden the VM case in general perhaps we should consider RDRAND to have zero entropy credit when used as a fallback for RDSEED.
>
So as far as I understand, the uncore bus (at least at the time RDRAND/RDSEED was designed) is a single-transaction bus; once a read transaction has been accepted by the bus the bus is locked until the reply is sent (like PCI.) As such, the RNG unit simply doesn't have the option of not returning a response without holding the whole uncore bus locked. However, I believe that if another core is waiting for the bus, that request will be served before the first core can return for more.
If the RNG bit source is crippled for some reason to the point of being near failure, it is certainly possible for a livelock to happen, but at least as far as I understand, a source degraded enough to cause 16 failures in a row is so close to total failure that it might as well be treated as one.
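
To make the "16 failures in a row" point concrete, here is a minimal sketch of a bounded retry that reports total failure instead of spinning. The names `rng_step_fn`, `rng_get_with_retry()` and `RNG_RETRY_LIMIT` are hypothetical, not kernel interfaces, and the hardware instruction is hidden behind a callback so the logic can be exercised without RDRAND. The `fake_rng` source exists only for testing.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical retry bound; 16 matches the figure discussed above. */
#define RNG_RETRY_LIMIT 16

/*
 * One attempt at the hardware source (for example a wrapper around RDRAND).
 * Returns true and fills *out on success, false on failure.
 */
typedef bool (*rng_step_fn)(uint64_t *out, void *ctx);

/*
 * Try the source up to RNG_RETRY_LIMIT times. A false return means the
 * caller must treat the source as failed; *out is zeroed in that case so
 * that no stale value is mistaken for random data.
 */
static bool rng_get_with_retry(rng_step_fn step, void *ctx, uint64_t *out)
{
	for (int i = 0; i < RNG_RETRY_LIMIT; i++) {
		if (step(out, ctx))
			return true;
	}
	*out = 0;
	return false;
}

/* Test double: fails a configurable number of times, then succeeds. */
struct fake_rng {
	int failures_left;
	int calls;
};

static bool fake_rng_step(uint64_t *out, void *ctx)
{
	struct fake_rng *f = ctx;

	f->calls++;
	if (f->failures_left > 0) {
		f->failures_left--;
		return false;
	}
	*out = 0x0123456789abcdefULL; /* fixed value, test only */
	return true;
}
```

The important design choice is that the caller gets an explicit failure it has to handle, rather than a value that merely looks random.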
*Any* security-sensitive application that doesn't take total RNG failure into account is fundamentally broken. *Any* hardware random number generator is inherently an analog device, and as such has a nonzero probability of failure. It has an integrity monitor, but all it can do is say "no" and not credit entropy, thereby slowing down and eventually stopping the unit (even RDRAND has a minimum seeding frequency guarantee, unlike /dev/urandom.)
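
The zero-credit idea from earlier in the thread can be sketched as follows. This is an illustration under stated assumptions, not how the kernel's entropy pool is implemented: `entropy_pool`, `pool_add()`, `pool_ready()`, the 256-bit threshold and the 64-bit credit per RDSEED sample are all hypothetical, and the mixer is a toy stand-in for a real cryptographic hash. Samples obtained from RDRAND as a fallback are still mixed in, but they never move the pool towards "ready".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical threshold before the pool may be used for key material. */
#define POOL_READY_BITS 256

enum sample_source {
	SRC_RDSEED,          /* credited */
	SRC_RDRAND_FALLBACK, /* mixed in, but credited as zero entropy */
};

struct entropy_pool {
	uint64_t state;         /* toy mixing state, NOT cryptographic */
	unsigned credited_bits; /* entropy we are willing to vouch for */
};

/* Assumed policy: 64 bits per RDSEED sample, nothing for the fallback. */
static unsigned credit_for(enum sample_source src)
{
	return src == SRC_RDSEED ? 64 : 0;
}

/* Mix a sample into the pool and credit it according to its source. */
static void pool_add(struct entropy_pool *p, enum sample_source src,
		     uint64_t sample)
{
	/* Toy mixer; real code would feed a hash or an extractor. */
	p->state = (p->state ^ sample) * 0x9e3779b97f4a7c15ULL;

	p->credited_bits += credit_for(src);
	if (p->credited_bits > POOL_READY_BITS)
		p->credited_bits = POOL_READY_BITS;
}

/* Consumers such as key generation should refuse to run until this is true. */
static bool pool_ready(const struct entropy_pool *p)
{
	return p->credited_bits >= POOL_READY_BITS;
}
```

With this policy an attacker who forces the fallback path can delay readiness, but cannot make the pool claim entropy it was never given.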