RE: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: Reshetova, Elena
Date: Thu Feb 01 2024 - 02:26:31 EST


> On Wed, Jan 31, 2024 at 02:06:13PM +0100, Jason A. Donenfeld wrote:
>
> Hi again to everyone, beautiful day here in North Dakota.
>
> > On Wed, Jan 31, 2024 at 9:17 AM Reshetova, Elena
> > <elena.reshetova@xxxxxxxxx> wrote:
> > > This matches both my understanding (I have a cryptography background
> > > and understand how cryptographic RNGs work)
> > > and the official public docs that Intel has published on this matter.
> > > Given that the physical entropy source is limited anyhow, by putting
> > > enough pressure on the whole construction you should be able to
> > > make RDRAND fail: if the intermediate AES-CBC-MAC extractor/
> > > conditioner is not getting its minimum entropy input rate, it won't
> > > produce a proper seed for the AES-CTR DRBG.
> > > Of course, exact details and numbers can vary between different
> > > generations of the Intel DRNG implementation and the platforms it
> > > runs on, so be careful about sticking to concrete numbers.
>
> > Alright, so RDRAND is not reliable. The question for us now is: do
> > we want RDRAND unreliability to translate to another form of
> > unreliability elsewhere, e.g. DoS/infiniteloop/latency/WARN_ON()? Or
> > would it be better to declare the hardware simply broken and ask
> > Intel to fix it? (I don't know the answer to that question.)
>
> I think it would demonstrate a lack of appropriate engineering
> diligence on the part of our community to declare RDRAND 'busted' at
> this point.
>
> While it appears to be trivially easy to force RDSEED into depletion,
> there does not seem to be a suggestion, at least in the open
> literature, that this directly or easily translates into stalling
> output from RDRAND in any type of relevant adversarial fashion.
>
> If this were the case, given what CVEs seem to be worth on a resume,
> someone would have rented a cloud machine and come up with a POC
> against RDRAND in a multi-tenant environment and then promptly put up
> a web-site called 'Random Starve' or something equally ominous.
>
> This is no doubt secondary to the 1022x amplification factor inherent in
> the 'Bull Mountain' architecture.
>
> I'm a bit surprised that no one from the Intel side of this
> conversation pitched this over the wall as soon as this
> conversation came up, but I would suggest that everyone concerned
> about this issue give the following a thorough read:
>
> https://www.intel.com/content/www/us/en/developer/articles/guide/intel-digital-random-number-generator-drng-software-implementation-guide.html
>
> Relevant highlights:
>
> - As I suggested in my earlier e-mail, random number generation is a
> socket based resource, hence an adversarial domain limited to only
> the cores on a common socket.
>
> - There is a maximum randomness throughput rate of 800 MB/s over all
> cores sharing common random number infrastructure. Single thread
> throughput rates of 70-200 MB/s are demonstrable.
>
> - The probability of RDRAND failing across 10 retries is described
> as 'astronomically' small, with no definition of astronomical
> provided; one would assume really small, given they are using the
> word astronomical.

As I said, I want to investigate this properly before stating anything.
In a CoCo VM we cannot guarantee that a victim guest is able to execute
this 10-retry loop without interruption, since all guest scheduling is
under the host's control (the official guide also lists a tightness
requirement for the loop that is not further specified). Again, this is an
angle that was not present before, and I want to make sure we are protected
against this case.
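
For reference, the retry pattern under discussion can be sketched roughly
as below. This is only an illustration, not the kernel's code: the
rng_step_fn callback, rdrand_retry(), and the flaky_src test double are
hypothetical names, and the callback stands in for the RDRAND instruction
(success here mirrors the carry flag being set). The limit of 10 is the
retry count cited from the Intel guide.

```c
#include <stdbool.h>
#include <stdint.h>

#define RDRAND_RETRY_LIMIT 10   /* retry count cited from the Intel guide */

/* Hypothetical stand-in for one RDRAND attempt: returns true and fills
 * *out on success, mirroring the instruction's CF=1 convention. */
typedef bool (*rng_step_fn)(void *ctx, uint64_t *out);

/* Attempt the source up to RDRAND_RETRY_LIMIT times in a tight loop. */
static bool rdrand_retry(rng_step_fn step, void *ctx, uint64_t *out)
{
    for (int i = 0; i < RDRAND_RETRY_LIMIT; i++) {
        if (step(ctx, out))
            return true;
    }
    return false;   /* policy (warn, panic, fall back) is left to the caller */
}

/* Test double (assumption): fails a set number of times, then succeeds. */
struct flaky_src {
    int failures_left;
    int calls;
};

static bool flaky_step(void *ctx, uint64_t *out)
{
    struct flaky_src *s = ctx;

    s->calls++;
    if (s->failures_left > 0) {
        s->failures_left--;
        return false;
    }
    *out = 0x1234u;  /* placeholder value, not random */
    return true;
}
```

Note that nothing in such a loop can tell whether the attempts ran back to
back or were spread out by the host, which is exactly the property I want
to check.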

>
> > > That said, I have taken an AR to follow up internally on what can be done
> > > to improve our situation with RDRAND/RDSEED.
>
> I think I can save you some time Elena.
>
> > Specifying this is an interesting question. What exactly might our
> > requirements be for a "non-broken" RDRAND? It seems like we have two
> > basic ones:
> >
> > - One VMX (or host) context can't DoS another one.
> > - Ring 3 can't DoS ring 0.
> >
> > I don't know whether that'd be implemented with context-tied rate
> > limiting or more state or what. But I think, short of just making
> > RDRAND never fail, that's basically what's needed.
>
> I think we probably have that, for all intents and purposes, given
> that we embrace the following methodology:
>
> - Use RDRAND exclusively.
>
> - Be willing to take 10 swings at the plate.
>
> - Given the somewhat demanding requirements for TDX/COCO, fail and
> either deadlock or panic after 10 swings since that would seem to
> suggest the hardware is broken, ie. RMA time.

Again, my worry here is that a CoCo guest is not in control of its own
scheduling, and this might affect the above statement, i.e. it might
theoretically be possible to cause this without physically broken HW.
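
To make the concern concrete, here is a toy model. It is a sketch under
strong assumptions and not a claim about real hardware behaviour: time is
divided into slots, avail[t] says whether an RDRAND issued in slot t would
succeed, and we assume purely for illustration that some slots fail. All
names are hypothetical. A guest that runs its retries in consecutive slots
succeeds as soon as one slot is good, while a host that decides when the
guest runs can place every attempt in a failing slot.

```c
#include <stdbool.h>
#include <stddef.h>

#define RETRIES 10

/* Guest runs its retries in consecutive slots starting at `start`.
 * Returns true if any attempt lands in a slot where the source works. */
static bool retries_uninterrupted(const bool *avail, size_t n, size_t start)
{
    for (size_t i = 0; i < RETRIES && start + i < n; i++) {
        if (avail[start + i])
            return true;
    }
    return false;
}

/* Hostile host: schedules the guest only in slots where the source is
 * unavailable. Returns true if it can make all RETRIES attempts fail. */
static bool host_can_force_failure(const bool *avail, size_t n)
{
    size_t failed_attempts = 0;

    for (size_t t = 0; t < n && failed_attempts < RETRIES; t++) {
        if (!avail[t])
            failed_attempts++;   /* guest is run here; the attempt fails */
    }
    return failed_attempts == RETRIES;
}
```

In this model the hardware is healthy half of the time or more, yet the
guest still sees 10 failures in a row. Whether such failing windows can be
induced for RDRAND in practice is the part I need to investigate.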

Best Regards,
Elena.

>
> Either deadlock or panic would be appropriate. The objective in the
> COCO environment is to get the person who clicked on the 'Enable Azure
> Confidential' checkbox, or its equivalent, on their cloud dashboard,
> to call the HelpDesk and ask them why their confidential application
> won't come up.
>
> After the user confirms to the HelpDesk that their computer is plugged
> in, the problem will get fixed. Either the broken hardware will be
> identified and idled out or the mighty sword of vengeance will be
> summoned down on whoever has all of the other cores on the socket
> pegged.
>
> Final thoughts:
>
> - RDSEED is probably a poor thing to be using.
>
> - There may be a reasonable argument that RDSEED shouldn't have been
> exposed above ring 0, but that ship has sailed. Brownie points
> moving forward for an RDsomething that is ring 0 and has guaranteed
> access to some amount of functionally reasonable entropy.
>
> - Intel and AMD are already doing a lot of 'special' stuff with their
> COCO hardware in order to defy the long standing adage of: 'You
> can't have security without physical security'. Access to per core thermal
> noise, as I suggested, is probably a big lift but clever engineers can
> probably cook up some type of fairness doctrine for randomness in
> TDX or SEV_SNP, given the particular importance of instruction based
> randomness in COCO.
>
> - Perfection is the enemy of good.
>
> > Jason
>
> Have a good day.
>
> As always,
> Dr. Greg
>
> The Quixote Project - Flailing at the Travails of Cybersecurity
> https://github.com/Quixote-Project