Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails

From: Dr. Greg
Date: Thu Feb 01 2024 - 05:58:05 EST


On Thu, Feb 01, 2024 at 07:26:15AM +0000, Reshetova, Elena wrote:

Good morning to everyone.

> > On Wed, Jan 31, 2024 at 02:06:13PM +0100, Jason A. Donenfeld wrote:
> >
> > Hi again to everyone, beautiful day here in North Dakota.
> >
> > > On Wed, Jan 31, 2024 at 9:17 AM Reshetova, Elena
> > > <elena.reshetova@xxxxxxxxx> wrote:
> > > > This matches both my understanding (I do have cryptography background
> > > > and understanding how cryptographic RNGs work)
> > > > and official public docs that Intel published on this matter.
> > > > Given that the physical entropy source is limited anyhow, by putting
> > > > enough pressure on the whole construction you should be able to
> > > > make RDRAND fail, because if the intermediate AES-CBC-MAC extractor/
> > > > conditioner is not getting its min-entropy input rate, it won't
> > > > produce a proper seed for the AES-CTR DRBG.
> > > > Of course the exact details/numbers can vary between different
> > > > generations of the Intel DRNG implementation and the platforms it
> > > > is running on, so be careful about sticking to concrete numbers.
> >
> > > Alright, so RDRAND is not reliable. The question for us now is: do
> > > we want RDRAND unreliability to translate to another form of
> > > unreliability elsewhere, e.g. DoS/infinite loop/latency/WARN_ON()? Or
> > > would it be better to declare the hardware simply broken and ask
> > > Intel to fix it? (I don't know the answer to that question.)
> >
> > I think it would demonstrate a lack of appropriate engineering
> > diligence on the part of our community to declare RDRAND 'busted' at
> > this point.
> >
> > While it appears to be trivially easy to force RDSEED into depletion,
> > there does not seem to be a suggestion, at least in the open
> > literature, that this directly or easily translates into stalling
> > output from RDRAND in any type of relevant adversarial fashion.
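
To be clear about terms, 'depletion' here just means RDSEED returning
with CF=0 because the entropy source behind the conditioner cannot
keep up with demand. A minimal userspace sketch, assuming a toolchain
with RDSEED support and the _rdseed64_step() intrinsic from
immintrin.h (the file name is illustrative), makes the effect visible
on most parts:

/* rdseed-pressure.c: count RDSEED successes vs. failures in a
 * tight loop.  Build with: gcc -O2 -mrdseed rdseed-pressure.c
 */
#include <immintrin.h>
#include <stdio.h>

int main(void)
{
        unsigned long long v, ok = 0, fail = 0;

        for (long i = 0; i < 100000000; i++) {
                /* _rdseed64_step() returns 1 on success (CF=1) and 0
                 * when the entropy source is temporarily exhausted. */
                if (_rdseed64_step(&v))
                        ok++;
                else
                        fail++;
        }
        printf("ok=%llu fail=%llu\n", ok, fail);
        return 0;
}

Run one instance per core on a shared socket and the failure fraction
climbs accordingly. The open question is whether any of that pressure
propagates through the conditioner far enough to stall RDRAND itself.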
> >
> > If this were the case, given what CVEs seem to be worth on a resume,
> > someone would have rented a cloud machine, come up with a PoC
> > against RDRAND in a multi-tenant environment, and then promptly put
> > up a web-site called 'Random Starve' or something equally ominous.
> >
> > This is no doubt secondary to the 1022x amplification factor inherent in
> > the 'Bull Mountain' architecture.
> >
> > I'm a bit surprised that no one from the Intel side of this
> > conversation pitched this over the wall as soon as the issue came
> > up, but I would suggest that everyone concerned about it give the
> > following a thorough read:
> >
> > https://www.intel.com/content/www/us/en/developer/articles/guide/intel-digital-random-number-generator-drng-software-implementation-guide.html
> >
> > Relevant highlights:
> >
> > - As I suggested in my earlier e-mail, random number generation is a
> > socket-based resource, hence an adversarial domain is limited to
> > the cores sharing a common socket.
> >
> > - There is a maximum randomness throughput rate of 800 MB/s across all
> > cores sharing common random number infrastructure. Single-thread
> > throughput rates of 70-200 MB/s are demonstrable.
> >
> > - The probability of RDRAND failing over 10 re-tries is described as
> > 'astronomically' small. No definition of astronomical is provided;
> > one would assume really small, given they are using the word
> > astronomical.

> As I said, I want to investigate this properly before stating
> anything. In a CoCo VM we cannot guarantee that a victim guest is
> able to execute this 10 re-try loop (there is also a tightness
> requirement listed in the official guide that is not further
> specified) without interruption, since all guest scheduling is under
> host control. Again, this is an angle that was not present before,
> and I want to make sure we are protected against this case.

I suspect that all of this may be the source of interesting
discussions inside of Intel; see my closing question below.

If nothing else, we will wait with bated breath for a definition of
astronomical, if, of course, that value is not privileged and you are
free to forward it along... :-)

The arithmetic, presumably, is simple: if each invocation failed
independently with probability p, ten re-tries would fail with
probability p^10, so even a 10% per-call failure rate yields one
failed loop in 10^10. The catch is that independence is exactly what
an adversarially scheduled guest cannot assume.
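
In the meantime anyone can put a crude empirical bound on it. A
sketch along the same lines as above, assuming _rdrand64_step() from
immintrin.h and a POSIX clock (a measurement harness, not a statement
about Intel's internals):

/* rdrand-bound.c: count RDRAND failures and estimate single-thread
 * throughput.  Build with: gcc -O2 -mrdrnd rdrand-bound.c
 */
#include <immintrin.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
        const long N = 100000000;
        unsigned long long v, fail = 0;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < N; i++)
                if (!_rdrand64_step(&v))
                        fail++;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double secs = (t1.tv_sec - t0.tv_sec) +
                      (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("failures: %llu / %ld, ~%.0f MB/s\n",
               fail, N, N * 8 / secs / 1e6);
        return 0;
}

Zero failures over 10^8 calls only bounds the benign case, of course;
it says nothing about a guest whose scheduling is adversarial.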

> > > > That said, I have taken an AR to follow up internally on what can be done
> > > > to improve our situation with RDRAND/RDSEED.
> >
> > I think I can save you some time Elena.
> >
> > > Specifying this is an interesting question. What exactly might our
> > > requirements be for a "non-broken" RDRAND? It seems like we have two
> > > basic ones:
> > >
> > > - One VMX (or host) context can't DoS another one.
> > > - Ring 3 can't DoS ring 0.
> > >
> > > I don't know whether that'd be implemented with context-tied rate
> > > limiting or more state or what. But I think, short of just making
> > > RDRAND never fail, that's basically what's needed.
> >
> > I think we probably have that, for all intents and purposes, given
> > that we embrace the following methodology:
> >
> > - Use RDRAND exclusively.
> >
> > - Be willing to take 10 swings at the plate.
> >
> > - Given the somewhat demanding requirements for TDX/CoCo, fail and
> > either deadlock or panic after 10 swings, since that would seem to
> > suggest the hardware is broken, i.e. RMA time.

> Again, my worry here is that a CoCo guest is not in control of its
> own scheduling, and this might have an impact on the above statement,
> i.e. it might theoretically be possible to cause this without
> physically broken HW.

So all of this leaves open a very significant question that would seem
to be worthy of further enlightenment from inside the bowels of
Intel engineering.

Our discussion has now led us to a point where there appears to be a
legitimate concern that the hypervisor has such significant control
over a confidential VM that the integrity of a simple re-try loop is
an open question.
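
For concreteness, the loop whose integrity is in question is
essentially the retry helper the kernel already carries; a simplified
sketch modeled on the arch/x86 RDRAND wrapper (not a verbatim copy):

#include <stdbool.h>

#define RDRAND_RETRY_LOOPS 10

static bool rdrand_long(unsigned long *v)
{
        bool ok;
        unsigned int retry = RDRAND_RETRY_LOOPS;

        do {
                /* CF=1 on success.  Nothing stops the host from
                 * preempting the guest between iterations, so "10
                 * tries" bounds the attempt count, not the elapsed
                 * wall time or the entropy conditions under which
                 * each attempt lands. */
                asm volatile("rdrand %0\n\t"
                             "setc %1"
                             : "=r" (*v), "=qm" (ok));
                if (ok)
                        return true;
        } while (--retry);

        return false;
}

Nothing in that construction defends against a hypervisor that
deliberately schedules the guest so that every one of the ten
attempts lands in a depleted window.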

Let us posit, for the sake of argument, that confidential computing
resolves to the implementation of a trusted computing platform, which
in turn resolves to a requirement for competent and robust
cryptography for initial and ongoing attestation, to say nothing of
confidentiality in the face of possible side-channel and timing
attacks.

I'm sure there would be a great deal of interest in any information
that can be provided on whether this scenario is possible, given the
level of control that a hypervisor is suggested to enjoy over an
ostensibly confidential and trusted guest.

> Best Regards,
> Elena.

Have a good day.

As always,
Dr. Greg

The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project