Re: [PATCH 2/2] x86/random: Issue a warning if RDRAND or RDSEED fails
From: Dr. Greg
Date: Thu Feb 08 2024 - 05:36:45 EST
On Tue, Feb 06, 2024 at 01:00:03PM +0000, Daniel P. Berrang?? wrote:
Good morning.
> On Tue, Feb 06, 2024 at 06:04:45AM -0600, Dr. Greg wrote:
> > On Tue, Feb 06, 2024 at 08:04:57AM +0000, Daniel P. Berrang?? wrote:
> >
> > Good morning to everyone.
> >
> > > On Mon, Feb 05, 2024 at 07:12:47PM -0600, Dr. Greg wrote:
> > > >
> > > > Actually, I now believe there is clear evidence that the problem is
> > > > indeed Intel specific. In light of our testing, it will be
> > > > interesting to see what your 'AR' returns with respect to an official
> > > > response from Intel engineering on this issue.
> > > >
> > > > One of the very bright young engineers collaborating on Quixote, who
> > > > has been following this conversation, took it upon himself to do some
> > > > very methodical engineering analysis on this issue. I'm the messenger
> > > > but this is very much his work product.
> > > >
> > > > Executive summary is as follows:
> > > >
> > > > - No RDRAND depletion failures were observable with either the Intel
> > > > or AMD hardware that was load tested.
> > > >
> > > > - RDSEED depletion is an Intel specific issue, AMD's RDSEED
> > > > implementation could not be provoked into failure.
> >
> > > My colleague ran a multithread parallel stress test program on his
> > > 16core/2HT AMD Ryzen (Zen4 uarch) and saw a 80% failure rate in
> > > RDSEED.
> >
> > Interesting datapoint, thanks for forwarding it along, so the issue
> > shows up on at least some AMD platforms as well.
> >
> > On the 18 core/socket Intel Skylake platform, the parallelized
> > depletion test forces RDSEED success rates down to around 2%. It
> > would appear that your tests suggest that the AMD platform fairs
> > better than the Intel platform.
> Yes, given the speed of the AMD RDRAND/RDSEED ops, compared to my
> Intel test platforms, their DRBG looks better able to keep up with
> the demand for bits.
We now believe the observed resiliency of AMD's RNG infrastructure
comes down to the fact that the completion times of their RNG
instructions are significantly slower than Intel's.
SkyLake and KabyLake instruction completion times are documented at
463 clock cycles, regardless of operand size.
AMD Ryzen documents variable completion times based on operand size.
16 and 32 bit transfers complete in 1200 clock cycles with 64 bit
requests completing in 2500 clock cycles.
Given that Jason's test program was issueing 64-bit RNG requests, the
AMD platforms are going to be approximately 5.4 times slower than
Intel platforms, provided the results are corrected for CPU clock
rates.
AMD's entropy source is execution jitter time over a bank of inverter
based ring oscillors, presumably sampled by a constant clock rate
sampler. Slower instruction retirement times consumes less of the
constant rate entropy production.
Intel uses thermal/quantum noise across a diode junction retrieved by
a self-clocked sampler. Faster instruction retirement translates into
increased bandwidth demands on the sampler.
> > Of course, the other variable may be how the parallelized stress test
> > is conducted. If you would like to share your implementation source
> > we could give it a twirl on the systems we have access to.
>
> It is just Jason's earlier test program, but moved into one thread
> for each core....
>
> $ cat cpurngstress.c
> #include <stdio.h>
> #include <immintrin.h>
> #include <pthread.h>
> #include <unistd.h>
>
> /*
> * Gives about 25 seconds walllock time on my Alderlake CPU
> *
> * Probably want to reduce this x10, or possibly even x100
> * on AMD due to much slower ops.
> */
> #define MAX_ITER 10000000
>
> #define MAX_CPUS 4096
>
> void *doit(void *f) {
> unsigned long long rand;
> unsigned int i, success_rand = 0, success_seed = 0;
>
> for (i = 0; i < MAX_ITER; ++i) {
> success_seed += !!_rdseed64_step(&rand);
> }
> for (i = 0; i < MAX_ITER; ++i) {
> success_rand += !!_rdrand64_step(&rand);
> }
>
> fprintf(stderr,
> "RDRAND: %.2f%%, RDSEED: %.2f%%\n",
> success_rand * 100.0 / MAX_ITER,
> success_seed * 100.0 / MAX_ITER);
>
> return NULL;
> }
>
>
> int main(int argc, char *argv[])
> {
> pthread_t th[MAX_CPUS];
> int nproc = sysconf(_SC_NPROCESSORS_ONLN);
> if (nproc > MAX_CPUS) {
> nproc = MAX_CPUS;
> }
> fprintf(stderr, "Stressing RDRAND/RDSEED across %d CPUs\n", nproc);
>
> for (int i = 0 ; i < nproc;i ++) {
> pthread_create(&th[i], NULL, doit,NULL);
> }
>
> for (int i = 0 ; i < nproc;i ++) {
> pthread_join(th[i], NULL);
> }
>
> return 0;
> }
>
> $ gcc -march=native -o cpurngstress cpurngstress.c
Thanks for forwarding your test code along, we've added it to our
tests for comparison.
> > If there is the possibility of over-harvesting randomness, why not
> > design the implementations to be clamped at some per core value such
> > as a megabit/second. In the case of the documented RDSEED generation
> > rates, that would allow the servicing of 3222 cores, if my math at
> > 0530 in the morning is correct.
> >
> > Would a core need more than 128 kilobytes of randomness, ie. one
> > second of output, to effectively seed a random number generator?
> >
> > A cynical conclusion would suggest engineering acquiesing to marketing
> > demands... :-)
> My assumption is that it was simply easier to not implement a rate
> limiting feature at the CPU level and punt the starvation problem to
> software :-)
Could be, it does seem unlikely that random number generation speed
would be seen as fertile ground for marketing types.
Punting to software is certainly rationale, perhaps problematic in a
CoCo environment depending on the definition of 'astronomical'. See
my response to Borislav who was kind enough to respond to all of this.
> With regards,
> Daniel
Have a good day.
As always,
Dr. Greg
The Quixote Project - Flailing at the Travails of Cybersecurity
https://github.com/Quixote-Project