Re: [PATCH] x86/CPU/AMD: Clear RDRAND CPUID if Built-In-Self-Test failed on boot

From: Dmitry Safonov

Date: Wed Apr 29 2026 - 14:58:49 EST


Hi Borislav,

On Wed, Apr 29, 2026 at 3:08 AM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On April 28, 2026 10:11:50 PM UTC, Dmitry Safonov <dima@xxxxxxxxxx> wrote:
> >On Tue, Apr 28, 2026 at 7:21 PM Borislav Petkov <bp@xxxxxxxxx> wrote:
> >>
> >> On Tue, Apr 28, 2026 at 06:35:31PM +0100, Dmitry Safonov via B4 Relay wrote:
> >> > Yet, CPUID gets cleared only for previously known broken
> >> > implementations, see i.e., commit c49a0a80137c ("x86/CPU/AMD: Clear
> >> > RDRAND CPUID bit on AMD family 15h/16h"), that disabled RDRAND on
> >> > the same CPU family, where it was broken only after suspend-resume.
> >> >
> >> > As RDRAND is not masked in CPUID, some userspace may attempt using it,
> >>
> >> So why aren't you clearing the MSR bit even if our internal X86_FEATURE
> >> representation is cleared?
> >
> >That's exactly what this is about?
>
> I don't know what you mean here...?

Yeah, I could have done a better job explaining the patch, my bad.

> You're doing a bunch of code to fix the case of what I understand is some ordering issue of init code which misses to clear the CPUID bit for RDRAND on those machines but then looking at the code again, x86_init_rdrand() runs *after* clear_rdrand_cpuid_bit()!

clear_rdrand_cpuid_bit() is called by init_amd() for families 0x15 and
0x16, and it only does anything under CONFIG_PM_SLEEP, as these AMD
families are known to have issues with RDRAND after suspend/resume.
What we have at Arista is family 0x15, model 0x60. We don't use
suspend/resume or hibernate on network switches for obvious reasons,
so CONFIG_PM_SLEEP is not enabled. Yet these CPUs return only zeros
from the rdrand instruction even on a regular boot.

So, for a while we carried a platform-specific out-of-tree patch that
removed the IS_ENABLED(CONFIG_PM_SLEEP) check from
clear_rdrand_cpuid_bit(). Yet, I don't think that is acceptable
upstream, as it seems other people with family 0x15 may have rdrand
working fine (or perhaps few people disable CONFIG_PM_SLEEP, I'm
unsure).

I'm attempting to go the other way here: instead of trying to refine
this blacklist of CPU families and conditions for which rdrand is
[known to be] broken, I think we can clear the MSR bit on AMD whenever
the RDRAND test/BIST detects it as non-functional. Under the current
code, the kernel itself does not use rdrand on these machines, because
x86_init_rdrand() properly detects that rdrand is broken and clears
the CPU cap. Yet the MSR bit is not cleared, because this
configuration is not on the known-broken list, so CPUID still
advertises RDRAND to userspace.

> So the CPUID bit should have been cleared already by the time userspace is up.
>
> So I guess I still don't know what exactly you're fixing here.
>
> Maybe try to explain again...?

Thanks,
Dmitry