Re: [PATCH] x86/microcode/intel: Panic on partial microcode update
From: Dave Hansen
Date: Tue Jun 30 2026 - 17:39:27 EST
On 6/30/26 12:13, Chang S. Bae wrote:
> The MSR IA32_MCU_STATUS, if available, may report a partial update status
> on a microcode update. This indicates that only a subset of microcode
> components was updated successfully while other parts of updates failed,
> which in turn leaves the system in an undefined and (potentially)
> unreliable state.
A modern individual microcode update update contains firmware for many
pieces of silicon inside the CPU. Sometimes, a single update operation
successfully updates some components and not others leaving a
partially-applied update.
This leaves the system in an undefined and unreliable state.
> The status is possible on newer CPUs. While validation is expected to
> prevent these error conditions in normal deployments, handling them
> explicitly protects systems against an otherwise undefined state.
I think that this is saying that CPUs try hard not to throw up their
hands and give up mid-update. But, the world is hard and things don't
always go to plan.
> +#define MSR_IA32_MCU_STATUS 0x0000007c
> +#define MCU_PARTIAL_UPDATE BIT(0)
> +#define AUTH_FAIL_ON_MCU_COMPONENT BIT(1)
Let's just make a :
#define MCU_STATUS_FAILURE_MASK (MCU_PARTIAL_UPDATE \
AUTH_FAIL_ON_MCU_COMPONENT)
> diff --git a/arch/x86/kernel/cpu/microcode/intel.c b/arch/x86/kernel/cpu/microcode/intel.c
> index f4a444e6114d..0b474a7c6986 100644
> --- a/arch/x86/kernel/cpu/microcode/intel.c
> +++ b/arch/x86/kernel/cpu/microcode/intel.c
> @@ -76,6 +76,9 @@ static struct microcode_intel *ucode_patch_late __read_mostly;
> /* last level cache size per core */
> static unsigned int llc_size_per_core __ro_after_init;
>
> +/* CPU capability for update status and staging support */
> +static bool cpu_has_mcu __ro_after_init;
These are *SUCH* slow, rare paths that I'm not even sure we need to
cache this.
> /* microcode format is extended from prescott processors */
> struct extended_signature {
> unsigned int sig;
> @@ -702,6 +705,16 @@ static enum ucode_state __apply_microcode(struct ucode_cpu_info *uci,
> /* write microcode via MSR 0x79 */
> native_wrmsrq(MSR_IA32_UCODE_WRITE, (unsigned long)mc->bits);
>
> + /* Check if the update put the system in an unreliable state */
> + if (cpu_has_mcu) {
> + u64 status = native_rdmsrq(MSR_IA32_MCU_STATUS);
> +
> + if (status & (MCU_PARTIAL_UPDATE | AUTH_FAIL_ON_MCU_COMPONENT)) {
> + pr_emerg("Partial update: MSR_IA32_MCU_STATUS=0x%llx\n", status);
> + nmi_panic(NULL, "Microcode load: fatal status from partial update");
> + }
> + }
Are both messages needed? Is this because nmi_panic() doesn't do
printk() formatting?
Also, let's just say:
nmi_panic(NULL, "Fatal microcode update failure");
> rev = intel_get_microcode_revision();
> if (rev != mc->hdr.rev)
> return UCODE_ERROR;
> @@ -779,11 +792,30 @@ static int __init save_builtin_microcode(void)
> }
> early_initcall(save_builtin_microcode);
>
> +#define CPUID_EDX_ARCH_CAP BIT(29)
> +
> +static __init bool mcu_capable(void)
> +{
> + if (native_cpuid_eax(0) < 7)
> + return false;
> +
> + if (!(native_cpuid_edx(7) & CPUID_EDX_ARCH_CAP))
> + return false;
> +
> + if (!(native_rdmsrq(MSR_IA32_ARCH_CAPABILITIES) & ARCH_CAP_MCU_ENUM))
> + return false;
> +
> + return true;
> +}
Comments, please.
"mcu_capable" sounds like "Is this CPU capable of microcode updates",
which doesn't seem quite right or logically sensible.
Second, this needs to say why it's using raw CPUID functions and not
X86_FEATURE* bits or cpuid().
> /* Load microcode on BSP from initrd or builtin blobs */
> void __init load_ucode_intel_bsp(struct early_load_data *ed)
> {
> struct ucode_cpu_info uci;
>
> + /* Indicate early enough to cover the early-loading path */
> + cpu_has_mcu = mcu_capable();
> +
> uci.mc = get_microcode_blob(&uci, false);
> ed->old_rev = uci.cpu_sig.rev;
>
> @@ -1023,8 +1055,7 @@ static __init bool staging_available(void)
> {
> u64 val;
>
> - val = x86_read_arch_cap_msr();
> - if (!(val & ARCH_CAP_MCU_ENUM))
> + if (!cpu_has_mcu)
> return false;
>
> rdmsrq(MSR_IA32_MCU_ENUMERATION, val);
Yeah, I'm not sure we need to cache mcu_capable(). Just call it twice.