Re: [PATCH v15 22/23] x86/mce: Improve error log of kernel space TDX #MC due to erratum

From: Borislav Petkov
Date: Tue Dec 05 2023 - 09:26:18 EST


On Fri, Nov 10, 2023 at 12:55:59AM +1300, Kai Huang wrote:
> +static const char *mce_memory_info(struct mce *m)
> +{
> +	if (!m || !mce_is_memory_error(m) || !mce_usable_address(m))
> +		return NULL;
> +
> +	/*
> +	 * Certain initial generations of TDX-capable CPUs have an
> +	 * erratum. A kernel non-temporal partial write to TDX private
> +	 * memory poisons that memory, and a subsequent read of that
> +	 * memory triggers #MC.
> +	 *
> +	 * However such #MC caused by software cannot be distinguished
> +	 * from the real hardware #MC. Just print additional message
> +	 * to show such #MC may be result of the CPU erratum.
> +	 */
> +	if (!boot_cpu_has_bug(X86_BUG_TDX_PW_MCE))
> +		return NULL;
> +
> +	return !tdx_is_private_mem(m->addr) ? NULL :
> +		"TDX private memory error. Possible kernel bug.";
> +}
> +
>  static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  {
>  	struct llist_node *pending;
>  	struct mce_evt_llist *l;
>  	int apei_err = 0;
> +	const char *memmsg;
>
>  	/*
>  	 * Allow instrumentation around external facilities usage. Not that it
> @@ -283,6 +307,15 @@ static noinstr void mce_panic(const char *msg, struct mce *final, char *exp)
>  	}
>  	if (exp)
>  		pr_emerg(HW_ERR "Machine check: %s\n", exp);
> +	/*
> +	 * Confidential computing platforms such as TDX platforms
> +	 * may occur MCE due to incorrect access to confidential
> +	 * memory. Print additional information for such error.
> +	 */
> +	memmsg = mce_memory_info(final);
> +	if (memmsg)
> +		pr_emerg(HW_ERR "Machine check: %s\n", memmsg);
> +

No, this is not how this is done. First of all, this function should be
called something like

	mce_dump_aux_info()

or so to state that it is dumping some auxiliary info.

Then, it does:

	if (cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
		return tdx_get_mce_info();

or so. Put that tdx_get_mce_info() function in the TDX code and do all the
picking apart of things there: what needs to be dumped and what not,
checking whether it is a memory error, and so on.
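
To make the split concrete, here is a rough userspace sketch, not kernel
code and not a proposed patch. Everything except the two function names
and the message string is an assumption made so the example compiles on its
own: the cut-down struct mce, the boolean stand-ins for
cpu_feature_enabled(), boot_cpu_has_bug(), mce_is_memory_error() and
mce_usable_address(), the address range standing in for
tdx_is_private_mem(), and the struct mce argument passed to
tdx_get_mce_info().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for the kernel's struct mce. */
struct mce {
	uint64_t addr;		/* reported physical address */
	bool memory_error;	/* stand-in for mce_is_memory_error(m) */
	bool addr_usable;	/* stand-in for mce_usable_address(m) */
};

/* Hypothetical stand-ins for the kernel's feature and bug checks. */
static bool tdx_feature_enabled;	/* cpu_feature_enabled(X86_FEATURE_TDX_GUEST) */
static bool tdx_pw_mce_bug;		/* boot_cpu_has_bug(X86_BUG_TDX_PW_MCE) */

/* Hypothetical stand-in for tdx_is_private_mem(): one [start, end) range. */
static uint64_t tdx_priv_start, tdx_priv_end;

/*
 * Would live in TDX code: all TDX-specific picking apart happens here,
 * including the "is this a memory error with a usable address" checks.
 */
static const char *tdx_get_mce_info(const struct mce *m)
{
	if (!m || !m->memory_error || !m->addr_usable)
		return NULL;

	if (!tdx_pw_mce_bug)
		return NULL;

	if (m->addr < tdx_priv_start || m->addr >= tdx_priv_end)
		return NULL;

	return "TDX private memory error. Possible kernel bug.";
}

/*
 * Would live in the MCE core: it only dispatches and knows nothing about
 * TDX internals.
 */
static const char *mce_dump_aux_info(const struct mce *m)
{
	if (tdx_feature_enabled)
		return tdx_get_mce_info(m);

	return NULL;
}
```

The point of the shape is that mce_panic() would only ever call
mce_dump_aux_info() and print whatever string comes back, if any.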

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette