RE: [PATCH] x86/mce: Restore MCA polling interval halving

From: Zhuo, Qiuxu

Date: Tue Apr 07 2026 - 11:11:33 EST


Hi Boris,

> From: Borislav Petkov <bp@xxxxxxxxx>
> Sent: Tuesday, April 7, 2026 6:49 AM
> To: Li,Rongqing(ACG CCN) <lirongqing@xxxxxxxxx>
> Cc: Luck, Tony <tony.luck@xxxxxxxxx>; Nikolay Borisov
> <nik.borisov@xxxxxxxx>; Thomas Gleixner <tglx@xxxxxxxxxx>; Ingo Molnar
> <mingo@xxxxxxxxxx>; Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>;
> x86@xxxxxxxxxx; H . Peter Anvin <hpa@xxxxxxxxx>; Yazen Ghannam
> <yazen.ghannam@xxxxxxx>; Zhuo, Qiuxu <qiuxu.zhuo@xxxxxxxxx>;
> Avadhut Naik <avadhut.naik@xxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> linux-edac@xxxxxxxxxxxxxxx
> Subject: [PATCH] x86/mce: Restore MCA polling interval halving
>
> Ok,
>
> finally. :-\
>
> Pls run it to make sure it DTRT for you too.
>
> Thx.
>
> ---
> From: "Borislav Petkov (AMD)" <bp@xxxxxxxxx>
> Date: Mon, 16 Mar 2026 16:12:00 +0100
> Subject: [PATCH] x86/mce: Restore MCA polling interval halving
>
> RongQing reported that the MCA polling interval doesn't halve when an error
> gets logged. It was traced down to the commit in Fixes: because:
>
>   mce_timer_fn()
>    |-> mce_poll_banks()
>        |-> machine_check_poll()
>            |-> mce_log()
>
> which will queue the work and return.
>
> Now, back in mce_timer_fn():
>
> /*
> * Alert userspace if needed. If we logged an MCE, reduce the polling
> * interval, otherwise increase the polling interval.
> */
> if (mce_notify_irq())
>
> <--- here we haven't run the notifier chain yet, so mce_need_notify is not set
> yet, so this won't hit and we won't halve the interval iv.
>
> Now the notifier chain runs. mce_early_notifier() sets the bit, does
> mce_notify_irq(), that clears the bit and then the notifier chain a little later
> logs the error.
>
> So this is a silly timing issue.
>
> But, that's all unnecessary.
>
> All that needs to happen here is that the "should we notify of a logged MCE"
> question mce_notify_irq() asks becomes a simple question to the MCE gen pool:
> "Are you empty?"
>
> And that then turns into a simple yes or no answer and it all JustWorks(tm).
>
> So do that.
>
> Fixes: 011d82611172 ("RAS: Add a Corrected Errors Collector")
> Reported-by: Li RongQing <lirongqing@xxxxxxxxx>
> Signed-off-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
> Link: https://lore.kernel.org/r/20260112082747.2842-1-lirongqing@xxxxxxxxx
> ---
> arch/x86/kernel/cpu/mce/core.c | 7 +------
> 1 file changed, 1 insertion(+), 6 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 8dd424ac5de8..d18db7d8d237 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -90,7 +90,6 @@ struct mca_config mca_cfg __read_mostly = { };
>
> static DEFINE_PER_CPU(struct mce_hw_err, hw_errs_seen);
> -static unsigned long mce_need_notify;
>
> /*
> * MCA banks polled by the period polling timer for corrected events.
> @@ -595,7 +594,7 @@ static bool mce_notify_irq(void)
> /* Not more than two messages every minute */
> static DEFINE_RATELIMIT_STATE(ratelimit, 60*HZ, 2);
>
> - if (test_and_clear_bit(0, &mce_need_notify)) {
> + if (!mce_gen_pool_empty()) {
> mce_work_trigger();
>
> if (__ratelimit(&ratelimit))
> @@ -618,10 +617,6 @@ static int mce_early_notifier(struct notifier_block *nb, unsigned long val,
> /* Emit the trace record: */
> trace_mce_record(err);
>
> - set_bit(0, &mce_need_notify);
> -
> - mce_notify_irq();
> -

I injected a correctable error with CMCI enabled on an Intel test machine, and
mce_early_notifier() was invoked. However, with this change the following code in
mce_notify_irq() is never executed, and I didn't see the "Machine check events logged"
message.

	...
	mce_work_trigger();

	if (__ratelimit(&ratelimit))
		pr_info(HW_ERR "Machine check events logged\n");

	return true;
	...

Thanks!
Qiuxu