Re: Reply: Reply: Reply: [External Mail] Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a MCE event

From: Borislav Petkov

Date: Tue Jan 13 2026 - 16:32:24 EST


On Tue, Jan 13, 2026 at 09:05:01PM +0000, Luck, Tony wrote:
> >> $ dmesg | grep 'Machine Check Event:'
> >
> > Did you see the "Machine check events logged\n" print from mce_notify_irq() in
> > dmesg too?
>
> Yes. I used the other grep pattern to see detail of which CPU/bank logged the error.
> Same pattern of timestamps shows up with this grep too.

Yah, this confirms the flow:

mce_timer_fn() -> ... -> machine_check_poll() -> mce_log(), which will queue
the work and return.

Now, back in mce_timer_fn:

	/*
	 * Alert userspace if needed. If we logged an MCE, reduce the polling
	 * interval, otherwise increase the polling interval.
	 */
	if (mce_notify_irq())

<--- we haven't run the notifier chain yet, so mce_need_notify is not set yet.
This won't hit and we won't halve the interval. I need to verify that
empirically.

		iv = max(iv / 2, (unsigned long) HZ/100);
	else
		iv = min(iv * 2, round_jiffies_relative(check_interval * HZ));

And now the notifier chain runs. mce_early_notifier() sets the bit and calls
mce_notify_irq(), which clears the bit, and then a little later in the
notifier chain skx_edac logs the error.

So this looks like a silly timing issue...
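
To make the suspected ordering concrete, here is a small userspace model of
it. This is only a sketch: every model_* name, MODEL_HZ and
MODEL_CHECK_INTERVAL are made up for illustration, the rounding done by
round_jiffies_relative() is left out, and the ordering it encodes is the
hypothesis above, not something verified on real hardware.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; names mirror the kernel, this is not kernel code. */
#define MODEL_HZ             1000UL  /* assumed tick rate */
#define MODEL_CHECK_INTERVAL 300UL   /* assumed polling ceiling, in seconds */

struct mce_model {
	bool need_notify;   /* stands in for mce_need_notify */
	bool work_pending;  /* stands in for the work queued by mce_log() */
	unsigned long iv;   /* polling interval in jiffies */
};

/* In this model mce_log() only queues work; it does not touch the flag. */
static void model_mce_log(struct mce_model *m)
{
	m->work_pending = true;
}

/* Test-and-clear of the flag, like mce_notify_irq() does. */
static bool model_mce_notify_irq(struct mce_model *m)
{
	bool was_set = m->need_notify;

	m->need_notify = false;
	return was_set;
}

/* Deferred work: the early notifier sets the bit and consumes it itself. */
static void model_run_work(struct mce_model *m)
{
	if (!m->work_pending)
		return;
	m->work_pending = false;
	m->need_notify = true;          /* mce_early_notifier() sets the bit */
	(void)model_mce_notify_irq(m);  /* ... and clears it right away */
}

/* One timer tick: poll, adjust the interval, and only then run the work. */
static unsigned long model_timer_fn(struct mce_model *m, bool error_found)
{
	unsigned long lo = MODEL_HZ / 100;
	unsigned long hi = MODEL_CHECK_INTERVAL * MODEL_HZ;

	assert(m->iv > 0);
	if (error_found)
		model_mce_log(m);

	if (model_mce_notify_irq(m))
		m->iv = (m->iv / 2 > lo) ? m->iv / 2 : lo;
	else
		m->iv = (m->iv * 2 < hi) ? m->iv * 2 : hi;

	model_run_work(m);  /* too late to influence this tick */
	return m->iv;
}
```

Starting from iv = 1000 and calling model_timer_fn() with error_found = true,
the model returns 2000: the interval doubles even though an error was logged
on that very tick, because the flag is set and cleared only after the
interval has been computed.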

We could set mce_need_notify in mce_log(), zap this thing:

	if (__ratelimit(&ratelimit))
		pr_info(HW_ERR "Machine check events logged\n");

in mce_notify_irq(), or at least predicate it on the CEC being enabled, and
then not call mce_notify_irq() in the notifier but leave it to be called in
the timer function...
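
Roughly, the idea would behave like the userspace sketch below. It is not a
patch: the sketch_* names and the two constants are hypothetical, the
rounding from round_jiffies_relative() is omitted, and the pr_info() and CEC
parts are not modelled at all. It only shows what happens to the interval
when the flag is set at log time and consumed solely by the timer function.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace sketch of the idea; not kernel code. */
#define SKETCH_HZ             1000UL  /* assumed tick rate */
#define SKETCH_CHECK_INTERVAL 300UL   /* assumed polling ceiling, in seconds */

static bool sketch_need_notify;       /* stands in for mce_need_notify */

/* Assumption: the flag is set directly when the event is logged. */
static void sketch_mce_log(void)
{
	sketch_need_notify = true;
}

/* Test-and-clear; in this sketch only the timer path calls it. */
static bool sketch_mce_notify_irq(void)
{
	bool was_set = sketch_need_notify;

	sketch_need_notify = false;
	return was_set;
}

/* Interval adjustment as in mce_timer_fn(), minus jiffies rounding. */
static unsigned long sketch_next_interval(unsigned long iv, bool error_found)
{
	unsigned long lo = SKETCH_HZ / 100;
	unsigned long hi = SKETCH_CHECK_INTERVAL * SKETCH_HZ;

	assert(iv > 0);
	if (error_found)
		sketch_mce_log();  /* flag is visible before the check below */

	if (sketch_mce_notify_irq())
		return (iv / 2 > lo) ? iv / 2 : lo;

	return (iv * 2 < hi) ? iv * 2 : hi;
}
```

With the flag set in the log path, sketch_next_interval(1000, true) gives 500
and sketch_next_interval(1000, false) gives 2000, which is the behaviour the
comment in mce_timer_fn() describes.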

Ufff, how silly and overengineered we've made it. I need to think about
a cleaner solution tomorrow...

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette