Re: [PATCH] x86/mce: Restore MCA polling interval halving
From: Borislav Petkov
Date: Tue Apr 21 2026 - 08:06:04 EST
Hi Qiuxu,
On Mon, Apr 20, 2026 at 02:14:52PM +0000, Zhuo, Qiuxu wrote:
> 1. Test precondition:
> - Added debug messages [1] on top of Boris' patch.
> - RAS_CEC was disabled.
> - A correctable error was injected every 10 seconds.
>
> 2. Tested with CMCI interrupts enabled:
> - The message "Machine check events logged" was printed each time a correctable error was injected.
> - EDAC and mcelog in the decode chain were notified as expected.
>
> So, this part tested OK.
>
> 3. Tested in polling mode (boot with "mce=no_cmci"):
> - A CPU’s timer interval was halved after calling mce_log(), or when !mce_gen_pool_empty() was true during polling [2].
> - A CPU’s timer interval was doubled when mce_gen_pool_empty() was true during polling [2].
>
> This part tested OK, but please see comments below about mce_gen_pool_empty() check in mce_timer_fn().

Thanks for testing.

> mce_timer_fn()
>   machine_check_poll()
>     mce_log()
>       irq_work_queue(&mce_irq_work)
>         ...
>         mce_irq_work_cb()
>           mce_schedule_work()
>             schedule_work(&mce_work)
>               ...
>               mce_gen_pool_process() // [3] worker thread concurrently running on any CPU handles MCE logs.
>
>   mce_gen_pool_empty() // [4]
>
> It seems there is a race between [3] and [4].
> Although my testing did not observe this race, it's possible
> that mce_timer_fn() (in softirq) completes fast
> enough that it always finishes before [3] (in worker thread) is scheduled to run.
Does this and the next message in the thread explain the situation?
https://lore.kernel.org/r/20260207115142.GBaYcnTp7maUDVv3Nc@fat_crate.local

Bottom line: I don't think this was ever meant to be anything but a rough and
simple method to catch too many errors being logged and halve the polling
interval.

IOW, even if the above race happens, with that many errors being logged it
would still pick up on them and start halving eventually.
Right?
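
To illustrate the heuristic, here is a rough userspace sketch, not the actual
kernel code. The names next_poll_interval() and run_polls() and the
millisecond bounds are made up for this example; the kernel works in jiffies
and has its own limits. It only models the behaviour described above: halve
the interval when a poll saw logged errors, double it otherwise, and clamp the
result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bounds in milliseconds, chosen only for this sketch. */
#define POLL_MIN_MS 10UL
#define POLL_MAX_MS 300000UL

/*
 * Return the next polling interval: halve it when the last poll saw
 * logged errors, double it otherwise, clamped to
 * [POLL_MIN_MS, POLL_MAX_MS].
 */
static unsigned long next_poll_interval(unsigned long iv, bool errors_seen)
{
    assert(iv >= POLL_MIN_MS && iv <= POLL_MAX_MS);

    if (errors_seen)
        iv /= 2;
    else
        iv *= 2;

    if (iv < POLL_MIN_MS)
        iv = POLL_MIN_MS;
    if (iv > POLL_MAX_MS)
        iv = POLL_MAX_MS;

    return iv;
}

/*
 * Apply n consecutive poll results to a starting interval.
 * errors_seen[i] == false also stands in for a poll whose "pool not
 * empty" check lost the race against the worker thread.
 */
static unsigned long run_polls(unsigned long iv, const bool *errors_seen,
                               size_t n)
{
    for (size_t i = 0; i < n; i++)
        iv = next_poll_interval(iv, errors_seen[i]);

    return iv;
}
```

Feeding run_polls() a sequence in which one poll misses the errors but the
others see them still ends with a shorter interval than it started with, as
long as the polls that see errors outnumber the ones that miss them.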
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette