Re: Re: Re: Re: [External Mail] Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a MCE event
From: Nikolay Borisov
Date: Tue Jan 13 2026 - 13:55:11 EST
On 13.01.26 at 20:53, Luck, Tony wrote:
The comment in mce_timer_fn says to adjust the polling interval, but
I notice the kernel log always shows an MCE log every 5 minutes. Is this
normal?
Use git annotate to figure out which patch added this comment and in what
context, and that'll tell you why.
As to the 5 minutes, look at how the check interval gets established.
Once upon a time the polling interval started out at 5 minutes, but the
interval was halved each time an error was found (so interval went
150s, 75s, 37s, ... down to 1s). If no error was found, then the interval
was doubled (going back up to 300s).
This is described in the comment:
/*
* Alert userspace if needed. If we logged an MCE, reduce the polling
* interval, otherwise increase the polling interval.
*/
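
The halve-on-error, double-on-quiet behaviour described above can be sketched
in a few lines of C. This is only an illustration of the scheme as stated in
this mail, not the kernel's code: the function name and the two constants are
hypothetical, and the values are in seconds rather than jiffies.

```c
#include <assert.h>

/* Hypothetical bounds in seconds, taken from the description above. */
#define POLL_MAX_SECS 300UL
#define POLL_MIN_SECS 1UL

/*
 * Return the next polling interval: halve it if the last poll found an
 * error, otherwise double it, clamped to [POLL_MIN_SECS, POLL_MAX_SECS].
 */
static unsigned long next_poll_interval(unsigned long cur, int error_found)
{
	assert(cur >= POLL_MIN_SECS && cur <= POLL_MAX_SECS);

	if (error_found) {
		cur /= 2;
		if (cur < POLL_MIN_SECS)
			cur = POLL_MIN_SECS;
	} else {
		cur *= 2;
		if (cur > POLL_MAX_SECS)
			cur = POLL_MAX_SECS;
	}
	return cur;
}
```

Starting from 300 and feeding in an error on every poll walks the interval
through 150, 75, 37 and so on down to 1, which is the sequence quoted above.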
It seems that the kernel isn't doing that today: it polls at a fixed 300 seconds
even though errors are being found and logged. Interestingly, the timestamps
are 327.68 seconds apart, rather than 300 and change. So there is some strange
stuff going on.
I can reproduce this here on an Icelake system. Booted with mce=no_cmci to force polling
(and turned off BIOS firmware-first mode). Injecting an error every 30 seconds, I also see a
constant 327 seconds between logs (multiple logs show up because my injection picks the memory
channel "randomly", so there can be several banks with errors when polling happens).
$ dmesg | grep 'Machine Check Event:'
[ 662.632988] EDAC skx MC4: CPU 40: Machine Check Event: 0x0 Bank 13: 0x8c00014200800090
[ 662.727377] EDAC skx MC6: CPU 40: Machine Check Event: 0x0 Bank 21: 0x8c0000c200800090
[ 990.283484] EDAC skx MC4: CPU 121: Machine Check Event: 0x0 Bank 13: 0x8c00010200800090
[ 990.378233] EDAC skx MC6: CPU 121: Machine Check Event: 0x0 Bank 21: 0x8c00014200800090
[ 990.467199] EDAC skx MC0: CPU 3: Machine Check Event: 0x0 Bank 13: 0x8c00004200800090
[ 1317.939260] EDAC skx MC4: CPU 122: Machine Check Event: 0x0 Bank 13: 0x8c00010200800090
[ 1318.033721] EDAC skx MC6: CPU 122: Machine Check Event: 0x0 Bank 21: 0x8c00010200800090
[ 1318.122612] EDAC skx MC0: CPU 14: Machine Check Event: 0x0 Bank 13: 0x8c00004200800090
[ 1318.211507] EDAC skx MC2: CPU 14: Machine Check Event: 0x0 Bank 21: 0x8c00004200800090
[ 1645.590773] EDAC skx MC4: CPU 129: Machine Check Event: 0x0 Bank 13: 0x8c00010200800090
[ 1645.685153] EDAC skx MC6: CPU 129: Machine Check Event: 0x0 Bank 21: 0x8c00018200800090
[ 1645.773744] EDAC skx MC0: CPU 100: Machine Check Event: 0x0 Bank 13: 0x8c00004200800090
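
To check the spacing, the gap between polling batches can be computed from the
first timestamp of each batch in the log above. A minimal sketch in C, with the
four timestamps copied by hand from that log; the array and function names are
made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* First dmesg timestamp of each polling batch, copied from the log above. */
static const double batch_start[] = {
	662.632988, 990.283484, 1317.939260, 1645.590773,
};
#define NBATCH (sizeof(batch_start) / sizeof(batch_start[0]))

/*
 * Store the gap between consecutive batch timestamps in gaps[], which
 * must have room for n - 1 entries. Returns the number of gaps written.
 */
static size_t batch_gaps(const double *ts, size_t n, double *gaps)
{
	size_t i;

	assert(ts != NULL && gaps != NULL);

	if (n < 2)
		return 0;
	for (i = 1; i < n; i++)
		gaps[i - 1] = ts[i] - ts[i - 1];
	return n - 1;
}
```

For these four timestamps every gap comes out at roughly 327.65 seconds, so
the interval is steady from one poll to the next rather than shrinking.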
-Tony
At this stage I think lirongqi's patch is ok, but in the long run (i.e. tomorrow) I will send a patch that simply eliminates the mce_notify_irq call in mce_timer_fn, i.e. that function should be called only from the early notifier.