答复: [外部邮件] Re: [PATCH] x86/mce: Fix timer interval adjustment after logging a MCE event

From: Li,Rongqing

Date: Mon Feb 02 2026 - 18:50:57 EST


> On Wed, Jan 14, 2026 at 03:48:13PM +0100, Borislav Petkov wrote:
> > Now on to find what causes this. Even if we can't find the proper
> > commit, I guess testing 6.18 and 6.12 - the LTS kernels - should be
> > good enough as to backport a fix there.
>
> Ok, finally back to staring at this.
>
> Looks like adding this:
>
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index 34440021e8cf..b94efe5950c4 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -154,6 +154,8 @@ void mce_log(struct mce_hw_err *err) {
> if (mce_gen_pool_add(err))
> irq_work_queue(&mce_irq_work);
> +
> + set_bit(0, &mce_need_notify);
> }
> EXPORT_SYMBOL_GPL(mce_log);
>
> makes the interval halve again, see below for the timestamps.
>
> I guess I'll do a proper patch from the hunk here:
>
> https://lore.kernel.org/r/20260113224152.GVaWbKMMzManQ5WwlT@fat_cr
> ate.local
>

Is it possible where CPU0 sets mce_need_notify, but CPU1 concurrently calls mce_notify_irq in mce_timer_fn, and then CPU1 sets its own timer to 1/2 instead of CPU0's

[Li,Rongqing]



> along with 6.12 and 6.18 backports and see whether that's a good enough as a
> stable fix too.
>
> Thx.
>
> [ 316.795248] mce: [Hardware Error]: Machine check events logged
> [ 316.795262] mce: [Hardware Error]: Machine check events logged
> [ 316.798331] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:
> 9c2041000000011b [ 316.800104] mce: [Hardware Error]: TSC 0 ADDR
> 6d3d483b [ 316.801442] mce: [Hardware Error]: PROCESSOR 2:800f82 TIME
> 1770040950 SOCKET 0 APIC 0 microcode 800820d [ 628.091492] mce:
> [Hardware Error]: Machine check events logged [ 628.091515] mce:
> [Hardware Error]: Machine check events logged [ 628.097216] mce:
> [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9c2041000000011b
> [ 628.101393] mce: [Hardware Error]: TSC 0 ADDR 6d3d483b [ 628.103992]
> mce: [Hardware Error]: PROCESSOR 2:800f82 TIME 1770041262 SOCKET 0
> APIC 0 microcode 800820d
>
> <--- it starts decreasing the interval here.
>
> [ 825.581354] hrtimer: interrupt took 18820 ns [ 939.387367] mce:
> [Hardware Error]: Machine check events logged [ 939.390185] mce:
> [Hardware Error]: Machine check events logged [ 939.392936] mce:
> [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9c2041000000011b
> [ 939.396465] mce: [Hardware Error]: TSC 0 ADDR 6d3d483b [ 939.399042]
> mce: [Hardware Error]: PROCESSOR 2:800f82 TIME 1770041573 SOCKET 0
> APIC 0 microcode 800820d [ 1103.227402] mce: [Hardware Error]: Machine
> check events logged [ 1103.230267] mce: [Hardware Error]: Machine check
> events logged [ 1103.233018] mce: [Hardware Error]: CPU 0: Machine Check: 0
> Bank 4: 9c2041000000011b [ 1103.236565] mce: [Hardware Error]: TSC 0
> ADDR 6d3d483b [ 1103.239146] mce: [Hardware Error]: PROCESSOR 2:800f82
> TIME 1770041737 SOCKET 0 APIC 0 microcode 800820d [ 1179.003479] mce:
> [Hardware Error]: Machine check events logged [ 1179.006452] mce:
> [Hardware Error]: Machine check events logged [ 1179.009144] mce:
> [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9c2041000000011b
> [ 1179.012757] mce: [Hardware Error]: TSC 0 ADDR 6d3d483b [ 1179.015338]
> mce: [Hardware Error]: PROCESSOR 2:800f82 TIME 1770041813 SOCKET 0
> APIC 0 microcode 800820d [ 1217.915386] mce: [Hardware Error]: CPU 0:
> Machine Check: 0 Bank 4: 9c2041000000011b [ 1217.919088] mce:
> [Hardware Error]: TSC 0 ADDR 6d3d483b [ 1217.921662] mce: [Hardware
> Error]: PROCESSOR 2:800f82 TIME 1770041852 SOCKET 0 APIC 0 microcode
> 800820d [ 1238.395440] mce: [Hardware Error]: CPU 0: Machine Check: 0
> Bank 4: 9c2041000000011b [ 1238.399041] mce: [Hardware Error]: TSC 0
> ADDR 6d3d483b [ 1238.401619] mce: [Hardware Error]: PROCESSOR 2:800f82
> TIME 1770041872 SOCKET 0 APIC 0 microcode 800820d [ 1269.115368]
> mce_notify_irq: 4 callbacks suppressed [ 1269.117829] mce: [Hardware Error]:
> Machine check events logged [ 1269.120586] mce: [Hardware Error]: Machine
> check events logged [ 1269.123412] mce: [Hardware Error]: CPU 0: Machine
> Check: 0 Bank 4: 9c2041000000011b [ 1269.126950] mce: [Hardware Error]:
> TSC 0 ADDR 6d3d483b [ 1269.129511] mce: [Hardware Error]: PROCESSOR
> 2:800f82 TIME 1770041903 SOCKET 0 APIC 0 microcode 800820d
>
> and then it started enlarging it again when I changed the injection interval to
> 300s.
>
> [ 1578.363408] mce: [Hardware Error]: Machine check events logged
> [ 1578.366346] mce: [Hardware Error]: Machine check events logged
> [ 1578.369174] mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4:
> 9c2041000000011b [ 1578.372742] mce: [Hardware Error]: TSC 0 ADDR
> 6d3d483b [ 1578.375226] mce: [Hardware Error]: PROCESSOR 2:800f82 TIME
> 1770042212 SOCKET 0 APIC 0 microcode 800820d [ 2119.035460] mce:
> [Hardware Error]: Machine check events logged [ 2119.038432] mce:
> [Hardware Error]: Machine check events logged [ 2119.041236] mce:
> [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9c2041000000011b
> [ 2119.044846] mce: [Hardware Error]: TSC 0 ADDR 6d3d483b [ 2119.047340]
> mce: [Hardware Error]: PROCESSOR 2:800f82 TIME 1770042753 SOCKET 0
> APIC 0 microcode 800820d [ 2282.875491] mce: [Hardware Error]: Machine
> check events logged [ 2282.878409] mce: [Hardware Error]: Machine check
> events logged [ 2282.881277] mce: [Hardware Error]: CPU 0: Machine Check: 0
> Bank 4: 9c2041000000011b [ 2282.884978] mce: [Hardware Error]: TSC 0
> ADDR 6d3d483b [ 2282.887482] mce: [Hardware Error]: PROCESSOR 2:800f82
> TIME 1770042917 SOCKET 0 APIC 0 microcode 800820d [ 2512.251516] mce:
> [Hardware Error]: Machine check events logged [ 2512.254371] mce:
> [Hardware Error]: Machine check events logged [ 2512.257261] mce:
> [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9c2041000000011b
> [ 2512.260841] mce: [Hardware Error]: TSC 0 ADDR 6d3d483b [ 2512.263406]
> mce: [Hardware Error]: PROCESSOR 2:800f82 TIME 1770043146 SOCKET 0
> APIC 0 microcode 800820d
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette