[PATCH] x86/mce: Handle AMD threshold interrupt storms

From: Naik, Avadhut

Date: Fri Nov 21 2025 - 02:04:51 EST

On 11/21/2025 00:53, Greg KH wrote:
> On Thu, Nov 20, 2025 at 09:41:24PM +0000, Avadhut Naik wrote:
>> From: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
>>
>> Extend the logic of handling CMCI storms to AMD threshold interrupts.
>>
>> Use an approach similar to Intel's CMCI storm handling to mitigate storms per
>> CPU and per bank. Unlike CMCI, however, do not set thresholds or reduce the
>> interrupt rate on a storm. Instead, disable the interrupt on the corresponding
>> CPU and bank. Re-enable the interrupt once enough consecutive polls of the
>> bank show no corrected errors (30, as programmed by Intel).
>>
>> Turning off the threshold interrupts would be a better solution on AMD systems
>> as other error severities will still be handled even if the threshold
>> interrupts are disabled.
>>
>> Also, AMD systems currently allow banks to be managed by both polling and
>> interrupts. So don't modify the polling banks set after a storm ends.
>>
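For anyone skimming the thread, the storm handling described above can be sketched roughly as follows. This is a standalone illustration, not the kernel code: struct bank_state, storm_begin(), bank_poll() and STORM_END_POLLS are hypothetical names, and how a storm is detected in the first place is left out. The only number taken from the commit message is the 30 quiet polls.

```c
#include <assert.h>
#include <stdbool.h>

/* Quiet polls needed to end a storm (30, per the commit message). */
#define STORM_END_POLLS 30

/* Hypothetical per-CPU, per-bank state; not the kernel's real layout. */
struct bank_state {
	bool in_storm;     /* a storm is currently being mitigated */
	bool irq_enabled;  /* threshold interrupt enabled for this bank */
	int  quiet_polls;  /* consecutive polls with no corrected error */
};

static void bank_init(struct bank_state *b)
{
	assert(b);
	b->in_storm = false;
	b->irq_enabled = true;
	b->quiet_polls = 0;
}

/* Storm detected (detection itself is assumed): disable the interrupt. */
static void storm_begin(struct bank_state *b)
{
	assert(b);
	b->in_storm = true;
	b->irq_enabled = false;
	b->quiet_polls = 0;
}

/* One poll of the bank; saw_error is true if a corrected error was found. */
static void bank_poll(struct bank_state *b, bool saw_error)
{
	assert(b);
	if (!b->in_storm)
		return;

	if (saw_error) {
		/* Any error restarts the count of consecutive quiet polls. */
		b->quiet_polls = 0;
		return;
	}

	if (++b->quiet_polls >= STORM_END_POLLS) {
		/* Storm over: re-enable the threshold interrupt. */
		b->in_storm = false;
		b->irq_enabled = true;
		b->quiet_polls = 0;
	}
}
```

In this sketch, a single corrected error seen while polling restarts the count, so the interrupt only comes back after 30 clean polls in a row.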
>> [Tony: Small tweak because mce_handle_storm() isn't a pointer now]
>> [Yazen: Rebase and simplify]
>>
>> Stable backport notes:
>> 1. Currently, when a Machine check interrupt storm is detected, the bank's
>> corresponding bit in mce_poll_banks per-CPU variable is cleared by
>> cmci_storm_end(). As a result, on AMD's SMCA systems, errors injected or
>> encountered after the storm subsides are not logged since polling on that
>> bank has been disabled. Polling banks set on AMD systems should not be
>> modified when a storm subsides.
>>
>> 2. This patch is a snippet from the CMCI storm handling patch (link below)
>> that has been accepted into tip for v6.19. While backporting the patch
>> would have been the preferred way, that cannot be done since it's part of
>> a larger set. As such, this fix will be temporary. When the original patch
>> and its set are integrated into stable, this patch should be reverted.
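To make note 1 concrete, here is a small standalone sketch of the difference in behaviour. It is not the kernel code: poll_banks_t is a hypothetical stand-in for the per-CPU mce_poll_banks bitmap, and the two storm_end_*() helpers are made-up names used only to contrast the current behaviour with the one this fix asks for on AMD systems.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-CPU mce_poll_banks bitmap. */
typedef uint64_t poll_banks_t;

/* True if the given bank is still in the polling set. */
static bool bank_is_polled(poll_banks_t banks, unsigned int bank)
{
	return (banks >> bank) & 1;
}

/* Behaviour described in note 1: ending a storm clears the bank's bit. */
static poll_banks_t storm_end_clearing(poll_banks_t banks, unsigned int bank)
{
	return banks & ~(UINT64_C(1) << bank);
}

/* Behaviour wanted on AMD: the polling set is left untouched. */
static poll_banks_t storm_end_preserving(poll_banks_t banks, unsigned int bank)
{
	(void)bank; /* the bank stays polled, so nothing to change */
	return banks;
}
```

With the clearing variant, a bank that was polled before the storm drops out of the polling set afterwards, which is why later errors on that bank go unlogged.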
>>
>> Signed-off-by: Smita Koralahalli <Smita.KoralahalliChannabasappa@xxxxxxx>
>> Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
>> Signed-off-by: Yazen Ghannam <yazen.ghannam@xxxxxxx>
>> Signed-off-by: Borislav Petkov (AMD) <bp@xxxxxxxxx>
>> Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@xxxxxxxxx>
>> Link: https://lore.kernel.org/20251104-wip-mca-updates-v8-0-66c8eacf67b9@xxxxxxx
>> Signed-off-by: Avadhut Naik <avadhut.naik@xxxxxxx>
>> ---
>> This is somewhat of a new scenario for me, and I am not really sure about
>> the procedure. Hence, I have neither modified the commit message nor removed
>> the tags. If required, I will rework both.
>> Also, while this issue can be encountered on AMD systems using v6.8 and
>> later stable kernels, we would specifically prefer for this fix to be
>> backported to v6.12 since it's an LTS kernel.
>
> What is the git commit id of this change in Linus's tree?

I don't think it has been merged into mainline's master branch yet.
The commit was accepted into the tip tree recently (on November 5th).

Its commit ID there is:

a5834a5458aa004866e7da402c6bc2dfe2f3737e

Link: https://lore.kernel.org/all/176243356968.2601451.11559805061162819633.tip-bot2@tip-bot2/

Do I need to send another version with this commit ID mentioned in the commit message?

--
Thanks,
Avadhut Naik