Re: [PATCH v3 3/3] x86/mce: Add CMCI storm switching support for Zhaoxin

From: Tony W Wang-oc
Date: Sun Sep 22 2024 - 23:42:54 EST


Resending this mail because I received the message "Undelivered Mail Returned to Sender".

On 2024/9/20 20:09, Tony W Wang-oc wrote:


On 2024/9/20 19:44, Zhuo, Qiuxu wrote:


From: Tony W Wang-oc <TonyWWang-oc@xxxxxxxxxxx>
[...]
--- a/arch/x86/kernel/cpu/mce/zhaoxin.c
+++ b/arch/x86/kernel/cpu/mce/zhaoxin.c
@@ -63,3 +63,21 @@ void mce_zhaoxin_feature_clear(struct cpuinfo_x86 *c)
 {
        intel_clear_lmce();
 }
+
+void mce_zhaoxin_handle_storm(int bank, bool on)
+{
+     unsigned long flags;
+     u64 val;
+
+     raw_spin_lock_irqsave(&cmci_discover_lock, flags);
+     rdmsrl(MSR_IA32_MCx_CTL2(bank), val);
+     if (on) {
+             val &= ~(MCI_CTL2_CMCI_EN | MCI_CTL2_CMCI_THRESHOLD_MASK);
+             val |= CMCI_STORM_THRESHOLD;
+     } else {
+             val &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
+             val |= (MCI_CTL2_CMCI_EN | cmci_threshold[bank]);
+     }
+     wrmsrl(MSR_IA32_MCx_CTL2(bank), val);
+     raw_spin_unlock_irqrestore(&cmci_discover_lock, flags);
+}

Is there a reason, or a comment explaining, why the CMCI interrupt needs
to be disabled/enabled here when a CMCI storm turns on/off? If not, then
reuse mce_intel_handle_storm() to avoid duplicating the code.


As explained in another email, the reason is actually mentioned in the
cover letter: "because Zhaoxin's UCR error is not reported through CMCI",
and we want to disable the CMCI interrupt when a CMCI storm happens.

So this is just because you want to disable CMCI when a CMCI storm happens.
This doesn't explain much to me.
What's the problem if CMCI is not disabled when a CMCI storm happens?


In practice, we have encountered a lot of CE errors, such as DRAM CE
errors, so it feels safer to disable the CMCI interrupt than to set a
large threshold. At the same time, Zhaoxin's UCR errors are not reported
through CMCI, so we implemented it like this.
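
To make the difference under discussion concrete, here is a minimal user-space sketch of the two ways to update the MCi_CTL2 value when a storm starts or ends. It only models the bit manipulation; there is no MSR access and no locking. The function names are hypothetical, and the numeric constants are assumed to mirror the kernel's MCI_CTL2_CMCI_EN, MCI_CTL2_CMCI_THRESHOLD_MASK and CMCI_STORM_THRESHOLD definitions. The second variant is one reading of the alternative raised in review (leave the interrupt enabled and only raise the threshold), not a copy of any existing kernel function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to mirror the kernel's definitions for IA32_MCi_CTL2. */
#define CTL2_CMCI_EN         (1ULL << 30)  /* MCI_CTL2_CMCI_EN */
#define CTL2_THRESHOLD_MASK  0x7fffULL     /* MCI_CTL2_CMCI_THRESHOLD_MASK */
#define STORM_THRESHOLD      32749ULL      /* CMCI_STORM_THRESHOLD (assumed) */

/*
 * Variant from the quoted patch: when a storm starts, clear CMCI_EN and
 * set the storm threshold; when it ends, restore the per-bank threshold
 * and set CMCI_EN again. Masking normal_thr is an extra guard that the
 * patch itself does not have.
 */
static uint64_t ctl2_storm_disable_cmci(uint64_t val, bool on,
                                        uint64_t normal_thr)
{
    if (on) {
        val &= ~(CTL2_CMCI_EN | CTL2_THRESHOLD_MASK);
        val |= STORM_THRESHOLD;
    } else {
        val &= ~CTL2_THRESHOLD_MASK;
        val |= CTL2_CMCI_EN | (normal_thr & CTL2_THRESHOLD_MASK);
    }
    return val;
}

/*
 * Alternative: never clear CMCI_EN on storm start, only swap the
 * threshold field. Storm end is handled the same way as above.
 */
static uint64_t ctl2_storm_keep_cmci(uint64_t val, bool on,
                                     uint64_t normal_thr)
{
    val &= ~CTL2_THRESHOLD_MASK;
    if (on)
        val |= STORM_THRESHOLD;
    else
        val |= CTL2_CMCI_EN | (normal_thr & CTL2_THRESHOLD_MASK);
    return val;
}
```

Starting from a bank with CMCI enabled and a threshold of 1, the first helper returns a value with CMCI_EN cleared on storm start, while the second keeps CMCI_EN set; both return the same value when the storm ends.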

Sincerely
TonyWWang-oc