x86/mce/therm_throt incorrect THERM_STATUS_CLEAR_CORE_MASK?
From: Arnd Bergmann
Date: Thu Jun 02 2022 - 05:20:24 EST
I have a Xeon W-2265 (family 6, model 85, stepping 7) that started
constantly spewing messages from the therm_throt driver after one
core overheated:
May 31 13:57:54 kernel: [15512.209474] unchecked MSR access error: WRMSR to 0x19c (tried to write 0x0000000000002a80) at rIP: 0xffffffff9f67f974 (native_write_msr+0x4/0x20)
May 31 13:57:54 kernel: [15512.209486] Call Trace:
May 31 13:57:54 kernel: [15512.209488] <TASK>
May 31 13:57:54 kernel: [15512.209489] ? throttle_active_work+0xea/0x1f0
May 31 13:57:54 kernel: [15512.209498] process_one_work+0x21d/0x3c0
May 31 13:57:54 kernel: [15512.209502] worker_thread+0x4d/0x3f0
May 31 13:57:54 kernel: [15512.209505] ? process_one_work+0x3c0/0x3c0
May 31 13:57:54 kernel: [15512.209508] kthread+0x127/0x150
May 31 13:57:54 kernel: [15512.209510] ? set_kthread_struct+0x40/0x40
May 31 13:57:54 kernel: [15512.209513] ret_from_fork+0x1f/0x30
...
May 31 13:57:59 kernel: [15517.333445] CPU11: Core temperature is above threshold, cpu clock is throttled (total events = 3)
I could not find CPU-model-specific documentation for this register,
but I see that in [1], bits 13 through 15 are marked as reserved
in some cases but not in others. Manually writing the value 0xa80
instead of 0x2a80 from user space makes the warnings stop, so
my guess is that this CPU does not support the 0x2000 bit (bit 13):
$ sudo wrmsr -p 11 0x19c 0xa80 ; dmesg
[177764.874555] msr: Write to unrecognized MSR 0x19c by wrmsr (pid: 142969).
[177764.874560] msr: See https://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git/about for details.
[177765.371180] CPU11: Core temperature/speed normal (total events = 42)
[177765.371180] CPU23: Core temperature/speed normal (total events = 42)
I have not tried the patch below, but I think this would address it on my
system, while likely breaking other machines. Any ideas what the
correct fix is?
Arnd
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index 8352083b87c7..620d7f4c013e 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -196,7 +196,7 @@ static const struct attribute_group thermal_attr_group = {
 #define THERM_THROT_POLL_INTERVAL HZ
 #define THERM_STATUS_PROCHOT_LOG BIT(1)
-#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11) | BIT(13) | BIT(15))
+#define THERM_STATUS_CLEAR_CORE_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11))
 #define THERM_STATUS_CLEAR_PKG_MASK (BIT(1) | BIT(3) | BIT(5) | BIT(7) | BIT(9) | BIT(11))
 static void clear_therm_status_log(int level)
[1] https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.pdf