Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single
From: Borislav Petkov
Date: Thu May 20 2021 - 05:21:15 EST
On Wed, May 19, 2021 at 09:12:04PM -0600, James Feeney wrote:
> $ diff .config .config.old
> 4983c4983,4984
> < # CONFIG_X86_THERMAL_VECTOR is not set
> ---
> > CONFIG_X86_THERMAL_VECTOR=y
> > CONFIG_X86_PKG_TEMP_THERMAL=m
>
> No joy. Still have the same soft lockups and full boots - the full
> boots interrupted by some mystery delay.
Which means that it still locks up even with therm_throt disabled, so
my patch can't be the cause.
> I don't know about these patches, modifying and moving the location of
> therm_throt.c, so I'm not in a position to draw any conclusion from
> these results.
They just move the thermal interrupt functionality from the MCE code,
where it doesn't belong, to the thermal code, where it does.
Otherwise there should be no change.
> build 5.11? There are lots of 5.11 kernels from the Arch distribution
> that I have run. Are you looking for a dmesg log from 5.11?
Take the .config you're normally using, make sure it has
CONFIG_X86_THERMAL_VECTOR=y
and build a plain 5.11 kernel with it. No patches on top, nothing else.
Then add
debug ignore_loglevel log_buf_len=16M no_console_suspend systemd.log_target=null console=ttyS0,115200 console=tty0
to its kernel command line and send me full dmesg again pls.
Seeing how it sometimes boots and sometimes locks up, try that a
couple of times.
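Before each test boot it can help to check both prerequisites mechanically. What follows is only a minimal shell sketch, not something taken from this thread: the helper names config_enabled and missing_params are made up for illustration, and the .config path in the usage comment is an assumption about where the build tree lives.

```shell
#!/bin/sh
# Hypothetical helpers, not part of the kernel tree or of this thread.

# Succeeds if the kernel config file $1 sets option $2 to exactly "y".
config_enabled() {
    grep -q "^$2=y\$" "$1"
}

# Prints every expected parameter ($2...) that is absent from the
# command line string $1; prints nothing when all are present.
missing_params() {
    cmdline=" $1 "
    shift
    for p in "$@"; do
        case "$cmdline" in
            *" $p "*) ;;
            *) echo "$p" ;;
        esac
    done
}

# Example use (paths are assumptions):
#   config_enabled .config CONFIG_X86_THERMAL_VECTOR || echo "option not set"
#   missing_params "$(cat /proc/cmdline)" debug ignore_loglevel \
#       log_buf_len=16M no_console_suspend systemd.log_target=null \
#       console=ttyS0,115200 console=tty0
```

If missing_params prints anything on the booted test kernel, the dmesg from that boot was not captured with the requested parameters.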
> So far, something looks quirky - somewhere. Timing related failures
> can be a pain. Is there no useful information being provided by the
> Call Trace in the dmesg log?
What I'm seeing is that *sometimes* - not always - your CPU0 is not
responding to the TLB flush IPI. Which is really weird. Have you always
had those lockups, or did they start appearing with 5.12?
That's why I'm still scratching my head over how my patch would cause
CPU0 to not respond to IPIs.
Well, *maybe* there's one small difference which my patch introduced: it
reads APIC_LVTTHMR only on the BSP. And *maybe* there's a problem there,
who knows with those old CPUs.
So here are two more things to try:
1. On plain 5.12, with the same kernel cmdline params, also add
"idle=nomwait"
to the kernel command line and boot with it a couple of times to see
whether it still locks up.
2. On plain 5.12, with the same kernel cmdline params, apply this hunk
on top:
---
diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
index f8e882592ba5..42db48cd4666 100644
--- a/drivers/thermal/intel/therm_throt.c
+++ b/drivers/thermal/intel/therm_throt.c
@@ -630,9 +630,8 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
 	if (!intel_thermal_supported(c))
 		return;
 
-	/* On the BSP? */
-	if (c == &boot_cpu_data)
-		lvtthmr_init = apic_read(APIC_LVTTHMR);
+	lvtthmr_init = apic_read(APIC_LVTTHMR);
+	pr_info("%s: CPU%d, lvtthmr_init: 0x%x\n", __func__, cpu, lvtthmr_init);
 
 	/*
 	 * First check if its enabled already, in which case there might
---
That'll print the thermal sensor LVT value on both CPUs.
Also do that a couple of times - it'll be interesting to see what those
values are *when* the box locks up.
As always, catch full dmesg each time pls.
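To make the printed values easier to compare across boots, here is a minimal shell sketch under stated assumptions: the helper names lvt_lines and decode_lvt are hypothetical, each boot's dmesg is assumed to have been saved to a file, and the bit layout used (vector in bits 0-7, delivery mode in bits 8-10, mask in bit 16) is the standard local APIC LVT layout rather than anything specific to this patch.

```shell
#!/bin/sh
# Hypothetical helpers for reading the output of the debug hunk above.

# Prints "CPU<n> <value>" for each lvtthmr_init line that the added
# pr_info() left in the saved dmesg file $1.
lvt_lines() {
    sed -n 's/.*intel_init_thermal: CPU\([0-9]*\), lvtthmr_init: \(0x[0-9a-f]*\).*/CPU\1 \2/p' "$1"
}

# Decodes a local APIC LVT value ($1, hex or decimal) into its
# vector (bits 0-7), delivery mode (bits 8-10) and mask bit (bit 16).
decode_lvt() {
    val=$(( $1 ))
    case $(( (val >> 8) & 7 )) in
        0) mode=Fixed ;;
        2) mode=SMI ;;
        4) mode=NMI ;;
        5) mode=INIT ;;
        7) mode=ExtINT ;;
        *) mode=reserved ;;
    esac
    printf 'vector=0x%02x mode=%s masked=%d\n' \
        $(( val & 255 )) "$mode" $(( (val >> 16) & 1 ))
}

# Example use (file name is an assumption):
#   dmesg > dmesg-boot1.txt
#   lvt_lines dmesg-boot1.txt | while read -r cpu val; do
#       echo "$cpu: $(decode_lvt "$val")"
#   done
```

For instance, decode_lvt 0x10200 reports a masked entry in SMI delivery mode, while decode_lvt 0xfa reports an unmasked Fixed entry with vector 0xfa.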
Thx.
--
Regards/Gruss,
Boris.
SUSE Software Solutions Germany GmbH, GF: Felix Imendörffer, HRB 36809, AG Nürnberg