Re: linux 5.12 - fails to boot - soft lockup - CPU#0 stuck for 23s! - RIP smp_call_function_single

From: James Feeney
Date: Tue May 25 2021 - 00:02:39 EST


On 5/24/21 1:51 AM, Borislav Petkov wrote:
> On Sun, May 23, 2021 at 05:02:01PM -0600, James Feeney wrote:
>> Ha! Yes, your patch *is* the culprit. You don't trust git bisect?
>
> Well, git-bisect can very easily veer off into the fields if the
> bisector makes a mistake. I'm not saying you have made a mistake but I
> have done that mistake a bunch of times and have seen others do it too
> so it is very easy to get lost.
>

Quite so, quite so. But the process, by necessity, is incremental, a step-wise process of increasing knowledge. It is essentially one of the many levels of the same process described in "The Cathedral and the Bazaar". Though, it can be tedious at times.

> And with my patch simply moving the LVT THMR read back in the boot
> order, I was very sceptical how can that even be?!
>
> But...
>
>> "lvtthmr_init: 0x200" != "lvtthmr_init: 0x10200" != "lvtthmr_init: 0x10000"
>
> ... *this* is a good catch, I *think* I know what happens and the next
> patch will confirm my theory, see the end of the mail. Remove all diffs
> you have ontop of your 5.12 kernel, apply the one below, do the exact
> same exercise with it and send me one full dmesg pls.
>

It is not helpful to describe the addition of one pr_info statement in terms of "Remove all diffs you have ontop of your 5.12 kernel". I prefer to apply each of these by hand, so that I can see what is actually happening, and there are no actual other additional changes, other than this pr_info addition to clear_local_APIC() in arch/x86/kernel/apic/apic.c.

>> System Management is *hard*, because it must build upon someone else's
>> undocumented buggy software. Thank Intel.
>
> No, SMM is shit. Complain to the OEMs about it.
>

Hmm - an important ethical question, determining to whom blame should be assigned. I rather take the position, to put this in melodramatic terms, of not allowing little children, crazy people, fanatics, or politicians, to have guns. Intel know - or should know - that hardware OEMs are irresponsible software providers. There is overwhelming evidence, to this. And yet, Intel continue to choose to provide this SMM "gun" to these OEM "crazy people". The outcome, then, simply becomes predictable, and inevitable. I, therefore, prefer to hold Intel, and by extension, AMD, responsible. A "solid ethical foundation" is the most important element in the machine result.

> However, this time I think it is *we* who are not doing stuff as
> correctly as we should but let's see your results first.
>
> Thx.
>

Since I am not familiar with the details, I am interested to know your final analysis. Still, I take note of the present language in the comment, in intel_init_thermal():

====
* The initial value of thermal LVT entries on all APs always reads
* 0x10000 because APs are woken up by BSP issuing INIT-SIPI-SIPI
* sequence to them and LVT registers are reset to 0s except for
* the mask bits which are set to 1s when APs receive INIT IPI.
* If BIOS takes over the thermal interrupt and sets its interrupt
* delivery mode to SMI (not fixed), it restores the value that the
* BIOS has programmed on AP based on BSP's info we saved since BIOS
* is always setting the same value for all threads/cores.
====

But then, it is entirely unclear to me exactly "who is doing what" here, while we see these values, resulting under various circumstances, at different times, as 0x10000, 0x10200, and 0x200.


Addressing the pr_info patch in clear_local_APIC(), first we note the comment "Also used to cleanout any BIOS leftovers during boot", and then discover that clear_local_APIC() is called during power-down, but is *never called* during boot. To be sure that this is true, I've added
pr_info("%s: before : CPU%d, maxlvt: %d\n", __func__, smp_processor_id(), maxlvt);
and
pr_info("%s: after : CPU%d, maxlvt: %d\n", __func__, smp_processor_id(), maxlvt);
around
if (maxlvt >= 5) {...}

since maybe we want to know the actual value of maxlvt during boot, on the possibility that this value might have changed.

With these additional pr_info statements, we see with certainty that clear_local_APIC() is *never* called during boot. Presumably, this implies smp_found_config in init_bsp_APIC(), which otherwise would have called clear_local_APIC(). dmidecode says "SMBIOS 2.4 present".

dmesg log for a power-up to power-down cycle attached. Here is the quick look:

$ grep -C 1 'mcheck\|lvtt\|clear' dmesglog.5.12.info.context.1

[ 0.171645] Booting paravirtualized kernel on bare hardware
[ 0.171648] mcheck_intel_therm_init: lvtthmr_init: 0x200
[ 0.171651] mce: mcheck_init : mcheck_intel_therm_init() returned
[ 0.171657] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 6370452778343963 ns
--
[ 1.251708] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[ 1.255457] intel_init_thermal: CPU0, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
[ 1.258348] CPU0: Thermal monitoring handled by SMI
--
[ 1.451691] .... node #0, CPUs: #1
[ 0.944189] intel_init_thermal: CPU1, lvtthmr_init: 0x200, read: 0x200, misc_enable (low): 0x64952489
[ 1.468467] smp: Brought up 1 node, 2 CPUs
--
[ 153.299300] reboot: Power down
[ 153.304433] clear_local_APIC: before : CPU0, maxlvt: 5
[ 153.304433] clear_local_APIC: before : CPU1, maxlvt: 5
[ 153.304437] clear_local_APIC: CPU1, maxlvt: 5, APIC_LVTTHMR: 0x200, masking LVT
[ 153.311404] clear_local_APIC: CPU0, maxlvt: 5, APIC_LVTTHMR: 0x200, masking LVT
[ 153.316525] clear_local_APIC: after : CPU1, maxlvt: 5
[ 153.325673] clear_local_APIC: after : CPU0, maxlvt: 5
[ 153.346744] clear_local_APIC: before : CPU0, maxlvt: 5
[ 153.353718] clear_local_APIC: CPU0, maxlvt: 5, APIC_LVTTHMR: 0x10200, masking LVT
[ 153.363062] clear_local_APIC: after : CPU0, maxlvt: 5
[ 153.369985] acpi_power_off called


James


> ---
>
> diff --git a/arch/x86/include/asm/mce.h b/arch/x86/include/asm/mce.h
> index ddfb3cad8dff..5ac8b827bc12 100644
> --- a/arch/x86/include/asm/mce.h
> +++ b/arch/x86/include/asm/mce.h
> @@ -296,6 +296,12 @@ struct cper_sec_mem_err;
> extern void apei_mce_report_mem_error(int corrected,
> struct cper_sec_mem_err *mem_err);
>
> +#ifdef CONFIG_X86_THERMAL_VECTOR
> +extern void mcheck_intel_therm_init(void);
> +#else
> +static inline void mcheck_intel_therm_init(void) { }
> +#endif
> +
> /*
> * Enumerate new IP types and HWID values in AMD processors which support
> * Scalable MCA.
> diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c
> index 4a39fb429f15..f21009786877 100644
> --- a/arch/x86/kernel/apic/apic.c
> +++ b/arch/x86/kernel/apic/apic.c
> @@ -1162,6 +1162,10 @@ void clear_local_APIC(void)
> #ifdef CONFIG_X86_THERMAL_VECTOR
> if (maxlvt >= 5) {
> v = apic_read(APIC_LVTTHMR);
> +
> + pr_info("%s: CPU%d, maxlvt: %d, APIC_LVTTHMR: 0x%x, masking LVT\n",
> + __func__, smp_processor_id(), maxlvt, v);
> +
> apic_write(APIC_LVTTHMR, v | APIC_LVT_MASKED);
> }
> #endif
> diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
> index bf7fe87a7e88..ded20b8612fe 100644
> --- a/arch/x86/kernel/cpu/mce/core.c
> +++ b/arch/x86/kernel/cpu/mce/core.c
> @@ -2190,6 +2190,7 @@ __setup("mce", mcheck_enable);
>
> int __init mcheck_init(void)
> {
> + mcheck_intel_therm_init();
> mce_register_decode_chain(&early_nb);
> mce_register_decode_chain(&mce_uc_nb);
> mce_register_decode_chain(&mce_default_nb);
> diff --git a/drivers/thermal/intel/therm_throt.c b/drivers/thermal/intel/therm_throt.c
> index f8e882592ba5..0ebd2386839f 100644
> --- a/drivers/thermal/intel/therm_throt.c
> +++ b/drivers/thermal/intel/therm_throt.c
> @@ -621,19 +621,30 @@ bool x86_thermal_enabled(void)
> return atomic_read(&therm_throt_en);
> }
>
> +void __init mcheck_intel_therm_init(void)
> +{
> + /*
> + * This function is only called on boot CPU. Save the init thermal
> + * LVT value on BSP and use that value to restore APs' thermal LVT
> + * entry BIOS programmed later
> + */
> + if (intel_thermal_supported(&boot_cpu_data)) {
> + lvtthmr_init = apic_read(APIC_LVTTHMR);
> + pr_info("%s: lvtthmr_init: 0x%x\n", __func__, lvtthmr_init);
> + } else {
> + pr_info("%s: !intel_thermal_supported\n", __func__);
> + }
> +}
> +
> void intel_init_thermal(struct cpuinfo_x86 *c)
> {
> unsigned int cpu = smp_processor_id();
> int tm2 = 0;
> - u32 l, h;
> + u32 l, h, tmp = -1;
>
> if (!intel_thermal_supported(c))
> return;
>
> - /* On the BSP? */
> - if (c == &boot_cpu_data)
> - lvtthmr_init = apic_read(APIC_LVTTHMR);
> -
> /*
> * First check if its enabled already, in which case there might
> * be some SMM goo which handles it, so we can't even put a handler
> @@ -652,13 +663,17 @@ void intel_init_thermal(struct cpuinfo_x86 *c)
> * BIOS has programmed on AP based on BSP's info we saved since BIOS
> * is always setting the same value for all threads/cores.
> */
> - if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED)
> + if ((h & APIC_DM_FIXED_MASK) != APIC_DM_FIXED) {
> apic_write(APIC_LVTTHMR, lvtthmr_init);
> + tmp = apic_read(APIC_LVTTHMR);
> + }
>
> + pr_info("%s: CPU%d, lvtthmr_init: 0x%x, read: 0x%x, misc_enable (low): 0x%x\n",
> + __func__, cpu, lvtthmr_init, tmp, l);
>
> if ((l & MSR_IA32_MISC_ENABLE_TM1) && (h & APIC_DM_SMI)) {
> if (system_state == SYSTEM_BOOTING)
> - pr_debug("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> + pr_info("CPU%d: Thermal monitoring handled by SMI\n", cpu);
> return;
> }
>
>

Attachment: dmesglog.5.12.info.context.1
Description: Unix manual page