Re: Early boot regression from f0551af0213 ("x86/topology: Ignore non-present APIC IDs in a present package")

From: Thomas Gleixner
Date: Thu Apr 25 2024 - 17:42:54 EST


Lyude!

On Thu, Apr 25 2024 at 11:56, Lyude Paul wrote:
> On Thu, 2024-04-25 at 04:11 +0200, Thomas Gleixner wrote:
>>
>> Can you please boot a kernel with the commit in question reverted and
>> add 'possible_cpus=8' to the kernel command line?
>>
>> In theory this should fail too.
>
> Yep - tried booting a kernel with f0551af0213 reverted and
> possible_cpus=8, it definitely looks like that crashes things as well
> in the same way.

Good. That means it's a problem which existed before but went unnoticed.

> Also - it scrolled off the screen before I had a chance to write it
> down, but I'm -fairly- sure I saw some sort of complaint about "16 [or
> some double digit number] processors exceeds max number of 8". Which
> is quite interesting, as this is definitely just a quad core ryzen
> processor with hyperthreading - so there should only be 8 threads.

Right, that's what we saw with the debug patch. The ACPI/MADT table
is clearly bonkers. The effect of it is that it pretends that the system
has 16 possible CPUs:

[ 0.089381] CPU topo: Allowing 8 present CPUs plus 8 hotplug CPUs

Which in turn changes the sizing of the per CPU data and affects some
other details which depend on the number of possible CPUs.

But that should not matter at all, because sizing the system for 8 CPUs
should be sufficient. For some completely non-obvious reason it is not.

Can you please try to increase N in possible_cpus=N on the command line
one by one and check when it actually starts to "work" again?

One other thing to try is to boot with 'possible_cpus=8' and
'intremap=off' and see whether that makes a difference.

I really have no idea where to look, and not having the early boot
messages from the failing boot does not help, as I can't add meaningful
debug output without them.

I just checked: the motherboard has a serial port, so it would be
extremely helpful to hook up a serial cable to this thing and enable
the serial console on the kernel command line. That way we might
eventually see information which is emitted before it fails to validate
the timer interrupt.
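
A minimal sketch of such a command line, assuming the port shows up as
ttyS0 and the other end of the cable is set to 115200 baud (both are
assumptions which need to be checked against the actual hardware):

    possible_cpus=8 earlyprintk=serial,ttyS0,115200 console=ttyS0,115200

earlyprintk gets messages out on the wire before the regular console
driver is registered, which is the part of the boot we are missing.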

Thanks,

tglx