Re: [PATCH v3 0/9] Parallel CPU bringup for x86_64

From: Paul Menzel
Date: Mon Feb 14 2022 - 08:46:00 EST


Dear David,

On 29.12.21 at 14:54, David Woodhouse wrote:

On Wed, 2021-12-29 at 14:18 +0100, Paul Menzel wrote:

Or the one in
https://lore.kernel.org/lkml/d4cde50b4aab24612823714dfcbe69bc4bb63b60.camel@xxxxxxxxxxxxx

which makes it do nothing except prepare all the CPUs before bringing
them up one at a time?

I applied it on top of the other one, and it made no difference either.

It's possible I missed something else in the prepare stage that doesn't
cope with all CPUs being prepared first.

My next attempt might be to change the loop in bringup_nonboot_cpus()
to bring all the CPUs not to the CPUHP_BP_PARALLEL_DYN state(s) but
instead just bring them to somewhere like CPUHP_RCUTREE_PREP, which is
somewhere in the middle between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.

Then a binary chop search — if that one boots, try maybe
CPUHP_TOPOLOGY_PREPARE. And if not, try CPUHP_PROFILE_PREPARE. Etc.

My current theory (not that I've spent that much time thinking about it
in the last week) is that there's something about the existing CPU
bringup, possibly a CPU bug or something special about the AMD CPUs,
which is triggered by just making it a little bit *faster*, which is
why bringing them up from kexec (especially in qemu) can cause it too?

Would having the serial console enabled make a difference?

Yes. I couldn't make this fail in my EC2 m6a instance (for clean boots;
I have never managed to kexec it) until I turned off the serial console
to make things go faster.

Tom seemed to find that it was in load_TR_desc(), so if you could try
this hack on a machine that doesn't magically wink out of existence on
a triplefault before even flushing its serial output, that would be
much appreciated...

Unfortunately, no more messages were printed on the serial console.

I suppose we need to litter those outputs somewhere earlier in the
trampoline then, perhaps it *isn't* getting to load_TR_desc() in your
case?

Will be back online properly next week and can actually provide some of
the above suggestions in patch form if you're willing to keep testing.

Sorry for replying so late. I saw your v4 patches and tried commit 5e3524d21d2a () from your branch `parallel-5.17-part1`. Unfortunately, the boot problem still persists on the AMD Ryzen 3 2200G system I tested with. Please tell me where I should report these results (here or in reply to the posted v4 patches).

Also, do you have (physical) access to a system with an AMD CPU? If not, maybe we can get you one, so it’s more convenient for you to test.

Kind regards,

Paul
