On Wed, 2021-12-29 at 14:18 +0100, Paul Menzel wrote:
Or the one in
https://lore.kernel.org/lkml/d4cde50b4aab24612823714dfcbe69bc4bb63b60.camel@xxxxxxxxxxxxx
which makes it do nothing except prepare all the CPUs before bringing
them up one at a time?
I applied it on top the other one, and it made no difference either.
It's possible I missed something else in the prepare stage that doesn't
cope with all CPUs being prepared first.
My next attempt might be to change the loop in bringup_nonboot_cpus()
to bring all the CPUs not to the CPUHP_BP_PARALLEL_DYN state(s) but
instead just bring them to somewhere like CPUHP_RCUTREE_PREP, which is
somewhere in the middle between CPUHP_OFFLINE and CPUHP_BRINGUP_CPU.
Then a binary chop search — if that one boots, try maybe
CPUHP_TOPOLOGY_PREPARE. And if not, try CPUHP_PROFILE_PREPARE. Etc.
My current theory (not that I've spent that much time thinking about it
in the last week) is that there's something about the existing CPU
bringup, possibly a CPU bug or something special about the AMD CPUs,
which is triggered by just making it a little bit *faster*, which is
why bringing them up from kexec (especially in qemu) can cause it too?
Would having the serial console enabled make a difference?
Yes. I couldn't make this fail in my EC2 m6a instance (for clean boots;
I have never managed to kexec it) until I turned off the serial console
to make things go faster.
Tom seemed to find that it was in load_TR_desc(), so if you could try
this hack on a machine that doesn't magically wink out of existence on
a triplefault before even flushing its serial output, that would be
much appreciated...
Unfortunately, no more messages were printed on the serial console.
I suppose we need to litter those outputs somewhere earlier in the
trampoline then, perhaps it *isn't* getting to load_TR_desc() in your
case?
Will be back online properly next week and can actually provide some of
the above suggestions in patch form if you're willing to keep testing.