Re: linux-next regression: SNP Guest boot hangs with certain cpu/mem config combination

From: Aithal, Srikanth
Date: Fri Mar 28 2025 - 05:19:46 EST


On 3/28/2025 2:39 PM, Kirill A. Shutemov wrote:
On Fri, Mar 28, 2025 at 10:28:19AM +0200, Kirill A. Shutemov wrote:
On Thu, Mar 27, 2025 at 07:39:22PM +0200, Kirill A. Shutemov wrote:
On Thu, Mar 27, 2025 at 11:02:24AM -0400, Steven Rostedt wrote:
On Thu, 27 Mar 2025 16:43:43 +0200
"Kirill A. Shutemov" <kirill.shutemov@xxxxxxxxxxxxxxx> wrote:

The only option I see so far is to drop the static branch from this path.

But I am not sure if it is the only case where we use a static branch from
CPU hotplug callbacks.

Any other ideas?


Hmmm, didn't take too close a look here, but there is the
static_key_slow_dec_cpuslocked() variant, would that work here? Is the issue
that the caller may or may not hold the cpu_hotplug lock?

Yes. This is generic page alloc path and it can be called with and without
the lock.
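
To make the "with and without the lock" problem concrete, here is a minimal
user-space sketch. It is only a model under stated assumptions: hotplug_lock,
key_dec(), key_dec_locked() and run_under_hotplug_lock() are hypothetical
names, not kernel APIs, and a pthread mutex stands in for the CPU hotplug
lock. The model uses trylock so that the case that would hang is reported
instead of actually hanging.

```c
/* User-space model only; none of these names are kernel APIs. */
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t hotplug_lock = PTHREAD_MUTEX_INITIALIZER;
static int key_count = 1;

/* Models a *_cpuslocked() variant: caller must already hold hotplug_lock. */
static void key_dec_locked(void)
{
	if (key_count > 0)
		key_count--;
}

/*
 * Models the plain variant, which takes the lock itself.
 * Returns false when the lock is already held, i.e. the case where a
 * blocking lock would never return.
 */
static bool key_dec(void)
{
	if (pthread_mutex_trylock(&hotplug_lock) != 0)
		return false;
	key_dec_locked();
	pthread_mutex_unlock(&hotplug_lock);
	return true;
}

/* Models a hotplug callback: runs cb with hotplug_lock held. */
static bool run_under_hotplug_lock(bool (*cb)(void))
{
	bool ok;

	pthread_mutex_lock(&hotplug_lock);
	ok = cb();
	pthread_mutex_unlock(&hotplug_lock);
	return ok;
}

/* Same helper, two calling contexts, two different outcomes. */
void demo(void)
{
	assert(key_dec());                         /* lock not held: works */
	key_count = 1;
	assert(!run_under_hotplug_lock(key_dec));  /* lock held: would hang */
}
```

A generic path that can be entered from both contexts cannot pick either
variant statically, which is the difficulty described above.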

Note, it's not the static_branch that is an issue, it's enabling/disabling
the static branch that is. Changing a static branch takes a bit of work as
it does modify the kernel text.

Is it possible to delay the update via a workqueue?

Ah. Good point. Should work. I'll give it a try.

The patch below fixes the problem for me.

Ah. No, it won't work. We can get there before workqueues are initialized:
mm_core_init() is called before workqueue_init_early().

We cannot queue work at that point. :/

Steven, any other ideas?


I have booted the guest with different memory and CPU combinations and have not seen any failures with the fix so far. Are there any other scenarios that could trigger the above case? Please let me know.