Re: [REPORT] Softlockups on PowerNV with upstream
From: Gavin Shan
Date: Thu Apr 10 2025 - 01:35:39 EST
On 4/10/25 3:25 PM, Oscar Salvador wrote:
On Wed, Apr 09, 2025 at 11:33:44PM +0530, Aditya Gupta wrote:
Hi,
While booting current upstream kernel, I consistently get "softlockups", on IBM PowerNV system.
I have tested it only on PowerNV systems. But some architectures/platforms also
might have it. PSeries systems don't have this issue though.
Bisect points to the following commit:
commit 61659efdb35ce6c6ac7639342098f3c4548b794b
Author: Gavin Shan <gshan@xxxxxxxxxx>
Date: Wed Mar 12 09:30:43 2025 +1000
drivers/base/memory: improve add_boot_memory_block()
...
Console log
-----------
[ 2.783371] smp: Brought up 4 nodes, 256 CPUs
[ 2.783475] numa: Node 0 CPUs: 0-63
[ 2.783537] numa: Node 2 CPUs: 64-127
[ 2.783591] numa: Node 4 CPUs: 128-191
[ 2.783653] numa: Node 6 CPUs: 192-255
[ 2.804945] Memory: 735777792K/738197504K available (17536K kernel code, 5760K rwdata, 15232K rodata, 6528K init, 2517K bss, 1369664K reserved, 0K cma-reserved)
If I am not mistaken this is ~700GB, and PowerNV uses 16MB as section size,
and sections_per_block == 1 (I think).
The code before the mentioned commit, was something like:
for (nr = base_section_nr; nr < base_section_nr + sections_per_block; nr++)
if (present_section_nr(nr))
section_count++;
if (section_count == 0)
return 0;
return add_memory_block()
So, in case of PowerNV , we will just check one section at a time and
either return or call add_memory_block depending whether it is present.
Now, with the current code that is something different.
We now have
memory_dev_init:
for(nr = 0, nr <= __highest_present_section_nr; nr += 1)
ret = add_boot_memory_block
add_boot_memory_block:
for_each_present_section_nr(base_section_nr, nr) {
if (nr >= (base_section_nr + sections_per_block))
break;
return add_memory_block();
}
return 0;
The thing is that next_present_section_nr() (which is called in
for_each_present_section_nr()) will loop until we find a present
section.
And then we will check whether the found section is beyond
base_section_nr + sections_per_block (where sections_per_block = 1).
If so, we skip add_memory_block.
Now, I think that the issue comes from for_each_present_section_nr
having to loop a lot until we find a present section.
And then the loop in memory_dev_init increments only by 1, which means
that the next iteration we might have to loop a lot again to find the
another present section. And so on and so forth.
Maybe we can fix this by making memory_dev_init() remember in which
section add_boot_memory_block returns.
Something like the following (only compile-tested)
Thanks, Oscar. You're correct that the overhead is introduced by for_each_present_section_nr().
I already had the fix, working on IBM's Power9 machine, where the issue can be
reproduced. Please see the attached patch.
I'm having most tests on ARM64 machine for the fix.
Thanks,
Gavin
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8f3a41d9bfaa..d97635cbfd1d 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -816,18 +816,25 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
return 0;
}
-static int __init add_boot_memory_block(unsigned long base_section_nr)
+static int __init add_boot_memory_block(unsigned long *base_section_nr)
{
+ int ret;
unsigned long nr;
- for_each_present_section_nr(base_section_nr, nr) {
- if (nr >= (base_section_nr + sections_per_block))
+ for_each_present_section_nr(*base_section_nr, nr) {
+ if (nr >= (*base_section_nr + sections_per_block))
break;
- return add_memory_block(memory_block_id(base_section_nr),
- MEM_ONLINE, NULL, NULL);
+ ret = add_memory_block(memory_block_id(*base_section_nr),
+ MEM_ONLINE, NULL, NULL);
+ *base_section = nr;
+ return ret;
}
+ if (nr == -1)
+ *base_section = __highest_present_section_nr + 1;
+ else
+ *base_section = nr;
return 0;
}
@@ -973,9 +980,9 @@ void __init memory_dev_init(void)
* Create entries for memory sections that were found
* during boot and have been initialized
*/
- for (nr = 0; nr <= __highest_present_section_nr;
- nr += sections_per_block) {
- ret = add_boot_memory_block(nr);
+ nr = first_present_section_nr();
+ for (; nr <= __highest_present_section_nr; nr += sections_per_block) {
+ ret = add_boot_memory_block(&nr);
if (ret)
panic("%s() failed to add memory block: %d\n", __func__,
ret);
@Aditya: can you please give it a try?
From d4c43d5f6b962144c4f47d46a66284df92da285e Mon Sep 17 00:00:00 2001
From: Gavin Shan <gshan@xxxxxxxxxx>
Date: Thu, 10 Apr 2025 14:43:46 +1000
Subject: [PATCH] drivers/base/memory: Avoid overhead
for_each_present_section_nr()
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
for_each_present_section_nr() was introduced to add_boot_memory_block()
by commit 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()").
It causes unnecessary overhead when the present sections are really
sparse. next_present_section_nr() called by the macro finds the next
present section, which is far away from the spanning sections in the
specified block. Too much time consumed by next_present_section_nr()
in this case, which can lead to softlockup as observed by Aditya Gupta
on IBM Power10 machine.
watchdog: BUG: soft lockup - CPU#248 stuck for 22s! [swapper/248:1]
Modules linked in:
CPU: 248 UID: 0 PID: 1 Comm: swapper/248 Not tainted 6.15.0-rc1-next-20250408 #1 VOLUNTARY
Hardware name: 9105-22A POWER10 (raw) 0x800200 opal:v7.1-107-gfda75d121942 PowerNV
NIP: c00000000209218c LR: c000000002092204 CTR: 0000000000000000
REGS: c00040000418fa30 TRAP: 0900 Not tainted (6.15.0-rc1-next-20250408)
MSR: 9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE> CR: 28000428 XER: 00000000
CFAR: 0000000000000000 IRQMASK: 0
GPR00: c000000002092204 c00040000418fcd0 c000000001b08100 0000000000000040
GPR04: 0000000000013e00 c000c03ffebabb00 0000000000c03fff c000400fff587f80
GPR08: 0000000000000000 00000000001196f7 0000000000000000 0000000028000428
GPR12: 0000000000000000 c000000002e80000 c00000000001007c 0000000000000000
GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR24: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
GPR28: c000000002df7f70 0000000000013dc0 c0000000011dd898 0000000008000000
NIP [c00000000209218c] memory_dev_init+0x114/0x1e0
LR [c000000002092204] memory_dev_init+0x18c/0x1e0
Call Trace:
[c00040000418fcd0] [c000000002092204] memory_dev_init+0x18c/0x1e0 (unreliable)
[c00040000418fd50] [c000000002091348] driver_init+0x78/0xa4
[c00040000418fd70] [c0000000020063ac] kernel_init_freeable+0x22c/0x370
[c00040000418fde0] [c0000000000100a8] kernel_init+0x34/0x25c
[c00040000418fe50] [c00000000000cd94] ret_from_kernel_user_thread+0x14/0x1c
Avoid the overhead by folding for_each_present_section_nr() to the outer
loop. add_boot_memory_block() is dropped after that.
Fixes: 61659efdb35c ("drivers/base/memory: improve add_boot_memory_block()")
Closes: https://lore.kernel.org/linux-mm/20250409180344.477916-1-adityag@xxxxxxxxxxxxx
Reported-by: Aditya Gupta <adityag@xxxxxxxxxxxxx>
Signed-off-by: Gavin Shan <gshan@xxxxxxxxxx>
---
drivers/base/memory.c | 34 ++++++++++++----------------------
1 file changed, 12 insertions(+), 22 deletions(-)
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 8f3a41d9bfaa..433a5fe96304 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -816,21 +816,6 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
return 0;
}
-static int __init add_boot_memory_block(unsigned long base_section_nr)
-{
- unsigned long nr;
-
- for_each_present_section_nr(base_section_nr, nr) {
- if (nr >= (base_section_nr + sections_per_block))
- break;
-
- return add_memory_block(memory_block_id(base_section_nr),
- MEM_ONLINE, NULL, NULL);
- }
-
- return 0;
-}
-
static int add_hotplug_memory_block(unsigned long block_id,
struct vmem_altmap *altmap,
struct memory_group *group)
@@ -957,7 +942,7 @@ static const struct attribute_group *memory_root_attr_groups[] = {
void __init memory_dev_init(void)
{
int ret;
- unsigned long block_sz, nr;
+ unsigned long block_sz, block_id, nr;
/* Validate the configured memory block size */
block_sz = memory_block_size_bytes();
@@ -973,12 +958,17 @@ void __init memory_dev_init(void)
* Create entries for memory sections that were found
* during boot and have been initialized
*/
- for (nr = 0; nr <= __highest_present_section_nr;
- nr += sections_per_block) {
- ret = add_boot_memory_block(nr);
- if (ret)
- panic("%s() failed to add memory block: %d\n", __func__,
- ret);
+ block_id = ULONG_MAX;
+ for_each_present_section_nr(0, nr) {
+ if (block_id != ULONG_MAX && memory_block_id(nr) == block_id)
+ continue;
+
+ block_id = memory_block_id(nr);
+ ret = add_memory_block(block_id, MEM_ONLINE, NULL, NULL);
+ if (ret) {
+ panic("%s() failed to add memory block: %d\n",
+ __func__, ret);
+ }
}
}
--
2.48.1