Re: [REPORT] Softlockups on PowerNV with upstream
From: Aditya Gupta
Date: Thu Apr 10 2025 - 07:45:32 EST
Hi,
On 25/04/10 07:25AM, Oscar Salvador wrote:
> On Wed, Apr 09, 2025 at 11:33:44PM +0530, Aditya Gupta wrote:
> > Hi,
> >
> > While booting the current upstream kernel, I consistently get "softlockups" on an IBM PowerNV system.
> >
> > I have only tested it on PowerNV systems, but other architectures/platforms
> > might also be affected. PSeries systems don't have this issue, though.
> >
> > Bisect points to the following commit:
> >
> > commit 61659efdb35ce6c6ac7639342098f3c4548b794b
> > Author: Gavin Shan <gshan@xxxxxxxxxx>
> > Date: Wed Mar 12 09:30:43 2025 +1000
> >
> > drivers/base/memory: improve add_boot_memory_block()
> >
> ...
> > Console log
> > -----------
> >
> > [ 2.783371] smp: Brought up 4 nodes, 256 CPUs
> > [ 2.783475] numa: Node 0 CPUs: 0-63
> > [ 2.783537] numa: Node 2 CPUs: 64-127
> > [ 2.783591] numa: Node 4 CPUs: 128-191
> > [ 2.783653] numa: Node 6 CPUs: 192-255
> > [ 2.804945] Memory: 735777792K/738197504K available (17536K kernel code, 5760K rwdata, 15232K rodata, 6528K init, 2517K bss, 1369664K reserved, 0K cma-reserved)
>
> If I am not mistaken, this is ~700GB; PowerNV uses 16MB as the section size,
> and sections_per_block == 1 (I think).
Yes, the memory is around 700G:
# lsmem
RANGE                                   SIZE  STATE  REMOVABLE  BLOCK
0x0000000000000000-0x0000001fffffffff  128G  online        yes  0-127
0x0000400000000000-0x0000400fffffffff   64G  online        yes  65536-65599
0x0000800000000000-0x0000803fffffffff  256G  online        yes  131072-131327
0x0000c00000000000-0x0000c03fffffffff  256G  online        yes  196608-196863
Memory block size: 1G
Total online memory: 704G
Total offline memory: 0B
I don't know about the sections_per_block.
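Going by the lsmem output though, the block ids run from 0 up to 196863, and only 704 of those ~197k blocks are present, so most of the block (and hence section) number space between the chunks is one big hole. Also, if the section size is 16MB, the 1G block size above would presumably make sections_per_block 64 rather than 1, if I am reading it right.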
>
> The code before the mentioned commit was something like:
>
> for (nr = base_section_nr; nr < base_section_nr + sections_per_block; nr++)
> 	if (present_section_nr(nr))
> 		section_count++;
>
> if (section_count == 0)
> 	return 0;
> return add_memory_block()
>
> So, in the case of PowerNV, we will just check one section at a time and
> either return or call add_memory_block() depending on whether it is present.
>
> Now, with the current code, things are different.
> We now have:
>
> memory_dev_init:
> 	for (nr = 0; nr <= __highest_present_section_nr; nr += 1)
> 		ret = add_boot_memory_block(nr)
>
> add_boot_memory_block:
> 	for_each_present_section_nr(base_section_nr, nr) {
> 		if (nr >= (base_section_nr + sections_per_block))
> 			break;
>
> 		return add_memory_block();
> 	}
> 	return 0;
>
> The thing is that next_present_section_nr() (which is called in
> for_each_present_section_nr()) will loop until we find a present
> section.
> And then we check whether the found section is at or beyond
> base_section_nr + sections_per_block (where sections_per_block = 1).
> If so, we skip add_memory_block().
>
> Now, I think that the issue comes from for_each_present_section_nr()
> having to loop a lot until it finds a present section.
> And then the loop in memory_dev_init() increments only by 1, which means
> that on the next iteration we might have to loop a lot again to find the
> next present section. And so on and so forth.
>
> Maybe we can fix this by making memory_dev_init() remember at which
> section add_boot_memory_block() stopped.
> Something like the following (only compile-tested):
>
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 8f3a41d9bfaa..d97635cbfd1d 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -816,18 +816,25 @@ static int add_memory_block(unsigned long block_id, unsigned long state,
>  	return 0;
>  }
> 
> -static int __init add_boot_memory_block(unsigned long base_section_nr)
> +static int __init add_boot_memory_block(unsigned long *base_section_nr)
>  {
> +	int ret;
>  	unsigned long nr;
> 
> -	for_each_present_section_nr(base_section_nr, nr) {
> -		if (nr >= (base_section_nr + sections_per_block))
> +	for_each_present_section_nr(*base_section_nr, nr) {
> +		if (nr >= (*base_section_nr + sections_per_block))
>  			break;
> 
> -		return add_memory_block(memory_block_id(base_section_nr),
> -					MEM_ONLINE, NULL, NULL);
> +		ret = add_memory_block(memory_block_id(*base_section_nr),
> +				       MEM_ONLINE, NULL, NULL);
> +		*base_section_nr = nr;
> +		return ret;
>  	}
> 
> +	if (nr == -1)
> +		*base_section_nr = __highest_present_section_nr + 1;
> +	else
> +		*base_section_nr = nr;
>  	return 0;
>  }
> 
> @@ -973,9 +980,9 @@ void __init memory_dev_init(void)
>  	 * Create entries for memory sections that were found
>  	 * during boot and have been initialized
>  	 */
> -	for (nr = 0; nr <= __highest_present_section_nr;
> -	     nr += sections_per_block) {
> -		ret = add_boot_memory_block(nr);
> +	nr = first_present_section_nr();
> +	for (; nr <= __highest_present_section_nr; nr += sections_per_block) {
> +		ret = add_boot_memory_block(&nr);
>  		if (ret)
>  			panic("%s() failed to add memory block: %d\n", __func__,
>  			      ret);
>
Makes sense, thanks for the nice explanation.
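
To get a feel for the numbers, I put together a rough userspace sketch (below, not kernel code) that just counts how many sections the forward scan would visit for the layout in the lsmem output above. It assumes sections_per_block == 1, as you suspected, and treats the lsmem block ids as section numbers, so it is an illustration only. For this simplified layout, rescanning from every block id (the current upstream pattern) comes out to roughly 6.4 billion visited sections, while resuming from where the previous scan stopped, as your patch intends, brings it down to one visit per section (~197 thousand). That would explain the softlockups.

/*
 * Rough userspace sketch, not kernel code: count how many sections a
 * forward scan like next_present_section_nr() would visit for the memory
 * layout from the lsmem output above, assuming sections_per_block == 1
 * and treating the lsmem block ids as section numbers.
 */
#include <stdbool.h>
#include <stdio.h>

#define NR_SECTIONS 196864UL	/* highest present block id from lsmem + 1 */

static bool present[NR_SECTIONS];
static unsigned long next_present[NR_SECTIONS];	/* first present section >= nr */
static unsigned long long visited;		/* sections a linear scan would touch */

/* Present ranges taken from the lsmem BLOCK column. */
static const struct { unsigned long start, end; } ranges[] = {
	{ 0, 127 }, { 65536, 65599 }, { 131072, 131327 }, { 196608, 196863 },
};

static void init_layout(void)
{
	unsigned long i, nr, nxt;

	for (i = 0; i < sizeof(ranges) / sizeof(ranges[0]); i++)
		for (nr = ranges[i].start; nr <= ranges[i].end; nr++)
			present[nr] = true;

	/* Walk backwards so next_present[] is filled in a single pass. */
	nxt = NR_SECTIONS;	/* "no further present section" */
	for (i = NR_SECTIONS; i-- > 0; ) {
		if (present[i])
			nxt = i;
		next_present[i] = nxt;
	}
}

/*
 * Stand-in for the forward scan: return the precomputed answer, but account
 * for how many sections a real linear scan starting at nr would have visited.
 */
static unsigned long scan_from(unsigned long nr)
{
	unsigned long found = next_present[nr];

	visited += (found == NR_SECTIONS ? NR_SECTIONS : found + 1) - nr;
	return found;
}

int main(void)
{
	unsigned long nr;

	init_layout();

	/* Current upstream pattern: every block id rescans from scratch. */
	visited = 0;
	for (nr = 0; nr < NR_SECTIONS; nr++)
		scan_from(nr);
	printf("rescan from every block id : %llu sections visited\n", visited);

	/* Patched idea: resume from where the previous scan stopped. */
	visited = 0;
	nr = 0;
	while (nr < NR_SECTIONS) {
		nr = scan_from(nr);
		if (nr == NR_SECTIONS)
			break;
		nr++;
	}
	printf("resume from previous result: %llu sections visited\n", visited);

	return 0;
}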
>
> @Aditya: can you please give it a try?
>
Yes, will try it now.
Thanks,
- Aditya G
>
>
> --
> Oscar Salvador
> SUSE Labs