Re: [PATCH] numa, mem-hotplug: Fix stack overflow in numa when setting kernel nodes to unhotpluggable.

From: Tang Chen
Date: Mon Jan 27 2014 - 22:31:28 EST


On 01/28/2014 10:55 AM, Dave Jones wrote:
On Tue, Jan 28, 2014 at 09:01:25AM +0800, Tang Chen wrote:
> On 01/28/2014 08:32 AM, David Rientjes wrote:
> > On Wed, 22 Jan 2014, David Rientjes wrote:
> >
> >>> arch/x86/mm/numa.c | 2 +-
> >>> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>>
> >>> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> >>> index 81b2750..ebefeb7 100644
> >>> --- a/arch/x86/mm/numa.c
> >>> +++ b/arch/x86/mm/numa.c
> >>> @@ -562,10 +562,10 @@ static void __init numa_init_array(void)
> >>> }
> >>> }
> >>>
> >>> +static nodemask_t numa_kernel_nodes __initdata;
> >>> static void __init numa_clear_kernel_node_hotplug(void)
> >>> {
> >>> int i, nid;
> >>> - nodemask_t numa_kernel_nodes;
> >>> unsigned long start, end;
> >>> struct memblock_type *type = &memblock.reserved;
> >>>
> >>
> >> Isn't this also a bugfix since you never initialize numa_kernel_nodes when
> >> it's allocated on the stack with NODE_MASK_NONE?
> >>
> >
> > This hasn't been answered, and the patch still isn't in linux-kernel, yet
> > Dave tested it as good. I'm suspicious of the changelog that indicates
> > this nodemask is the result of a stack overflow itself which only manages
> > to reproduce itself in the init path slightly more than 50% of the time.
> > How is that possible?
> >
> > I think the changelog should indicate this also fixes an uninitialized
> > nodemask issue.
>
> Hi David,
>
> I'm still working on this problem, but unfortunately I have nothing new for now.
> Testing so far shows no further problems here.
>
> I'm digging into it, but need more time.
>
> I'll resend a new patch and modify the changelog soon. Before we find the
> root cause, I think we can use this patch as a temporary solution.
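
The uninitialized-nodemask point raised in the quoted review can be illustrated with a small userspace sketch. It is not the kernel code: struct sketch_nodemask, SKETCH_MAX_NODES and every sketch_* helper are hypothetical stand-ins for nodemask_t and its accessors, and the node count of 1024 is an assumption. The sketch only shows that an object with static storage duration starts out zeroed in C, while an automatic (stack) object has indeterminate contents until it is cleared explicitly.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's nodemask_t; the size is an assumption. */
#define SKETCH_MAX_NODES 1024
#define SKETCH_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_LONGS \
	((SKETCH_MAX_NODES + SKETCH_BITS_PER_LONG - 1) / SKETCH_BITS_PER_LONG)

struct sketch_nodemask {
	unsigned long bits[SKETCH_LONGS];
};

/* Static storage duration: the C standard guarantees this starts all-zero. */
static struct sketch_nodemask sketch_kernel_nodes;

/* Explicitly empty a mask, in the spirit of assigning NODE_MASK_NONE. */
static void sketch_nodes_clear(struct sketch_nodemask *m)
{
	memset(m, 0, sizeof(*m));
}

static void sketch_node_set(struct sketch_nodemask *m, unsigned int nid)
{
	assert(nid < SKETCH_MAX_NODES);
	m->bits[nid / SKETCH_BITS_PER_LONG] |= 1UL << (nid % SKETCH_BITS_PER_LONG);
}

static bool sketch_node_isset(const struct sketch_nodemask *m, unsigned int nid)
{
	assert(nid < SKETCH_MAX_NODES);
	return (m->bits[nid / SKETCH_BITS_PER_LONG] >> (nid % SKETCH_BITS_PER_LONG)) & 1UL;
}

/* Count the set bits in a mask. */
static unsigned int sketch_nodes_weight(const struct sketch_nodemask *m)
{
	unsigned int nid, count = 0;

	for (nid = 0; nid < SKETCH_MAX_NODES; nid++)
		if (sketch_node_isset(m, nid))
			count++;
	return count;
}

/*
 * Stack variant: the mask must be cleared before use, because an automatic
 * object has indeterminate contents until it is written.
 */
static unsigned int sketch_stack_mask_weight(unsigned int nid)
{
	struct sketch_nodemask on_stack;

	sketch_nodes_clear(&on_stack);	/* omitting this leaves stale bits */
	sketch_node_set(&on_stack, nid);
	return sketch_nodes_weight(&on_stack);
}
```

With the explicit clear in place, sketch_stack_mask_weight() reports exactly one set node; without it the result would depend on whatever was on the stack. Moving the mask to static storage, as the patch above does, gets the zeroing as a side effect, which is the reason the changelog arguably should mention the uninitialized-mask fix as well.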

Ok, I hit the 2nd bug again (oops in next_zones_zonelist...)

I did a bisect with the patch above applied each step of the way.
This time I got a plausible looking result....


a0acda917284183f9b71e2d08b0aa0aea722b321 is the first bad commit
commit a0acda917284183f9b71e2d08b0aa0aea722b321
Author: Tang Chen <tangchen@xxxxxxxxxxxxxx>
Date: Tue Jan 21 15:49:32 2014 -0800

acpi, numa, mem_hotplug: mark all nodes the kernel resides un-hotpluggable


Reverting this commit of course removes the whole function from above,
so we haven't really learned anything new, other than that commit is broken,
even after the above fix-up.

If we revert this commit, memory hot-remove won't be able to work.
Let's try to fix it before the merge window closes.


Dave

