Re: [PATCH] x86: fix system without memory on node0

From: Yinghai Lu
Date: Wed May 13 2009 - 13:45:42 EST


Jack Steiner wrote:
> On Tue, May 12, 2009 at 06:34:31PM -0700, Yinghai Lu wrote:
>> Jack found a crash on a system that doesn't have memory on node0.
>>
>> It turns out that with the per_cpu changeset, node_number for the BSP will
>> always be 0, which is inconsistent with cpu_to_node(), which already points
>> to the nearest node. This happens when numa_set_node() for node0 is called
>> early, before the per_cpu area is set up.
>>
>> Set node_number for the boot cpu after the per_cpu area has been set up.
>>
>> [ Impact: fix crashing on memoryless node 0]
>>
>> Reported-by: Jack Steiner <steiner@xxxxxxx>
>> Signed-off-by: Yinghai Lu <yinghai@xxxxxxxxxx>
>>
>> ---
>> arch/x86/kernel/setup_percpu.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> Index: linux-2.6/arch/x86/kernel/setup_percpu.c
>> ===================================================================
>> --- linux-2.6.orig/arch/x86/kernel/setup_percpu.c
>> +++ linux-2.6/arch/x86/kernel/setup_percpu.c
>> @@ -423,6 +423,14 @@ void __init setup_per_cpu_areas(void)
>> early_per_cpu_ptr(x86_cpu_to_node_map) = NULL;
>> #endif
>>
>> +#if defined(CONFIG_X86_64) && defined(CONFIG_NUMA)
>> + /*
>> + * make sure boot cpu node_number is right, when boot cpu is on the
>> + * node that doesn't have mem installed
>> + */
>> + per_cpu(node_number, boot_cpu_id) = cpu_to_node(boot_cpu_id);
>> +#endif
>> +
>> /* Setup node to cpumask map */
>> setup_node_to_cpumask_map();
>>
>
> With the patch above PLUS the patch below, I verified that all of our strange
> configurations boot to shell prompt & run simple commands. There are certainly
> some corner cases that have not been tested.
>
> Note that both patches are required. The system panics in early boot if either
> patch is omitted.
>
> ---
>
>
> Ignore offline nodes when building the zone lists. This
> fix is needed to support configurations that have PXMs with
> cpus but no memory.
>
>
> Signed-off-by: Jack Steiner <steiner@xxxxxxx>
>
>
> ---
> mm/page_alloc.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> Index: linux/mm/page_alloc.c
> ===================================================================
> --- linux.orig/mm/page_alloc.c 2009-05-12 17:06:59.000000000 -0500
> +++ linux/mm/page_alloc.c 2009-05-13 09:54:09.000000000 -0500
> @@ -2370,6 +2370,8 @@ static void build_zonelists(pg_data_t *p
> * If another node is sufficiently far away then it is better
> * to reclaim pages in a zone before going off node.
> */
> + if (!node_online(node))
> + continue;
> if (distance > RECLAIM_DISTANCE)
> zone_reclaim_mode = 1;
>

That means that node_states[N_HIGH_MEMORY] is still not right.

It should be populated by
/*
 * early_calculate_totalpages()
 * Sum pages in active regions for movable zone.
 * Populate N_HIGH_MEMORY for calculating usable_nodes.
 */
static unsigned long __init early_calculate_totalpages(void)
{
	int i;
	unsigned long totalpages = 0;

	for (i = 0; i < nr_nodemap_entries; i++) {
		unsigned long pages = early_node_map[i].end_pfn -
						early_node_map[i].start_pfn;
		totalpages += pages;
		if (pages)
			node_set_state(early_node_map[i].nid, N_HIGH_MEMORY);
	}
	return totalpages;
}


N_HIGH_MEMORY is also set in

void __init free_area_init_nodes(unsigned long *max_zone_pfn)
...

Is that broken somehow?

YH