Re: [RFC Patch V1 00/30] Enable memoryless node on x86 platforms

From: Jiang Liu
Date: Thu Jul 24 2014 - 21:50:18 EST

On 2014/7/25 7:32, Nishanth Aravamudan wrote:
> On 23.07.2014 [16:20:24 +0800], Jiang Liu wrote:
>> On 2014/7/22 1:57, Nishanth Aravamudan wrote:
>>> On 21.07.2014 [10:41:59 -0700], Tony Luck wrote:
>>>> On Mon, Jul 21, 2014 at 10:23 AM, Nishanth Aravamudan
>>>> <nacc@xxxxxxxxxxxxxxxxxx> wrote:
>>>>> It seems like the issue is the order of onlining of resources on a
>>>>> specific x86 platform?
>>>> Yes. When we online a node the BIOS hits us with some ACPI hotplug events:
>>>> First: Here are some new cpus
>>> Ok, so during this period, you might get some remote allocations. Do you
>>> know the topology of these CPUs? That is they belong to a
>>> (soon-to-exist) NUMA node? Can you online that currently offline NUMA
>>> node at this point (so that NODE_DATA()) resolves, etc.)?
>> Hi Nishanth,
>> We have method to get the NUMA information about the CPU, and
>> patch "[RFC Patch V1 30/30] x86, NUMA: Online node earlier when doing
>> CPU hot-addition" tries to solve this issue by onlining NUMA node
>> as early as possible. Actually we are trying to enable memoryless node
>> as you have suggested.
> Ok, it seems like you have two sets of patches then? One is to fix the
> NUMA information timing (30/30 only). The rest of the patches are
> general discussions about where cpu_to_mem() might be used instead of
> cpu_to_node(). However, based upon Tejun's feedback, it seems like
> rather than force all callers to use cpu_to_mem(), we should be looking
> at the core VM to ensure fallback is occuring appropriately when
> memoryless nodes are present.
> Do you have a specific situation, once you've applied 30/30, where
> kmalloc_node() leads to an Oops?
Hi Nishanth,
After following the two threads related to support of memoryless
node and digging more code, I realized my first version path set is an
overkill. As Tejun has pointed out, we shouldn't expose the detail of
memoryless node to normal user, but there are still some special users
who need the detail. So I have tried to summarize it as:
1) Arch code should online corresponding NUMA node before onlining any
CPU or memory, otherwise it may cause invalid memory access when
accessing NODE_DATA(nid).
2) For normal memory allocations without __GFP_THISNODE setting in the
gfp_flags, we should prefer numa_node_id()/cpu_to_node() instead of
numa_mem_id()/cpu_to_mem() because the latter loses hardware topology
information as pointed out by Tejun:
A - B - X - C - D
Where X is the memless node. numa_mem_id() on X would return
either B or C, right? If B or C can't satisfy the allocation,
the allocator would fallback to A from B and D for C, both of
which aren't optimal. It should first fall back to C or B
respectively, which the allocator can't do anymoe because the
information is lost when the caller side performs numa_mem_id().
3) For memory allocation with __GFP_THISNODE setting in gfp_flags,
numa_node_id()/cpu_to_node() should be used if caller only wants to
allocate from local memory, otherwise numa_mem_id()/cpu_to_mem()
should be used if caller wants to allocate from the nearest node.
4) numa_mem_id()/cpu_to_mem() should be used if caller wants to check
whether a page is allocated from the nearest node.

And my v2 patch set is based on above rules.
Any suggestions here?

> Thanks,
> Nish
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at