Re: Memory policy question for NUMA arch....

From: Lee Schermerhorn
Date: Wed Apr 07 2010 - 13:28:12 EST

On Wed, 2010-04-07 at 08:48 -0700, Rick Sherm wrote:
> Hi Andi,
> --- On Wed, 4/7/10, Andi Kleen <andi@xxxxxxxxxxxxxx> wrote:
> > On Tue, Apr 06, 2010 at 01:46:44PM -0700, Rick Sherm wrote:
> > > On a NUMA host, if a driver calls __get_free_pages() then
> > > it will eventually invoke ->alloc_pages_current(..). The comment
> > > above/within alloc_pages_current() says 'current->mempolicy' will be
> > > used. So what memory policy will kick in if the driver is trying to
> > > allocate some memory blocks during driver load time (say from
> > > probe_one)? System-wide default policy, correct?
> >
> > Actually the policy of the modprobe or the kernel boot up if built in
> > (which is interleaving)
> >
> Interleaving, yup, that's what I thought. I have tight control over the
> environment. So for one driver I need high throughput and I will use the
> interleaving policy. But for the other 2-3 drivers I need low latency, so
> I would like to restrict them to the local node. These are just my
> thoughts, but I'll have to experiment and see what the numbers look like.
> Once I have some numbers I will post them in a few weeks.
> > >
> > > What if the driver wishes to i) stay confined to a 'cpulist' OR
> > > ii) use a different mem-policy? How do I achieve this?
> > > I will choose the 'cpulist' after I am successfully able to
> > > affinitize the MSI-X vectors.
> >
> > You can do that right now by running numactl ... modprobe ...
> >
> Perfect. OK, then I'll probably write a simple user-space wrapper:
> 1) set the mem-policy type depending on driver-foo-M.
> 2) load driver-foo-M.
> 3) goto 1) and repeat for the other driver[s]-foo-X.
> BTW - I would know beforehand which adapter is placed in which slot
> and so I will be able to deduce its proximity to a node.
> > Yes there should be probably a better way, like using a policy
> > based on the affinity of the PCI device.
> >


If you want/need to use __get_free_page(), you will need to set the
current task's memory policy. If you're loading the driver from user
space, then you can set the mempolicy of the task [shell, modprobe, ...]
using numactl as you suggest above. From within the kernel, you'd need
to temporarily change current's mempolicy to what you need and then put
it back. We don't have a formal interface to do this, I think, but such
could be added.
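
For the user-space route, a wrapper along the lines Rick describes could be
sketched roughly as below. This is only an illustration, not a tested tool:
the helper names nodes_to_mask() and load_module_with_policy() are made up
for this example, the policy constants are copied by hand from the kernel's
<linux/mempolicy.h>, and the single-word mask assumes node numbers below the
width of an unsigned long. It relies on the task policy surviving the exec,
which is the same thing "numactl ... modprobe ..." depends on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Policy modes as defined in the kernel's <linux/mempolicy.h>; repeated
 * here (with hypothetical names) so the sketch builds without libnuma. */
enum { POL_DEFAULT = 0, POL_PREFERRED = 1, POL_BIND = 2, POL_INTERLEAVE = 3 };

/* Build a one-word node mask with one bit set per listed node.
 * Nodes that do not fit in an unsigned long are silently ignored. */
unsigned long nodes_to_mask(const int *nodes, int count)
{
    unsigned long mask = 0;

    for (int i = 0; i < count; i++)
        if (nodes[i] >= 0 && nodes[i] < (int)(8 * sizeof(mask)))
            mask |= 1UL << nodes[i];
    return mask;
}

/* Set the calling task's memory policy, then replace this process with
 * modprobe so the module's init code runs under that policy.
 * Only returns (with -1) if something failed. */
int load_module_with_policy(int mode, unsigned long mask, const char *module)
{
    assert(module != NULL);

    /* maxnode is passed as "bits in the mask + 1", a common convention
     * because the kernel ignores the top bit position it is given. */
    if (syscall(SYS_set_mempolicy, mode, &mask, 8 * sizeof(mask) + 1) != 0) {
        perror("set_mempolicy");
        return -1;
    }
    execlp("modprobe", "modprobe", module, (char *)NULL);
    perror("execlp");
    return -1;
}
```

A caller would fork once per driver and, in the child, do something like
load_module_with_policy(POL_BIND, nodes_to_mask(local, 1), "driver-foo-M")
for the latency-sensitive drivers, or pass POL_INTERLEAVE with a mask of all
nodes for the throughput-oriented one.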

Another option, if you just want memory on a specific node, would be to
use kmalloc_node(). But for a multiple page allocation, this might not be
the best method.

As to how to find the node where the adapter is attached, from user
space you can look at /sys/devices/pci<pci-bus>/<pci-dev>/numa_node.
You can also find the 'local_cpus' [hex mask] and 'local_cpulist' in the
same directory. From within the driver, you can examine dev->numa_node.
Look at 'local_cpu{s|list}_show()' to see how to find the local cpus for
a device.
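
From user space, reading those sysfs attributes could look roughly like the
sketch below. The function names read_sysfs_int() and cpulist_count() are
invented for this example, and the parser only assumes the usual cpulist
text form such as "0-3,8-11"; the caller supplies the full path to the
attribute. Keep in mind that numa_node may read back as -1 when the kernel
has no node information for the device.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Read one integer from a sysfs attribute such as .../numa_node.
 * Returns 0 on success, -1 if the file is missing or unparsable. */
int read_sysfs_int(const char *path, int *value)
{
    assert(path != NULL && value != NULL);

    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    int ok = (fscanf(f, "%d", value) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Count the CPUs named by a cpulist string like "0-3,8-11".
 * Returns the count, or -1 if the string is malformed. */
int cpulist_count(const char *list)
{
    int total = 0;
    const char *p = list;

    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p || lo < 0)
            return -1;              /* no digits, or a negative CPU */
        long hi = lo;
        if (*end == '-') {          /* range form "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p || hi < lo)
                return -1;
        }
        total += (int)(hi - lo + 1);
        p = end;
        if (*p == ',')
            p++;                    /* more entries follow */
        else if (*p != '\0' && *p != '\n')
            return -1;              /* unexpected character */
    }
    return total;
}
```

A wrapper could call read_sysfs_int() on the device's numa_node file to pick
the node for the memory policy, and use the local_cpulist contents when
choosing CPUs for the MSI-X vectors.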

Note that if your device is attached to a memoryless node on x86, this
info won't be accurate. x86 arch code removes memoryless nodes and
reassigns cpus to other nodes that do have memory. I'm not sure what it
does with the dev->numa_node info. Maybe not a problem for you.

