Re: [PATCH] mm/dmapool: localize page allocations
From: Matthew Wilcox
Date: Thu May 17 2018 - 15:44:29 EST

On Thu, May 17, 2018 at 04:05:45PM -0400, Sinan Kaya wrote:
> On 5/17/2018 3:46 PM, Matthew Wilcox wrote:
> >> Remember that the CPU core that is running this driver is most probably on
> >> the same NUMA node as the device itself.
> > Umm ... says who? If my process is running on NUMA node 5 and I submit
> > an I/O, it should be allocating from a pool on node 5, not from a pool
> > on whichever node the device is attached to.
>
> OK, let's do an exercise. Maybe I'm missing something in the big picture.

Sure.

> If a user process is running on node 5, it submits some work to the hardware
> via the block layer, which is eventually invoked by a syscall.
>
> Whatever buffer the process is using gets copied into kernel space as
> it crosses the userspace/kernel boundary.
>
> The block layer packages a block request with the kernel pointers and makes a
> request to the NVMe driver for consumption.
>
> Last time I checked, the dma_alloc_coherent() API uses the locality information
> from the device, not from the CPU, for allocation.

Yes, it does. I wonder why that is; it doesn't actually make any sense.
It'd be far more sensible to allocate it on memory local to the user
than memory local to the device.

> While the metadata for dma_pool is allocated on the node of the currently
> running CPU core, the DMA buffer itself is already allocated on the device's
> node today, without my patch.

Umm ... dma_alloc_coherent memory is for metadata about the transfer, not
for the memory used for the transaction.

> I would think that you actually want to run the process at the same NUMA node
> as the CPU and device itself for performance reasons. Otherwise, performance
> expectations should be low.

That's foolish. Consider a database appliance with four sockets, each
with its own memory and I/O devices attached. You can't tell the user
to shard the database into four pieces and have each socket only work on
the quarter of the database that's available to each socket. They may
as well buy four smaller machines. The point of buying a large NUMA
machine is to use all of it.

Let's try a different example. I have a four-socket system with one
NVMe device with lots of hardware queues. Each CPU has its own queue
assigned to it. If I allocate all the PRP metadata on the socket with
the NVMe device attached to it, I'm sending a lot of coherency traffic
in the direction of that socket, in addition to the actual data. If the
PRP lists are allocated randomly on the various sockets, the traffic
is heading all over the fabric. If the PRP lists are allocated on the
local socket, the only time those lists move off this node is when the
device requests them.
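
The three placements above can be sketched as a toy model. This is only
an illustration under stated assumptions: one I/O submitted per node, one
node hosting the device, and a fixed neighbour-node placement standing in
for "random". The names placement, traffic, list_node and count_traffic
are hypothetical and are not kernel APIs. The model counts which accesses
cross the interconnect, not what they cost.

```c
/* Hypothetical placement policies for a per-I/O PRP list. */
enum placement { ON_DEVICE_NODE, ON_SUBMITTER_NODE, SCATTERED };

struct traffic {
	int cpu_remote;    /* lists built by a CPU on a different node */
	int device_remote; /* lists fetched by the device from a different node */
};

/* Node that holds the PRP list for an I/O submitted from cpu_node. */
int list_node(enum placement p, int cpu_node, int dev_node, int nr_nodes)
{
	switch (p) {
	case ON_DEVICE_NODE:
		return dev_node;
	case ON_SUBMITTER_NODE:
		return cpu_node;
	default:
		/* Assumption: stand-in for "random", always a neighbour node. */
		return (cpu_node + 1) % nr_nodes;
	}
}

/* One I/O per node: count the accesses that cross the interconnect. */
struct traffic count_traffic(enum placement p, int dev_node, int nr_nodes)
{
	struct traffic t = { 0, 0 };

	for (int cpu_node = 0; cpu_node < nr_nodes; cpu_node++) {
		int mem_node = list_node(p, cpu_node, dev_node, nr_nodes);

		if (mem_node != cpu_node)
			t.cpu_remote++;    /* CPU writes the list remotely */
		if (mem_node != dev_node)
			t.device_remote++; /* device reads the list remotely */
	}
	return t;
}
```

For a four-node system with the device on node 0,
count_traffic(ON_SUBMITTER_NODE, 0, 4) gives cpu_remote = 0 and
device_remote = 3: the lists leave their node only when the device
fetches them. ON_DEVICE_NODE gives the mirror image (cpu_remote = 3,
device_remote = 0), with every remote CPU access aimed at node 0.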