Re: [PATCH 0/3] have pooled sunrpc services make more intelligent allocations

From: Jeff Layton
Date: Tue Jun 03 2008 - 13:43:19 EST


On Tue, 03 Jun 2008 11:53:42 -0500
Tom Tucker <tom@xxxxxxxxxxxxxxxxxxxxx> wrote:

> Jeff:
>
> This brings up an interesting issue with the RDMA transport and
> RDMA_READ. RDMA_READ is submitted as part of fetching an RPC from the
> client (e.g. NFS_WRITE). The xpo_recvfrom function doesn't block waiting
> for the RDMA_READ to complete, but rather queues the RPC for subsequent
> processing when the I/O completes and returns 0.
>
> I can use these new services to allocate CPU local pages for this I/O.
> So far, so good. However, when the I/O completes, and the transport is
> rescheduled for subsequent RPC completion processing, the pool/CPU that
> is elected doesn't have any affinity for the CPU on which the I/O was
> initially submitted. I think this means that the svc_process/reply steps
> may occur on a CPU far away from the memory in which the data resides.
>
> Am I making sense here? If so, any thoughts on what could/should be
> done?
>
> Thanks,
> Tom
>

I confess I didn't think hard about the RDMA case here (and haven't
been paying as much attention as I probably should to the design of
it). So take my thoughts with a large chunk of salt...

On a NUMA box, the pages have to live _somewhere_ and some CPUs will be
closer to them than others. If we're concerned about making sure that
the post-RDMA_READ processing is done on a CPU close to the memory,
then we don't have much choice but to try to make sure that this
processing is only done on CPUs that are close to that memory.

Assuming that this post-processing is done by nfsd, I suppose we'd need
to tag the post-RDMA_READ RPC with a poolid or something and make sure
that only nfsds running on CPUs close to the memory pick it up. Perhaps
there could be a per-pool queue for these RPCs or something...
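
To make the per-pool queue idea a bit more concrete, here is a rough
user-space sketch. Everything in it is hypothetical: struct deferred_rpc,
struct pool_queue, NPOOLS, deferred_enqueue() and deferred_dequeue() are
made-up names for illustration, not existing sunrpc code, and there is no
locking, so it only models the queueing policy. The assumption is that the
pool id is recorded when the RDMA_READ is submitted and that only threads
from that pool pull the RPC off the queue once the I/O completes.

```c
#include <assert.h>
#include <stddef.h>

#define NPOOLS 4	/* hypothetical: e.g. one pool per NUMA node */

/* Hypothetical stand-in for a deferred RPC; not the kernel's svc_rqst. */
struct deferred_rpc {
	unsigned int pool_id;	/* pool near the pages used for the RDMA_READ */
	int xid;		/* illustrative identifier only */
	struct deferred_rpc *next;
};

/* One FIFO of completed-but-unprocessed RPCs per pool. */
struct pool_queue {
	struct deferred_rpc *head;
	struct deferred_rpc *tail;
};

static struct pool_queue queues[NPOOLS];

/*
 * Would be called when the RDMA_READ completes: append the RPC to the
 * queue of the pool that was recorded at submission time.
 */
static void deferred_enqueue(struct deferred_rpc *rpc)
{
	struct pool_queue *q;

	assert(rpc->pool_id < NPOOLS);
	q = &queues[rpc->pool_id];

	rpc->next = NULL;
	if (q->tail)
		q->tail->next = rpc;
	else
		q->head = rpc;
	q->tail = rpc;
}

/*
 * Would be called by a service thread belonging to my_pool. Returns the
 * oldest RPC queued for that pool, or NULL if there is none. Threads in
 * other pools never see it.
 */
static struct deferred_rpc *deferred_dequeue(unsigned int my_pool)
{
	struct pool_queue *q;
	struct deferred_rpc *rpc;

	assert(my_pool < NPOOLS);
	q = &queues[my_pool];

	rpc = q->head;
	if (rpc) {
		q->head = rpc->next;
		if (!q->head)
			q->tail = NULL;
		rpc->next = NULL;
	}
	return rpc;
}
```

In this sketch a thread in the wrong pool simply gets NULL back, which is
the strict "wait for the right nfsd" policy. Letting it fall back to
another pool's queue would be the alternative policy.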

Either way, the big question is whether that will be a net win or loss
for throughput. That is, are we better off waiting for the right nfsd to
become available, or allowing the first nfsd that becomes available to
make the crosscalls needed to do the RPC? It's hard to say...

In the near term, I doubt this patchset will harm the RDMA case. After
all, the distribution of memory allocations is pretty lumpy now. On
a NUMA box with RDMA you're probably doing a lot of crosscalls with
the current code.

--
Jeff Layton <jlayton@xxxxxxxxxx>