Re: [RFC PATCH v2 6/7] lib/persubnode: Introducing a simple per-subnode APIs

From: Tejun Heo
Date: Tue Jul 12 2016 - 10:27:37 EST


Hello,

On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grain for distributed resources whereas
> a per-node based allocation may be too coarse as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
>
> This patch introduces a simple per-subnode APIs where each of the
> distributed resources will be shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence requires the same amount of memory as if the percpu APIs
> are used. However, it helps to reduce the total number of separate
> resources that needed to be managed. As a result, it can speed up code
> that need to iterate all the resources compared with using the percpu
> APIs. Cacheline contention, however, will increases slightly as each
> resource is shared by more than one CPU. As long as the number of CPUs
> in each subnode is small, the performance impact won't be significant.
>
> In this patch, at most 2 sibling groups can be put into a subnode. For
> an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is enabled
> and 2 when it is not.

I understand that there's a trade-off between local access and global
traversing and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary. What's the use case? What are the
numbers? Why are global traversals often enough to matter so much?

Thanks.

--
tejun