Hello,
On Mon, Jul 11, 2016 at 01:32:11PM -0400, Waiman Long wrote:
> The percpu APIs are extensively used in the Linux kernel to reduce
> cacheline contention and improve performance. For some use cases, the
> percpu APIs may be too fine-grained for distributed resources, whereas
> a per-node-based allocation may be too coarse, as we can have dozens
> of CPUs in a NUMA node in some high-end systems.
>
> This patch introduces simple per-subnode APIs where each of the
> distributed resources is shared by only a handful of CPUs within
> a NUMA node. The per-subnode APIs are built on top of the percpu APIs
> and hence require the same amount of memory as if the percpu APIs
> were used. However, they help to reduce the total number of separate
> resources that need to be managed. As a result, they can speed up
> code that needs to iterate over all the resources compared with using
> the percpu APIs. Cacheline contention, however, will increase
> slightly as each resource is shared by more than one CPU. As long as
> the number of CPUs in each subnode is small, the performance impact
> won't be significant.
>
> In this patch, at most 2 sibling groups can be put into a subnode.
> For an x86-64 CPU, at most 4 CPUs will be in a subnode when HT is
> enabled and 2 when it is not.

I understand that there's a trade-off between local access and global
traversal, and you're trying to find a sweet spot between the two, but
this seems pretty arbitrary. What's the use case? What are the
numbers? Why are global traversals frequent enough to matter so much?
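
To make sure I'm reading the layering right, something roughly like
the sketch below is what I picture: per-subnode data is just the
percpu copy of a designated "leader" CPU in each subnode, so memory
usage matches plain percpu while iteration only has to visit one CPU
per subnode. The names here (per_subnode_ptr, subnode_leader_cpu,
subnode_counter) are made up for illustration, not the interface the
patch actually proposes.

#include <linux/atomic.h>
#include <linux/cpumask.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/percpu.h>
#include <linux/smp.h>

/*
 * Hypothetical map from each CPU to the leader CPU of its subnode,
 * assumed to be filled in at boot from the CPU topology.
 */
static DEFINE_PER_CPU(int, subnode_leader_cpu);

/* Access a given CPU's subnode instance of a percpu allocation. */
#define per_subnode_ptr(ptr, cpu) \
	per_cpu_ptr((ptr), per_cpu(subnode_leader_cpu, (cpu)))

/* Example user: a counter with one live instance per subnode. */
static atomic_long_t __percpu *subnode_counter;

static int __init subnode_counter_init(void)
{
	/* Same footprint as a plain percpu allocation. */
	subnode_counter = alloc_percpu(atomic_long_t);
	return subnode_counter ? 0 : -ENOMEM;
}

static void subnode_counter_inc(void)
{
	int cpu = get_cpu();

	/* Shared by the few CPUs in this subnode, so use an atomic. */
	atomic_long_inc(per_subnode_ptr(subnode_counter, cpu));
	put_cpu();
}

static unsigned long subnode_counter_sum(void)
{
	unsigned long sum = 0;
	int cpu;

	/* Only leader CPUs hold live data; skip the followers. */
	for_each_possible_cpu(cpu) {
		if (per_cpu(subnode_leader_cpu, cpu) != cpu)
			continue;
		sum += atomic_long_read(per_cpu_ptr(subnode_counter, cpu));
	}
	return sum;
}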