Re: [PATCH v2 1/4] sched/rt: Optimize cpupri_vec layout to mitigate cache line contention
From: Peter Zijlstra
Date: Thu Apr 02 2026 - 06:43:39 EST
On Fri, Mar 27, 2026 at 10:17:13AM +0000, Deng, Pan wrote:
> >
> > On Tue, Mar 24, 2026 at 09:36:14AM +0000, Deng, Pan wrote:
> >
> > > Regarding this patch, yes, using cacheline alignment could increase
> > > memory usage.
> > > After internal discussion, we are considering an alternative that
> > > mitigates the memory waste: use kmalloc() to allocate the count in a
> > > separate allocation rather than placing the count and cpumask together
> > > in this structure. The rationale is that writes to the counter and
> > > reads of the cpumask would then land in different memory, which could
> > > reduce false sharing; in addition, the slab allocator may place the
> > > objects in different cache lines, reducing cache contention.
> > > The drawback of dynamically allocating the counter is that we then
> > > have to manage the counters' lifecycle.
> > > Could you please advise whether sticking with the current
> > > cacheline-aligned attribute or switching to kmalloc() is preferred?
> >
> > Well, you'd have to allocate a full cacheline anyway. If you allocate N
> > 4-byte (counter) objects, there's a fair chance they end up in the same
> > cacheline (it's a SLAB after all) and then you're back to having a ton
> > of false sharing.
> >
> > Anyway, for your specific workload, why isn't partitioning a viable
> > solution? It would not need any kernel modifications and would get rid
> > of the contention entirely.
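For reference, the partitioning suggested above can be done from userspace with the cgroup v2 cpuset controller; a rough sketch (example group name and CPU range are made up, and this assumes cgroup2 is mounted at /sys/fs/cgroup with the cpuset controller available):

```shell
# Enable the cpuset controller for child groups (assumes root privileges).
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control

# Create a group, give it a dedicated CPU range, and turn it into a
# scheduling partition so those CPUs are isolated from the root domain.
mkdir /sys/fs/cgroup/rt-part
echo 2-5  > /sys/fs/cgroup/rt-part/cpuset.cpus
echo root > /sys/fs/cgroup/rt-part/cpuset.cpus.partition
```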
>
> Thank you very much for pointing this out.
>
> We understand cpuset partitioning would eliminate the contention.
> However, in managed container platforms (e.g., Kubernetes), users can
> obtain RT capabilities for their workloads via CAP_SYS_NICE, but they
> don't have host-level privileges to create cpuset partitions.
So because Kubernetes is shit, you're going to patch the kernel? Isn't
that backwards? Should you not instead try and fix this kubernetes
thing?