Re: [PATCH] sched_ext: Separate lock and first_task into distinct cache lines in scx_dispatch_q
From: David CARLIER
Date: Sat Feb 28 2026 - 16:28:02 EST
Hi Tejun,
You're right, I got the access pattern wrong. Looking at it more
carefully, first_task is written via rcu_assign_pointer() on every
enqueue and on dequeues when the removed task is the head — all under
dsq->lock. Since the lock acquisition already brings the cache line
into exclusive state, writing first_task on the same line is
essentially free. The only lockless reader is scx_bpf_dsq_nr_queued(),
which isn't a hot path.
Understood on requiring experimental data going forward. I'll make
sure to back any performance-related patches with benchmark numbers
and profiling output (perf c2c / perf stat).
Sorry for the noise (again..).
On Sat, 28 Feb 2026 at 17:28, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> On Sat, Feb 28, 2026 at 01:06:47PM +0000, David Carlier wrote:
> > lock (write-heavy) and first_task (read-mostly, lockless RCU peek) share
> > the same cache line in struct scx_dispatch_q. Every lock acquire/release
> > by a dispatching CPU invalidates the line for all CPUs performing
> > lockless first_task peeks, causing unnecessary cache coherence traffic,
> > especially across NUMA nodes.
> >
> > Add ____cacheline_aligned_in_smp to first_task to place it on its own
> > cache line, eliminating this false sharing on SMP systems. On
> > uniprocessor builds the annotation is a no-op, so no space is wasted.
> >
> > On SMP, the trade-off is increased struct size: each scx_dispatch_q
> > grows by up to ~56 bytes of padding. There are two instances embedded
> > per-CPU in scx_rq (local_dsq and bypass_dsq), plus any dynamically
> > allocated custom DSQs, so the total overhead scales with the number of
> > CPUs and active DSQs.
>
> But first_task is read-mostly. How could it be? David, from now on, I'm not
> going to apply these patches unless you provide backing experimental data.
>
> Thanks.
>
> --
> tejun