Re: [PATCH] sched_ext: optimize sched_ext_entity layout for cache locality
From: David CARLIER
Date: Tue Feb 24 2026 - 13:24:22 EST
To: Tejun, David, linux-kernel
Thanks for merging. I honestly haven't run a formal benchmark yet, but
here is the pahole output before and after the patch, both built from
the same tree (compiled with "make kernel/sched/core.o", which pulls in
sched_ext_entity via sched/ext.h):
Before:
struct sched_ext_entity {
        struct scx_dispatch_q *    dsq;                  /*     0     8 */
        struct scx_dsq_list_node   dsq_list;             /*     8    24 */
        struct rb_node             dsq_priq;             /*    32    24 */
        u32                        dsq_seq;              /*    56     4 */
        u32                        dsq_flags;            /*    60     4 */
        /* --- cacheline 1 boundary (64 bytes) --- */
        u32                        flags;                /*    64     4 */
        u32                        weight;               /*    68     4 */
        s32                        sticky_cpu;           /*    72     4 */
        s32                        holding_cpu;          /*    76     4 */
        s32                        selected_cpu;         /*    80     4 */
        u32                        kf_mask;              /*    84     4 */
        struct task_struct *       kf_tasks[2];          /*    88    16 */
        atomic_long_t              ops_state;            /*   104     8 */
        struct list_head           runnable_node;        /*   112    16 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        long unsigned int          runnable_at;          /*   128     8 */
        u64                        core_sched_at;        /*   136     8 */
        u64                        ddsp_dsq_id;          /*   144     8 */
        u64                        ddsp_enq_flags;       /*   152     8 */
        u64                        slice;                /*   160     8 */
        u64                        dsq_vtime;            /*   168     8 */
        bool                       disallow;             /*   176     1 */

        /* XXX 7 bytes hole, try to pack */

        struct cgroup *            cgrp_moving_from;     /*   184     8 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        struct list_head           tasks_node;           /*   192    16 */

        /* size: 208, cachelines: 4, members: 23 */
        /* sum members: 201, holes: 1, sum holes: 7 */
};
dsq sits at offset 0 (cacheline 0), ops_state at offset 104
(cacheline 1), and ddsp_dsq_id/ddsp_enq_flags at offsets 144-152
(cacheline 2) — three cache lines touched on every do_enqueue_task(),
finish_dispatch(), and direct_dispatch() call.
After:
struct sched_ext_entity {
        struct scx_dispatch_q *    dsq;                  /*     0     8 */
        atomic_long_t              ops_state;            /*     8     8 */
        u64                        ddsp_dsq_id;          /*    16     8 */
        u64                        ddsp_enq_flags;       /*    24     8 */
        struct scx_dsq_list_node   dsq_list;             /*    32    24 */
        struct rb_node             dsq_priq;             /*    56    24 */
        /* --- cacheline 1 boundary (64 bytes) was 16 bytes ago --- */
        u32                        dsq_seq;              /*    80     4 */
        u32                        dsq_flags;            /*    84     4 */
        u32                        flags;                /*    88     4 */
        u32                        weight;               /*    92     4 */
        s32                        sticky_cpu;           /*    96     4 */
        s32                        holding_cpu;          /*   100     4 */
        s32                        selected_cpu;         /*   104     4 */
        u32                        kf_mask;              /*   108     4 */
        struct task_struct *       kf_tasks[2];          /*   112    16 */
        /* --- cacheline 2 boundary (128 bytes) --- */
        struct list_head           runnable_node;        /*   128    16 */
        long unsigned int          runnable_at;          /*   144     8 */
        u64                        core_sched_at;        /*   152     8 */
        u64                        slice;                /*   160     8 */
        u64                        dsq_vtime;            /*   168     8 */
        bool                       disallow;             /*   176     1 */

        /* XXX 7 bytes hole, try to pack */

        struct cgroup *            cgrp_moving_from;     /*   184     8 */
        /* --- cacheline 3 boundary (192 bytes) --- */
        struct list_head           tasks_node;           /*   192    16 */

        /* size: 208, cachelines: 4, members: 23 */
        /* sum members: 201, holes: 1, sum holes: 7 */
};
All four hot-path fields now sit within the first 32 bytes of
cacheline 0. Struct size and total cacheline count are unchanged (208
bytes, 4 cachelines) — it is purely a field reorder.
If you want, I can follow up with perf stat cache-miss numbers
(hackbench/schbench under scx_simple) once I can test on appropriate
hardware.
On Tue, 24 Feb 2026 at 17:43, Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> On Tue, Feb 24, 2026 at 05:56:37AM +0000, David Carlier wrote:
> > Reorder struct sched_ext_entity to place ops_state, ddsp_dsq_id, and
> > ddsp_enq_flags immediately after dsq. These fields are accessed together
> > in the do_enqueue_task() and finish_dispatch() hot paths but were
> > previously spread across three different cache lines. Grouping them on
> > the same cache line reduces cache misses on every enqueue and dispatch
> > operation.
> >
> > Signed-off-by: David Carlier <devnexen@xxxxxxxxx>
>
> Were you able to measure any different by any chance?
>
> Thanks.
>
> --
> tejun