[PATCH] sched_ext: add unlikely() hints in do_enqueue_task() hot path

From: David Carlier

Date: Thu Feb 26 2026 - 12:57:36 EST


Add unlikely() branch hints to the error/bypass checks in
do_enqueue_task() that are rarely taken during normal operation:
offline CPU, bypass mode, exiting task, and migration-disabled task.

Signed-off-by: David Carlier <devnexen@xxxxxxxxx>
---
CLAUDE.md | 158 +++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/ext.c | 12 ++--
2 files changed, 164 insertions(+), 6 deletions(-)
create mode 100644 CLAUDE.md

diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 000000000000..e892eeea804e
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,158 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Important Rules
+
+- **Do NOT modify code directly.** Only analyze, explain, and suggest changes. The user writes all code themselves.
+
+## Repository Overview
+
+This is the Linux kernel tree with the **sched_ext** subsystem — a BPF-based extensible scheduler class that allows scheduling policies to be implemented as BPF programs and loaded/unloaded at runtime. The kernel falls back to the default fair-class scheduler on any error or when the BPF scheduler exits.
+
+## Build Commands
+
+### Kernel (requires CONFIG_SCHED_CLASS_EXT=y)
+```bash
+# Required Kconfig options:
+# CONFIG_BPF=y CONFIG_SCHED_CLASS_EXT=y CONFIG_BPF_SYSCALL=y
+# CONFIG_BPF_JIT=y CONFIG_DEBUG_INFO_BTF=y
+make -j$(nproc)
+```
+
+### Example BPF schedulers (tools/sched_ext/)
+```bash
+make -j$(nproc) -C tools/sched_ext # build all
+make -C tools/sched_ext scx_simple # build one scheduler
+make -C tools/sched_ext clean
+```
+Output goes to `tools/sched_ext/build/bin/`. Requires clang >= 16, pahole >= 1.25. The build auto-generates `vmlinux.h` from the first available vmlinux (kernel tree root, `/sys/kernel/btf/vmlinux`, or `/boot/vmlinux-$(uname -r)`).
+
+### Selftests (tools/testing/selftests/sched_ext/)
+```bash
+make -j$(nproc) -C tools/testing/selftests/sched_ext
+# Run all tests:
+tools/testing/selftests/sched_ext/runner
+```
+
+## Architecture
+
+### Kernel-side (kernel/sched/)
+- **`ext.c`** — Core sched_ext implementation: BPF scheduler loading/unloading, dispatch queue (DSQ) management, all `scx_bpf_*` kfunc helpers callable from BPF
+- **`ext_idle.c`** — Built-in idle CPU tracking and selection (per-node/global idle cpumasks)
+- **`ext_internal.h`** — Internal data structures: `struct scx_dispatch_q`, task states, exit codes, config flags
+- **`ext.h`** — Kernel-internal header with scheduler hook declarations (`scx_tick`, `scx_enqueue`, etc.) and no-op stubs when `CONFIG_SCHED_CLASS_EXT` is disabled
+
+### Public header
+- **`include/linux/sched/ext.h`** — Defines `struct sched_ext_ops` (the BPF struct_ops table), `struct sched_ext_entity` (per-task state), and all constants/flags
+
+### BPF scheduler interface
+BPF schedulers implement callbacks in `struct sched_ext_ops` via `SEC(".struct_ops")`. Key callbacks: `select_cpu`, `enqueue`, `dequeue`, `dispatch`, `init`, `exit`. The kernel communicates with BPF through kfuncs prefixed `scx_bpf_*` (e.g., `scx_bpf_dsq_insert()`, `scx_bpf_select_cpu_dfl()`, `scx_bpf_pick_idle_cpu()`).
+
+### Dispatch Queues (DSQs)
+Central abstraction bridging the scheduler core and BPF:
+- `SCX_DSQ_GLOBAL` — Global FIFO queue
+- `SCX_DSQ_LOCAL` / `SCX_DSQ_LOCAL_ON | cpu` — Per-CPU local queues
+- Custom DSQs created with `scx_bpf_create_dsq()`
+
+A CPU runs tasks from its local DSQ; if empty, it pulls from the global DSQ, then calls `ops.dispatch()`.
+
+### Example schedulers (tools/sched_ext/)
+Each scheduler is a pair: `scx_foo.bpf.c` (BPF program) + `scx_foo.c` (userspace loader). Available schedulers: `scx_simple`, `scx_qmap`, `scx_central`, `scx_flatcg`, `scx_pair`, `scx_sdt`, `scx_cpu0`, `scx_userland`.
+
+Shared headers live in `tools/sched_ext/include/scx/`:
+- `common.bpf.h` — BPF kfunc declarations, helper macros
+- `common.h` — Userspace utilities (loading, stats printing)
+- `compat.bpf.h` / `compat.h` — Cross-kernel-version compatibility
+- `user_exit_info.h` / `user_exit_info.bpf.h` — Exit info shared between BPF and userspace
+
+### Selftest framework (tools/testing/selftests/sched_ext/)
+Tests follow a `*.bpf.c` + `*.c` pair pattern. Each test registers via `REGISTER_SCX_TEST()` (ELF constructor) and implements `setup`/`run`/`cleanup` returning `SCX_TEST_PASS`/`SCX_TEST_SKIP`/`SCX_TEST_FAIL`. The `runner` binary aggregates and executes all registered tests. Assertion macros: `SCX_FAIL_IF`, `SCX_EQ`, `SCX_GT`, `SCX_GE`, `SCX_LT`, `SCX_LE`, `SCX_ASSERT`.
+
+## Key Conventions
+
+- BPF struct_ops callbacks use `BPF_STRUCT_OPS()` / `BPF_STRUCT_OPS_SLEEPABLE()` macros
+- The sched_ext ABI between kernel and BPF schedulers has **no stability guarantees** across kernel versions
+- Schedulers must be compiled with `-target bpf` and linked through bpftool skeleton generation (`.bpf.c` → `.bpf.o` → `.bpf.skel.h`)
+- CFLAGS include `-Wall -Werror` for both tools and selftests
+- Production-ready schedulers live in the separate [sched-ext/scx](https://github.com/sched-ext/scx) repository; the in-tree ones are examples
+- Commit messages must include a `Signed-off-by:` line (use `git commit -s`)
+
+## Known Bugs in tools/sched_ext/
+
+### `common.h` `SCX_BUG` reads errno after fprintf
+The `SCX_BUG` macro calls `fprintf` before checking `errno`, but `fprintf` itself may clobber `errno`. The value should be saved before the first `fprintf` call.
+
+### `scx_simple` / `scx_cpu0` VLA in `read_stats`
+`read_stats()` declares `__u64 cnts[2][nr_cpus]`, a variable-length array whose stack footprint grows with CPU count; at 4096 CPUs that is already 64 KB of stack.
+
+## Submitted Patches (pending upstream review)
+
+### `scx_idle_init_masks()` NUMA OOB fix
+`scx_idle_node_masks` was allocated with `num_possible_nodes()` (count) but indexed by node IDs via `for_each_node()`. On non-contiguous NUMA topologies, node IDs can exceed the array size. Fixed by allocating with `nr_node_ids`. Branch: `numa_id_alloc_fix`.
+
+### `sched_ext_entity` cache line layout optimization
+Reordered `ops_state`, `ddsp_dsq_id`, and `ddsp_enq_flags` to sit immediately after `dsq` in `struct sched_ext_entity` (`include/linux/sched/ext.h`). These fields are accessed together in the `do_enqueue_task()` and `finish_dispatch()` hot paths but were previously spread across three different cache lines. Branch: `sched_ext_entity_layout_upd`.
+
+### TOCTOU on `p->scx.dsq` in `scx_dump_task()` fix
+Used `READ_ONCE()` to capture `p->scx.dsq` into a local variable before dereferencing, preventing another CPU from NULLing the pointer between check and use. Branch: `scx_dump_concur_fix`.
+
+### `SCX_EFLAG_INITIALIZED` no-op flag fix
+`SCX_EFLAG_INITIALIZED` in `enum scx_exit_flags` defaulted to 0, making the `|=` in `scx_ops_init()` a no-op. BPF schedulers could not distinguish whether `ops.init()` completed. Assigned `1LLU << 0`. Branch: `SCX_EFLAG_INITIALIZED_value`.
+
+### Direct `scx_root` dereference without RCU in dump paths fix
+`scx_dump_task()` and `scx_dump_state()` now use `rcu_dereference()` to read `scx_root` under RCU protection, with an early return if NULL, preventing NULL-deref during concurrent scheduler teardown. Branch: `scx_dump_concur_fix`.
+
+## Analyzing Struct Cache Line Layouts with pahole
+
+To verify cache line placement of struct fields (e.g. when reviewing or proposing layout optimizations), use `pahole` on a compiled `.o` file from the kernel tree.
+
+### Setup
+```bash
+# Need: pahole (from dwarves package), libdw-dev, CONFIG_DEBUG_INFO_DWARF5=y
+make defconfig
+scripts/config --enable CONFIG_SCHED_CLASS_EXT --enable CONFIG_DEBUG_INFO \
+ --enable CONFIG_DEBUG_INFO_DWARF5 --enable CONFIG_SCHED_CORE \
+ --enable CONFIG_EXT_GROUP_SCHED
+make olddefconfig
+```
+
+### Build a single .o and inspect
+```bash
+make prepare -j$(nproc)
+# ext.c may fail with newer GCC; core.o also includes sched_ext_entity
+make kernel/sched/core.o -j$(nproc)
+pahole -C sched_ext_entity kernel/sched/core.o
+```
+
+### Before/after comparison workflow
+1. Build the `.o` on the current branch, save pahole output
+2. Checkout the patched header (`git checkout <branch> -- include/linux/sched/ext.h`)
+3. Rebuild the same `.o`, run pahole again
+4. Restore with `git checkout master -- include/linux/sched/ext.h`
+
+The output shows field offsets, sizes, and cacheline boundaries — look for hot-path fields that cross `/* --- cacheline N boundary --- */` markers.
+
+### Running kernel (alternative)
+```bash
+# If the struct exists in the running kernel's BTF:
+pahole -C sched_ext_entity /sys/kernel/btf/vmlinux
+```
+
+## Identified Optimization Opportunities
+
+### `struct scx_dispatch_q` false sharing (`include/linux/sched/ext.h`)
+`lock` (write-heavy) and `first_task` (read-mostly, lockless RCU peek) share the same cache line. Separating them with `____cacheline_aligned_in_smp` would eliminate false sharing on dispatch.
+
+### Repeated `idle_cpumask(node)` indirection (`kernel/sched/ext_idle.c`)
+Multiple calls within the same function re-evaluate the conditional pointer dereference; the result should be cached in a local variable instead.
+
+### O(N^2) NUMA node traversal in `pick_idle_cpu_from_online_nodes()` (`kernel/sched/ext_idle.c`)
+Pre-computing per-CPU distance-ordered node arrays at init time would reduce this to O(N).
+
+### `flush_dispatch_buf` lock cycling (`kernel/sched/ext.c`)
+Each buffered task dispatched to a remote local DSQ causes a separate rq lock release/acquire cycle; batching by destination CPU would amortize lock overhead.
+
+### Missing `__always_inline` on hot helpers (`kernel/sched/ext_idle.c`)
+`idle_cpumask()`, `scx_cpu_node_if_enabled()`, `task_affinity_all()` are `static inline` but not `__always_inline`.
+
diff --git a/kernel/sched/ext.c b/kernel/sched/ext.c
index c18e81e8ef51..1048bb9934c5 100644
--- a/kernel/sched/ext.c
+++ b/kernel/sched/ext.c
@@ -1360,10 +1360,10 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
* is offline and are just running the hotplug path. Don't bother the
* BPF scheduler.
*/
- if (!scx_rq_online(rq))
+ if (unlikely(!scx_rq_online(rq)))
goto local;

- if (scx_rq_bypassing(rq)) {
+ if (unlikely(scx_rq_bypassing(rq))) {
__scx_add_event(sch, SCX_EV_BYPASS_DISPATCH, 1);
goto bypass;
}
@@ -1372,15 +1372,15 @@ static void do_enqueue_task(struct rq *rq, struct task_struct *p, u64 enq_flags,
goto direct;

/* see %SCX_OPS_ENQ_EXITING */
- if (!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
- unlikely(p->flags & PF_EXITING)) {
+ if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_EXITING) &&
+ p->flags & PF_EXITING)) {
__scx_add_event(sch, SCX_EV_ENQ_SKIP_EXITING, 1);
goto local;
}

/* see %SCX_OPS_ENQ_MIGRATION_DISABLED */
- if (!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
- is_migration_disabled(p)) {
+ if (unlikely(!(sch->ops.flags & SCX_OPS_ENQ_MIGRATION_DISABLED) &&
+ is_migration_disabled(p))) {
__scx_add_event(sch, SCX_EV_ENQ_SKIP_MIGRATION_DISABLED, 1);
goto local;
}
--
2.51.0