[PATCH 1/9] sched: add SCHED_WARN_ON()/SCHED_WARN_ON_ONCE()/SCHED_WARN()/SCHED_WARN_ONCE()

From: Rik van Riel

Date: Wed Jun 10 2026 - 22:24:16 EST


WARN_ON(), WARN_ON_ONCE(), WARN() and WARN_ONCE() emit at KERN_WARNING.
With a legacy (or boot) console registered, vprintk_emit() takes the
synchronous "legacy_direct" path: console_trylock_spinning() +
console_unlock() -> up(&console_sem) -> wake_up_process() ->
try_to_wake_up(), which acquires the woken task's ->pi_lock and its
rq->lock.

If the WARN fires while the current CPU already holds an rq->lock or a
->pi_lock -- the case for almost every WARN in the scheduling-class hot
paths -- this re-enters the scheduler and can deadlock (recursively on
rq->lock, or via the pi_lock/rq->lock ordering). The nbcon and klogd
wakeups are deferred via irq_work and are safe; only the legacy console
path is synchronous. Plain WARN*() emit at KERN_WARNING rather than
LOGLEVEL_SCHED, so they do not get the scheduler-safe deferral that
printk_deferred() messages get.

Add SCHED_WARN_ON()/SCHED_WARN_ON_ONCE() (and the SCHED_WARN()/
SCHED_WARN_ONCE() forms that carry a format message), which behave
exactly like their WARN*() counterparts but bracket the report in a
printk_deferred section, so the console output is handed to irq_work
instead of being emitted synchronously. The bracket is entered only on
the (cold) firing path, so the hot path cost is unchanged: just the
condition test (plus the lockdep assertion under CONFIG_PROVE_LOCKING).

printk_deferred_enter()/exit() toggle a per-CPU counter and must be
balanced on one CPU, so the caller must have preemption disabled. That
is always true while an rq->lock or ->pi_lock (both raw_spinlock_t,
which disable preemption) is held. A lockdep_assert_preemption_disabled()
guards against misuse from preemptible context and compiles away without
CONFIG_PROVE_LOCKING.
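
The balancing requirement can be illustrated with a small userspace
model. This is only a sketch, not kernel code: MODEL_NR_CPUS and every
model_* name are hypothetical, and a plain array stands in for the
per-CPU counter that printk_deferred_enter()/exit() toggle.

```c
/* Userspace model only: MODEL_NR_CPUS and all model_* names are hypothetical. */
#include <assert.h>

#define MODEL_NR_CPUS 2

/* One counter per modelled CPU, standing in for the per-CPU deferral state. */
static int model_defer_count[MODEL_NR_CPUS];

static void model_deferred_enter(int cpu) { model_defer_count[cpu]++; }
static void model_deferred_exit(int cpu)  { model_defer_count[cpu]--; }

/* Preemption disabled: enter and exit run on the same CPU and cancel out. */
static void model_balanced(void)
{
	model_deferred_enter(0);
	model_deferred_exit(0);
}

/* Preemptible caller that migrated from CPU 0 to CPU 1 between the calls. */
static void model_migrated(void)
{
	model_deferred_enter(0);
	model_deferred_exit(1);
}

static void model_counter_demo(void)
{
	model_balanced();
	assert(model_defer_count[0] == 0 && model_defer_count[1] == 0);

	model_migrated();
	/* CPU 0 is left marked as deferred, CPU 1 underflows. */
	assert(model_defer_count[0] == 1 && model_defer_count[1] == -1);
}
```

In the model, the migrated caller leaves both counters wrong, which is
the kind of imbalance the preemption-disabled requirement rules out.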

WARN_ON()/WARN_ON_ONCE() do not stringify their condition, so passing
the constant 1 in place of the already-evaluated condition produces
identical console output. The condition is evaluated exactly once and
its boolean result is returned, preserving the semantics for callers
that test it.
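
The evaluate-once and return-value behaviour can be checked with a
userspace model of the macro shape. This is a sketch under stated
assumptions, not the kernel implementation: every model_* / MODEL_* name
is hypothetical, a plain counter stands in for the printk_deferred
bracket, and it relies on GNU statement expressions (GCC or Clang).

```c
/* Userspace model only: all model_* and MODEL_* names are hypothetical. */
#include <assert.h>
#include <stdio.h>

static int model_deferred_depth;	/* stands in for the deferral counter */
static int model_reports;		/* warnings emitted */
static int model_reports_deferred;	/* of those, emitted inside the bracket */
static int model_evals;			/* how often the condition was evaluated */

static void model_warn(void)
{
	model_reports++;
	if (model_deferred_depth > 0)
		model_reports_deferred++;
	fprintf(stderr, "model WARN (depth %d)\n", model_deferred_depth);
}

/* Same shape as __SCHED_WARN_DEFERRED(): evaluate x once, bracket cold path. */
#define MODEL_SCHED_WARN_ON(x)						\
({									\
	int __ret = !!(x);						\
									\
	if (__ret) {							\
		model_deferred_depth++;					\
		model_warn();						\
		model_deferred_depth--;					\
	}								\
	__ret;								\
})

/* Condition with a visible side effect, to count evaluations. */
static int model_cond(int v)
{
	model_evals++;
	return v;
}

static void model_warn_demo(void)
{
	int quiet = MODEL_SCHED_WARN_ON(model_cond(0));
	int fired = MODEL_SCHED_WARN_ON(model_cond(42));

	assert(quiet == 0);	/* false condition: no report, returns 0 */
	assert(fired == 1);	/* any nonzero condition is normalized to 1 */
}
```

After model_warn_demo() the condition has been evaluated twice (once per
use), one report has been emitted, and that report was made inside the
bracket.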

No conversions are done here; this only adds the macros.

Signed-off-by: Rik van Riel <riel@xxxxxxxxxxx>
Assisted-by: Claude:claude-opus-4-8
---
kernel/sched/sched.h | 52 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7c2dea65edd..60739ccfc32f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
#include <linux/lockdep_api.h>
#include <linux/lockdep.h>
#include <linux/memblock.h>
+#include <linux/printk.h>
#include <linux/memcontrol.h>
#include <linux/minmax.h>
#include <linux/mm.h>
@@ -99,6 +100,57 @@ struct cpuidle_state;
#define TASK_ON_RQ_QUEUED 1
#define TASK_ON_RQ_MIGRATING 2

+/*
+ * SCHED_WARN_ON() / SCHED_WARN_ON_ONCE() / SCHED_WARN() / SCHED_WARN_ONCE():
+ * WARN_ON() / WARN_ON_ONCE() / WARN() / WARN_ONCE() variants that are safe to
+ * call while holding an rq->lock or a task's ->pi_lock.
+ *
+ * A plain WARN emits at KERN_WARNING. With a legacy console registered, the
+ * printk takes the synchronous path console_unlock() -> up(&console_sem) ->
+ * wake_up_process() -> try_to_wake_up(), which grabs ->pi_lock and rq->lock --
+ * and so deadlocks if such a lock is already held by the WARNing context.
+ *
+ * Bracket the report in a printk_deferred section so the console output is
+ * handed to irq_work instead. This is done only on the (cold) firing path, so
+ * the hot path keeps just the condition test. printk_deferred_enter()/exit()
+ * toggle a per-CPU counter and must be balanced on one CPU; the caller must
+ * therefore have preemption disabled, which is always true while an rq/pi
+ * raw_spinlock is held. The lockdep assert catches misuse from preemptible
+ * context and compiles away without CONFIG_PROVE_LOCKING.
+ */
+#define __SCHED_WARN_DEFERRED(__warn, x)				\
+({									\
+	int __ret = !!(x);						\
+									\
+	lockdep_assert_preemption_disabled();				\
+	if (unlikely(__ret)) {						\
+		printk_deferred_enter();				\
+		__warn(1);						\
+		printk_deferred_exit();					\
+	}								\
+	__ret;								\
+})
+
+#define SCHED_WARN_ON(x) __SCHED_WARN_DEFERRED(WARN_ON, x)
+#define SCHED_WARN_ON_ONCE(x) __SCHED_WARN_DEFERRED(WARN_ON_ONCE, x)
+
+/* As above, for the WARN()/WARN_ONCE() forms that carry a format message. */
+#define __SCHED_WARN_FMT_DEFERRED(__warn, x, fmt...)			\
+({									\
+	int __ret = !!(x);						\
+									\
+	lockdep_assert_preemption_disabled();				\
+	if (unlikely(__ret)) {						\
+		printk_deferred_enter();				\
+		__warn(1, fmt);						\
+		printk_deferred_exit();					\
+	}								\
+	__ret;								\
+})
+
+#define SCHED_WARN(x, fmt...) __SCHED_WARN_FMT_DEFERRED(WARN, x, fmt)
+#define SCHED_WARN_ONCE(x, fmt...) __SCHED_WARN_FMT_DEFERRED(WARN_ONCE, x, fmt)
+
extern __read_mostly int scheduler_running;

extern unsigned long calc_load_update;
--
2.53.0-Meta