Re: [PATCH] sched/proxy_exec: Limit find_proxy_task() chain depth to prevent CPU hang

From: K Prateek Nayak

Date: Mon Apr 20 2026 - 23:19:04 EST

Hello John, Zhidao,

On 4/21/2026 7:57 AM, John Stultz wrote:
>> With this fix:
>>
>> [ 111.758150] sched/pe: proxy chain depth exceeded 64, possible deadlock cycle involving pid 120
>> [ 111.758150] WARNING: CPU: 0 PID: 119 at kernel/sched/core.c:7339 __schedule+0x1e6e/0x1e80
>> ...
>> [ 112.694277] pe_cycle_test: still alive after 1s (CPU not hung)
>>
>> Without this fix, an NMI watchdog (nmi_watchdog=1, watchdog_thresh=15)
>> fires a hard LOCKUP on CPU 0 with RIP in do_raw_spin_lock, called from
>> __schedule, confirming the CPU spins inside find_proxy_task() holding
>> rq->lock with no forward progress:
>>
>> [ 109.951781] watchdog: CPU0: Watchdog detected hard LOCKUP on cpu 0
>> [ 109.951781] RIP: 0010:do_raw_spin_lock+0x3e/0xb0
>> [ 109.951781] Call Trace:
>> [ 109.951781] __schedule+0x11e7/0x1e10
>> [ 109.951781] schedule_preempt_disabled+0x18/0x30
>> [ 109.951781] __mutex_lock+0x6f0/0xac0
>> [ 109.951781] pe_test_thread_a+0x9c/0xe0
>
>
> So, I guess I'd be curious what happens without proxy-exec.

I think you'd hit the hung task detector in that case since the
uninterruptible sleep has lingered for too long.

>
> My sense is that if you have a mutex lock cycle today without proxy
> execution you'll just deadlock and get a similar hard LOCKUP warning.
> I assume you'd get a LOCKDEP splat as well if that was enabled in
> either case, no?
>
> So I'm not sure if I see a whole lot of benefit to rescheduling idle
> over and over to keep the system sort of alive when that cpu is not
> going to make any progress.
>
> A few more thoughts below...
>
>> Fixes: 7de9d4f94638 ("sched: Start blocked_on chain processing in find_proxy_task()")
>> Signed-off-by: zhidao su <suzhidao@xxxxxxxxxx>
>> ---
>> kernel/sched/core.c | 17 +++++++++++++++++
>> 1 file changed, 17 insertions(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index 3f3425c6b2f2..bafb59432f7f 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7310,6 +7310,17 @@ DEFINE_LOCK_GUARD_1(blocked_on_lock, struct blocked_on_lock,
>> * Returns the task that is going to be used as execution context (the one
>> * that is actually going to be run on cpu_of(rq)).
>> */
>> +/*
>> + * Limit proxy chain traversal depth to avoid infinite loops in pathological
>> + * cases (e.g., A waits for B's mutex while B waits for A's mutex). The
>> + * existing WARN_ON(owner == p) only catches immediate self-loops; multi-task
>> + * cycles like A->B->A are not detected without a depth counter.
>> + *
>> + * rt-mutex uses a similar guard (max_lock_depth = 1024). We use a smaller
>> + * limit since proxy chains are expected to be short in practice.
>> + */
>> +#define MAX_PROXY_CHAIN_DEPTH 64
>
> So while we'd hope proxy chains are short in most cases, there's no
> guarantee they would be different from rt-mutexes.
> In fact, with rwsem support, the chains could interleave across lock
> types, so I'd probably at least match the rt-mutex value.
>
>> static struct task_struct *
>> find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>> __must_hold(__rq_lockp(rq))
>> @@ -7318,11 +7329,17 @@ find_proxy_task(struct rq *rq, struct task_struct *donor, struct rq_flags *rf)
>> struct task_struct *owner = NULL;
>> bool curr_in_chain = false;
>> int this_cpu = cpu_of(rq);
>> + int chain_depth = 0;
>> struct task_struct *p;
>> int owner_cpu;
>>
>> /* Follow blocked_on chain. */
>> for (p = donor; task_is_blocked(p); p = owner) {
>> + if (++chain_depth > MAX_PROXY_CHAIN_DEPTH) {
>> + WARN_ONCE(1, "sched/pe: proxy chain depth exceeded %d, possible deadlock cycle involving pid %d\n",
>> + MAX_PROXY_CHAIN_DEPTH, p->pid);
>> + return proxy_resched_idle(rq);
>
> So at this point the cpu is going to be stuck: as soon as it
> switches to idle, it will call back into __schedule(), select the same
> donor task and traverse the same chain, and then reschedule idle
> and start again.
>
> So it seems to me like BUG() would be more appropriate here as the cpu
> is effectively deadlocked.
>
> I guess one could deactivate the selected blocked donor task, which
> would let the cpu continue to run other tasks, but the entire lock
> chain would eventually get deactivated and would never be woken up, so
> it would likely trip hung task warnings. So I would lean towards
> BUG() since lock cycles are a big no no (for non-ww_mutexes) and I'd
> fret that if you don't stop the system, folks will just ignore warnings
> and not really understand why things aren't working properly.
>
> But that's just my instinct.

I would second that but I can see someone having a "creative"
mutex_lock_interruptible() pattern that relies on the hung task splat
to then trigger something from userspace to selectively kill tasks.
(Insane? Yes! Possible? Also yes!)

As an alternative approach, when traversing blocked_on links, can we
start deactivating the chain if we encounter rq->donor again as owner
in the find_proxy_task() loop?

That way we go back to triggering the hung task detector and if someone
has a stack that depends on it, it'll continue to work fine while also
avoiding this lockup.
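
To make the idea concrete, here is a rough userspace sketch (not kernel
code, and not tested against the real scheduler). struct toy_task,
struct toy_mutex and walk_chain() are made-up names for illustration
only; the real loop would have to deal with rq locks, migration and the
actual deactivation path, none of which is modelled here:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; NOT the kernel's task_struct or struct mutex. */
struct toy_task;

struct toy_mutex {
	struct toy_task *owner;		/* NULL when unlocked */
};

struct toy_task {
	int pid;
	struct toy_mutex *blocked_on;	/* NULL when the task is runnable */
	bool on_rq;			/* models "still activated" */
};

enum walk_result { WALK_FOUND_OWNER, WALK_CYCLE };

/*
 * Follow blocked_on links starting at @donor.
 *
 * If the owner we reach is @donor itself, the chain is a cycle that
 * passes through the donor: mark every task on it as dequeued and
 * report WALK_CYCLE. Otherwise return the first task that is not
 * blocked (or whose lock has no owner) via @out.
 *
 * Limitation of this toy: a cycle that does not include @donor is not
 * detected here and would still need something like a depth cap.
 */
static enum walk_result walk_chain(struct toy_task *donor,
				   struct toy_task **out)
{
	struct toy_task *p, *owner;

	assert(donor != NULL && out != NULL);

	for (p = donor; p->blocked_on; p = owner) {
		owner = p->blocked_on->owner;

		if (!owner) {
			/* Lock was released; p can make progress itself. */
			*out = p;
			return WALK_FOUND_OWNER;
		}

		if (owner == donor) {
			/* Walk the cycle once more and deactivate each task. */
			struct toy_task *t = donor;

			do {
				struct toy_task *next = t->blocked_on->owner;

				t->on_rq = false;
				t = next;
			} while (t != donor);

			*out = NULL;
			return WALK_CYCLE;
		}
	}

	*out = p;
	return WALK_FOUND_OWNER;
}
```

In this model, once every task on the cycle is off the runqueue the CPU
is free to pick something else, and the blocked tasks are left for the
hung task detector to report.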

Thoughts?

>
> Anyway, thanks for the submission here! I'm excited to see more folks
> working and testing with proxy-exec!

+1. With the next batch of changes, when we hopefully drop the EXPERT
dependency, we'll probably see even wider usage and development ;-)

--
Thanks and Regards,
Prateek