Re: [WARNING] RCU stall in sock_def_readable()
From: Paul E. McKenney
Date: Fri Apr 17 2026 - 15:04:30 EST
On Fri, Apr 17, 2026 at 09:30:25AM -0400, Steven Rostedt wrote:
> On Fri, 17 Apr 2026 08:43:13 -0400
> Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
>
> > On Thu, 16 Apr 2026 17:16:11 -0700
> > "Paul E. McKenney" <paulmck@xxxxxxxxxx> wrote:
> >
> > > One "hail Mary" thought is to revert this guy and see if it helps:
> > >
> > > d41e37f26b31 ("rcu: Fix rcu_read_unlock() deadloop due to softirq")
> > >
> > > This commit fixes a bug, so we cannot revert it in mainline, but there
> > > is some reason to believe that there are more bugs beyond the one that
> > > it fixed, and it might have (through no fault of its own) made those
> > > other bugs more probable.
> > >
> > > Worth a try, anyway!
> >
> > Hail Marys are worth a try, but the reason they call it a hail Mary
> > is that it is unlikely to succeed :-p
True enough!
> > run test ssh -t root@tracetest "trace-cmd record -p function -e syscalls /work/c/hackbench_64 50"
> > ssh -t root@tracetest "trace-cmd record -p function -e syscalls /work/c/hackbench_64 50" ... [ 209.590500] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
> > [ 209.592620] rcu: Tasks blocked on level-0 rcu_node (CPUs 0-3): P3151/1:b..l
> > [ 209.595266] rcu: (detected by 0, t=6502 jiffies, g=29673, q=186 ncpus=4)
> > [ 209.597557] task:hackbench_64 state:R running task stack:0 pid:3151 tgid:3151 ppid:3144 task_flags:0x400000 flags:0x00080000
> > [ 209.601871] Call Trace:
> > [ 209.602852] <TASK>
> > [ 209.603752] __schedule+0x4ac/0x12f0
> > [ 209.605172] preempt_schedule_common+0x26/0xe0
> > [ 209.606755] ? preempt_schedule_thunk+0x16/0x30
> > [ 209.608337] preempt_schedule_thunk+0x16/0x30
> > [ 209.609973] ? _raw_spin_unlock_irqrestore+0x39/0x70
> > [ 209.611688] _raw_spin_unlock_irqrestore+0x5d/0x70
> > [ 209.613408] sock_def_readable+0x9c/0x2b0
> > [ 209.614841] unix_stream_sendmsg+0x2d7/0x710
> > [ 209.616420] sock_write_iter+0x185/0x190
> > [ 209.617934] vfs_write+0x457/0x5b0
> > [ 209.619242] ksys_write+0xc8/0xf0
> > [ 209.620532] do_syscall_64+0x117/0x1660
> > [ 209.621936] ? irqentry_exit+0xd9/0x690
> > [ 209.623319] entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > [ 209.625199] RIP: 0033:0x7f603e8e5190
> > [ 209.626628] RSP: 002b:00007ffd003f99c8 EFLAGS: 00000202 ORIG_RAX: 0000000000000001
> > [ 209.629304] RAX: ffffffffffffffda RBX: 00007ffd003f9b58 RCX: 00007f603e8e5190
> > [ 209.631710] RDX: 0000000000000001 RSI: 00007ffd003f99ef RDI: 0000000000000006
> > [ 209.634200] RBP: 00007ffd003f9a40 R08: 0011861580000000 R09: 0000000000000000
> > [ 209.636638] R10: 00007f603e8064d0 R11: 0000000000000202 R12: 0000000000000000
> > [ 209.639050] R13: 00007ffd003f9b70 R14: 00005637df126dd8 R15: 00007f603ea10020
> > [ 209.641600] </TASK>
> > Detected kernel crash!
> >
> >
> > That was with the revert :-(
>
> I went and looked at the configs that it used to see if that changed.
> One thing that stands out is that it used to use
> CONFIG_PREEMPT_VOLUNTARY. Now it's using CONFIG_PREEMPT_LAZY. I'm
> thinking that because preemption now doesn't happen until tasks go back
> to user space (and kernel threads do not preempt at all), that this
> could have delayed the RCU threads much longer.
Good point, that could add latency and delay on average, though
it should presumably reduce P99 latency/delay.
> I'm not sure why the stack trace is always the same. Maybe that's where
> hackbench causes the biggest delay? I'm going to switch it over to
> PREEMPT_FULL and see if that makes the warning go away. Oh, and when I
> logged into this box, I noticed that it triggered an OOM due to memory
> not being freed up fast enough.
Sounds as likely a cause as any.
> All that said, my config is full of a lot of debugging that has a high
> overhead, which makes this issue much more pronounced. It may not even
> be something to worry about. If switching to PREEMPT_FULL fixes it,
> then that may be all I do.
I am chasing one where switching to PREEMPT_FULL makes it worse, so
that would average things out. ;-)
> Configs that cause overhead:
>
> PROVE_LOCKING
> FTRACE_RECORD_RECURSION - keeps track of function trace recursion
> RING_BUFFER_VALIDATE_TIME_DELTAS - this causes a big overhead with tracing;
>                                    it tests the timestamps of every event.
>                                    This requires walking the sub-buffer page
>                                    and adding the time deltas of each event to
>                                    make sure it adds up to the current event.
>                                    That's an O(n^2) operation on the number of
>                                    events in the sub-buffer.
>
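To make the cost concrete, the validation walk described above looks roughly
like this (a user-space sketch only; the struct layout and names are
illustrative, not the actual ring-buffer code):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch (not the real kernel ring-buffer code): each
 * event on a sub-buffer page stores a time delta from the previous
 * event.  To validate event i, walk the page from the start and sum
 * the deltas, checking that the sum matches the event's expected
 * absolute timestamp.  Doing that for every event on the page is
 * O(n^2) in the number of events. */

struct event {
	unsigned long long delta;	/* time since previous event */
	unsigned long long ts;		/* expected absolute timestamp */
};

/* Returns the number of delta additions performed, or 0 on mismatch. */
static unsigned long validate_page(unsigned long long page_start,
				   const struct event *ev, size_t nr)
{
	unsigned long ops = 0;

	for (size_t i = 0; i < nr; i++) {
		unsigned long long ts = page_start;

		/* Re-walk the page from the start for each event. */
		for (size_t j = 0; j <= i; j++) {
			ts += ev[j].delta;
			ops++;
		}
		if (ts != ev[i].ts)
			return 0;
	}
	return ops;
}
```

Validating event i re-walks events 0..i, so a full page of n events costs
n(n+1)/2 delta additions, hence the quadratic blow-up on busy pages.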
> With the above overhead I do consider this one of those "Patient:
> Doctor, it hurts when I do this. Doctor: Then don't do that" moments.
> But this test has been running for years with no issues except for
> catching cases where the timestamp did get out of sync. Hence, I don't
> want to stop testing this. But if I can find the culprit, I can modify
> the test to avoid failing because of it.
Huh. One other possibility is that CONFIG_PREEMPT_NONE=y and its friend
CONFIG_PREEMPT_VOLUNTARY=y responded to RCU calls for help at any call
to cond_resched() or (in the voluntary case) might_sleep(). Isn't it
the case that with CONFIG_PREEMPT_LAZY=y, nothing happens until some
other task shows up on that CPU? Maybe RCU needs to be more aggressive
about invoking resched_cpu() when the grace period starts getting a bit
long in the tooth?
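The shape of that heuristic can be modeled in user space; this toy (the
threshold, struct, and helper names are all made up, and it is nothing like
the real RCU code) just shows the idea: once a grace period exceeds some age
limit, poke every holdout CPU rather than waiting for a voluntary preemption
point:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (user-space sketch, not kernel code): once a grace period
 * has been waiting longer than some threshold, force a reschedule on
 * every CPU still holding it up, instead of waiting for the holdout
 * task to reach a preemption point on its own. */

#define GP_AGE_LIMIT	100	/* jiffies; made-up threshold */

struct cpu_state {
	bool holdout;		/* CPU still blocking the grace period */
	bool resched_sent;	/* we already forced a reschedule */
};

/* Stand-in for resched_cpu(): just record that the CPU was poked. */
static void force_resched(struct cpu_state *cpu)
{
	cpu->resched_sent = true;
}

/* Returns how many holdout CPUs were poked this pass. */
static int check_gp_age(struct cpu_state *cpus, int ncpus,
			unsigned long gp_age_jiffies)
{
	int poked = 0;

	if (gp_age_jiffies <= GP_AGE_LIMIT)
		return 0;	/* grace period not yet long in the tooth */

	for (int i = 0; i < ncpus; i++) {
		if (cpus[i].holdout && !cpus[i].resched_sent) {
			force_resched(&cpus[i]);
			poked++;
		}
	}
	return poked;
}
```

The point of the model is only that with lazy preemption there is no
cond_resched()-style hook to piggyback on, so aging grace periods would need
an explicit kick like this.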
(I am very distracted today, so unlikely to make much progress on this.)
Thanx, Paul