Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995 enqueue_dl_entity (task blocked for more than 28262 seconds)
From: Vineeth Remanan Pillai
Date: Fri Dec 06 2024 - 11:57:49 EST
On Fri, Dec 6, 2024 at 10:18 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, Dec 04, 2024 at 01:47:44PM +0100, Ilya Maximets wrote:
> > Hi. It seems like I'm hitting some bug in the scheduler.
> >
> > I'm running some tests with Open vSwitch on v6.12 kernel and some time
> > 5 to 8 hours down the line I'm getting task blocked splats and I also
> > have a WARNING triggered in the scheduler code right before that:
> >
> > Dec 3 22:19:55 kernel: WARNING: CPU: 27 PID: 3391271 at kernel/sched/deadline.c:1995 enqueue_dl_entity
> >
> > I have a lot of processes (kernel threads and userpsace threads) stuck
> > in DN, Ds, D+ and D states. It feels like IO tasks are being scheduled,
> > but scheduler never picks them up or they are not being scheduled at all
> > for whatever reason, and threads waiting on these tasks are stuck.
> >
> > Dec 3 22:22:45 kernel: INFO: task khugepaged:330 blocked for more than 122 seconds.
> > Dec 3 22:22:45 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 122 seconds.
> > Dec 3 22:22:45 kernel: INFO: task mv:3483072 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task khugepaged:330 blocked for more than 245 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 245 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3480383 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3481787 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3482631 blocked for more than 122 seconds.
> > Dec 3 22:24:48 kernel: INFO: task mv:3483072 blocked for more than 245 seconds.
> > Dec 3 22:26:51 kernel: INFO: task khugepaged:330 blocked for more than 368 seconds.
> > ...
> > Dec 4 06:11:45 kernel: INFO: task khugepaged:330 blocked for more than 28262 seconds.
> >
> > I have two separate instances where this behavior is reproduced. One is mostly
> > around file systems, the other was more severe as multiple kernel threads got
> > stuck in netlink code. The traces do not have much in common, except that most
> > of blocked tasks are in scheduling. The system is also idle, nothing is really
> > running. Some of these tasks are holding resources that make other tasks to
> > block on those resources as well.
> >
> > I seem to be able to reproduce the issue, but it takes 5-8 hours to do so.
> >
>
> CC'ing a few more from my team as well.
>
> We haven't seen such an issue with the DL server, but we are also testing on
> slightly older kernels.
>
> Its coming from:
> WARN_ON_ONCE(on_dl_rq(dl_se));
>
Thanks for including me Joel :-)
I was able to reproduce this WARN_ON couple of days back with
syzkaller. dlserver's dl_se gets enqueued during a update_curr while
the dlserver is stopped. And subsequent dlserver start will cause a
double enqueue. On the peripheral, we don't track where dlserver is
active or not directly and an explicit tracking could solve this
issue. But the root cause is a little more deep and I think I
understood the real cause. I have a potential fix and doing more
testing to verify. Will send the fix out soon after a bit more
verification
Thanks,
Vineeth