Re: [v6.12] WARNING: at kernel/sched/deadline.c:1995 enqueue_dl_entity (task blocked for more than 28262 seconds)
From: Joel Fernandes
Date: Fri Dec 06 2024 - 12:27:51 EST
On Fri, Dec 6, 2024 at 11:57 AM Vineeth Remanan Pillai
<vineeth@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Dec 6, 2024 at 10:18 AM Joel Fernandes <joel@xxxxxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Dec 04, 2024 at 01:47:44PM +0100, Ilya Maximets wrote:
> > > Hi. It seems like I'm hitting some bug in the scheduler.
> > >
> > > I'm running some tests with Open vSwitch on v6.12 kernel and some time
> > > 5 to 8 hours down the line I'm getting task blocked splats and I also
> > > have a WARNING triggered in the scheduler code right before that:
> > >
> > > Dec 3 22:19:55 kernel: WARNING: CPU: 27 PID: 3391271 at kernel/sched/deadline.c:1995 enqueue_dl_entity
> > >
> > > I have a lot of processes (kernel threads and userpsace threads) stuck
> > > in DN, Ds, D+ and D states. It feels like IO tasks are being scheduled,
> > > but scheduler never picks them up or they are not being scheduled at all
> > > for whatever reason, and threads waiting on these tasks are stuck.
> > >
> > > Dec 3 22:22:45 kernel: INFO: task khugepaged:330 blocked for more than 122 seconds.
> > > Dec 3 22:22:45 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 122 seconds.
> > > Dec 3 22:22:45 kernel: INFO: task mv:3483072 blocked for more than 122 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task khugepaged:330 blocked for more than 245 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3479822 blocked for more than 245 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3480383 blocked for more than 122 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3481787 blocked for more than 122 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task ovs-monitor-ips:3482631 blocked for more than 122 seconds.
> > > Dec 3 22:24:48 kernel: INFO: task mv:3483072 blocked for more than 245 seconds.
> > > Dec 3 22:26:51 kernel: INFO: task khugepaged:330 blocked for more than 368 seconds.
> > > ...
> > > Dec 4 06:11:45 kernel: INFO: task khugepaged:330 blocked for more than 28262 seconds.
> > >
> > > I have two separate instances where this behavior is reproduced. One is mostly
> > > around file systems, the other was more severe as multiple kernel threads got
> > > stuck in netlink code. The traces do not have much in common, except that most
> > > of blocked tasks are in scheduling. The system is also idle, nothing is really
> > > running. Some of these tasks are holding resources that make other tasks to
> > > block on those resources as well.
> > >
> > > I seem to be able to reproduce the issue, but it takes 5-8 hours to do so.
> > >
> >
> > CC'ing a few more from my team as well.
> >
> > We haven't seen such an issue with the DL server, but we are also testing on
> > slightly older kernels.
> >
> > Its coming from:
> > WARN_ON_ONCE(on_dl_rq(dl_se));
> >
>
> Thanks for including me Joel :-)
>
> I was able to reproduce this WARN_ON couple of days back with
> syzkaller. dlserver's dl_se gets enqueued during a update_curr while
> the dlserver is stopped. And subsequent dlserver start will cause a
> double enqueue. On the peripheral, we don't track where dlserver is
> active or not directly and an explicit tracking could solve this
> issue. But the root cause is a little more deep and I think I
> understood the real cause. I have a potential fix and doing more
> testing to verify. Will send the fix out soon after a bit more
> verification
Oh, so we _have_ seen this issue :-). Thanks Vineeth, looking forward
to your fix! By the way, I do remember now some variation of this that
happened a long time ago but I thought it was fixed.
- Joel