Re: Kernel hang caused by commit "can: m_can: Start/Cancel polling timer together with interrupts"
From: Matthias Schiffer
Date: Wed Jul 10 2024 - 03:46:49 EST
On Tue, 2024-07-09 at 14:23 +0200, Markus Schneider-Pargmann wrote:
>
>
> Hi,
>
> On Wed, Jul 03, 2024 at 02:50:04PM GMT, Matthias Schiffer wrote:
> > On Tue, 2024-07-02 at 12:03 +0200, Matthias Schiffer wrote:
> > > On Tue, 2024-07-02 at 07:37 +0200, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > >
> > > >
> > > > On 01.07.24 16:34, Markus Schneider-Pargmann wrote:
> > > > > On Mon, Jul 01, 2024 at 02:12:55PM GMT, Linux regression tracking (Thorsten Leemhuis) wrote:
> > > > > > [CCing the regression list, as it should be in the loop for regressions:
> > > > > > https://docs.kernel.org/admin-guide/reporting-regressions.html]
> > > > > >
> > > > > > Hi, Thorsten here, the Linux kernel's regression tracker. Top-posting
> > > > > > for once, to make this easily accessible to everyone.
> > > > > >
> > > > > > > Hmm, looks like there was not even a single reply to the regression
> > > > > > > report below. But it also seems Markus hasn't posted anything archived on
> > > > > > > Lore for about three weeks now, so he might be on vacation.
> > > > > > >
> > > > > > > Marc, do you happen to have an idea what's wrong with the culprit? Or do we
> > > > > > > expect Markus to be back in action soon?
> > > > >
> > > > > Great, ping here.
> > > >
> > > > Thx for replying!
> > > >
> > > > > @Matthias: Thanks for debugging and sorry for breaking it. If you have a
> > > > > fix for this, let me know. I have a lot of work right now, so I am not
> > > > > sure when I will have a proper fix ready. But it is on my todo list.
> > > >
> > > > Thx. This made me wonder: is "revert the culprit to resolve this quickly
> > > > and reapply it later together with a fix" something that we should
> > > > consider if a proper fix takes some time? Or is this not worth it in
> > > > this case or extremely hard? Or would it cause a regression on its own
> > > > for users of 6.9?
> > > >
> > > > Ciao, Thorsten
> > >
> > > Hi,
> > >
> > > I think on 6.9 a revert is not easily possible (without reverting several other commits adding new
> > > features), but it should be considered for 6.6.
> > >
> > > I don't think further regressions are possible by reverting, as on 6.6 the timer is only used for
> > > platforms without an m_can IRQ, and on these platforms the current behavior is "the kernel
> > > reproducibly deadlocks in atomic context", so there is not much room for making it worse.
> > >
> > > Like Markus, I have writing a proper fix for this on my TODO list, but I'm not sure when I can get
> > > to it - hopefully next week.
> > >
> > > Best regards,
> > > Matthias
> >
> > A small update from my side:
> >
> > I had a short look into the issue today, but I've found that I don't quite grasp the (lack of)
> > locking in the m_can driver. The m_can_classdev fields active_interrupts and irqstatus are accessed
> > from a number of different contexts:
> >
> > - active_interrupts is *mostly* read and written from the ISR/hrtimer callback, but also from
> > m_can_start()/m_can_stop() and (in error paths) indirectly from m_can_poll() (NAPI callback). It is
> > not clear to me whether start/stop/poll could race with the ISR on a different CPU. Besides being
> > used for ndo_open/stop, m_can_start/stop also happen from PM callbacks.
> > - irqstatus is written from the ISR (or hrtimer callback) and read from m_can_poll() (NAPI callback)
> >
> > Is this correct without explicit synchronization, or should there be some locking or atomics for
> > these accesses?
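
For illustration only, one possible shape for the irqstatus handoff between the ISR/hrtimer
callback and the NAPI poll, written as a small userspace C11 sketch. struct model_dev and both
helpers are hypothetical, and accumulating bits with OR is a choice of this model, not a
description of the driver. In the kernel one would use its own primitives (for example
READ_ONCE()/WRITE_ONCE(), atomic_t, or a spinlock) rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared field; not struct m_can_classdev. */
struct model_dev {
	_Atomic uint32_t irqstatus;
};

/* "ISR" side: publish newly seen status bits without losing earlier ones. */
static void model_isr_publish(struct model_dev *dev, uint32_t bits)
{
	atomic_fetch_or_explicit(&dev->irqstatus, bits, memory_order_release);
}

/* "poll" side: take all pending bits and clear them in one atomic step. */
static uint32_t model_poll_consume(struct model_dev *dev)
{
	return atomic_exchange_explicit(&dev->irqstatus, 0, memory_order_acquire);
}
```

With this shape, a status word published on one CPU is either seen completely by the consumer
or not yet at all. Whether the driver needs this, or a lock that also covers active_interrupts,
is exactly the open question.
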
>
> Thanks for pointing these out. I started creating some fixes for some of
> the patches. Not done yet, but I am working on it.
>
> Best,
> Markus

Hi Markus,

thanks for the update. I'm going to be out of office from Jul 12-26, so I will only be able to test
fixes when I'm back.

Best regards,
Matthias

>
> >
> > Best regards,
> > Matthias
> >
> >
> >
> > >
> > >
> > >
> > > >
> > > > > > On 18.06.24 18:12, Matthias Schiffer wrote:
> > > > > > > Hi Markus,
> > > > > > >
> > > > > > > we've found that recent kernels hang on the TI AM62x SoC (where no m_can interrupt is available and
> > > > > > > thus the polling timer is used), always a few seconds after the CAN interfaces are set up.
> > > > > > >
> > > > > > > I have bisected the issue to commit a163c5761019b ("can: m_can: Start/Cancel polling timer together
> > > > > > > with interrupts"). Both master and 6.6 stable (which received a backport of the commit) are
> > > > > > > affected. On 6.6 the commit is easy to revert, but on master a lot has happened on top of that
> > > > > > > change.
> > > > > > >
> > > > > > > As far as I can tell, the reason is that hrtimer_cancel() tries to cancel the timer synchronously,
> > > > > > > which will deadlock when called from the hrtimer callback itself (hrtimer_callback -> m_can_isr ->
> > > > > > > m_can_disable_all_interrupts -> hrtimer_cancel).
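
To illustrate the call chain above, here is a minimal userspace model (not kernel code) of a
timer whose callback cancels itself synchronously. Everything prefixed model_ and both callbacks
are hypothetical; only the -1/0/1 return convention is borrowed from hrtimer_try_to_cancel().
The second callback shows one possible alternative (stopping via the return value), which is an
assumption on my side and not necessarily the fix the driver will get.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel hrtimer types. */
enum model_restart { MODEL_NORESTART, MODEL_RESTART };

struct model_timer {
	bool armed;            /* queued to fire again */
	bool callback_running; /* callback is executing right now */
	bool irqs_enabled;     /* stands in for "interrupts are enabled" */
	bool would_deadlock;   /* set when a synchronous cancel could never return */
	enum model_restart (*fn)(struct model_timer *t);
};

/* Same return convention as hrtimer_try_to_cancel():
 * -1 callback currently running, 0 timer not armed, 1 timer cancelled. */
static int model_try_to_cancel(struct model_timer *t)
{
	if (t->callback_running)
		return -1;
	if (!t->armed)
		return 0;
	t->armed = false;
	return 1;
}

/* A synchronous cancel waits while the result is -1. Called from the callback
 * itself, that state cannot change, so record it instead of spinning forever. */
static void model_cancel_sync(struct model_timer *t)
{
	if (model_try_to_cancel(t) < 0)
		t->would_deadlock = true;
}

/* Problematic shape: the callback disables interrupts and cancels its own timer. */
static enum model_restart cb_cancels_itself(struct model_timer *t)
{
	t->irqs_enabled = false;
	model_cancel_sync(t);
	return MODEL_RESTART;
}

/* Alternative: only record the state change; the return value stops the timer. */
static enum model_restart cb_returns_norestart(struct model_timer *t)
{
	t->irqs_enabled = false;
	return t->irqs_enabled ? MODEL_RESTART : MODEL_NORESTART;
}

/* Fire the timer once, the way a timer core would run an expired timer. */
static void model_fire(struct model_timer *t)
{
	assert(t->armed && !t->callback_running);
	t->armed = false;
	t->callback_running = true;
	enum model_restart r = t->fn(t);
	t->callback_running = false;
	t->armed = (r == MODEL_RESTART);
}
```

Firing a timer with cb_cancels_itself sets would_deadlock, which corresponds to the hang we see.
With cb_returns_norestart the timer simply ends up not armed and nothing has to wait.
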
> > > > > > >
> > > > > > > I can try to come up with a fix, but I think you are much more familiar with the driver code. Please
> > > > > > > let me know if you need any more information.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Matthias
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > >
> >
> > --
> > TQ-Systems GmbH | Mühlstraße 2, Gut Delling | 82229 Seefeld, Germany
> > Amtsgericht München, HRB 105018
> > Geschäftsführer: Detlef Schneider, Rüdiger Stahl, Stefan Schneider
> > https://www.tq-group.com/