RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI) enabled on some platform

From: Song Bao Hua (Barry Song)
Date: Sat Feb 13 2021 - 18:33:35 EST




> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Sunday, February 14, 2021 11:13 AM
> To: 'Arnd Bergmann' <arnd@xxxxxxxxxx>
> Cc: tglx@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; arnd@xxxxxxxx;
> geert@xxxxxxxxxxxxxx; funaho@xxxxxxxxx; philb@xxxxxxx; corbet@xxxxxxx;
> mingo@xxxxxxxxxx; linux-m68k@xxxxxxxxxxxxxxxxxxxx;
> fthain@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: RE: [RFC] IRQ handlers run with some high-priority interrupts(not NMI)
> enabled on some platform
>
>
>
> > -----Original Message-----
> > From: Arnd Bergmann [mailto:arnd@xxxxxxxxxx]
> > Sent: Sunday, February 14, 2021 5:32 AM
> > To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> > Cc: tglx@xxxxxxxxxxxxx; gregkh@xxxxxxxxxxxxxxxxxxx; arnd@xxxxxxxx;
> > geert@xxxxxxxxxxxxxx; funaho@xxxxxxxxx; philb@xxxxxxx; corbet@xxxxxxx;
> > mingo@xxxxxxxxxx; linux-m68k@xxxxxxxxxxxxxxxxxxxx;
> > fthain@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> > Subject: Re: [RFC] IRQ handlers run with some high-priority interrupts(not NMI)
> > enabled on some platform
> >
> > On Sat, Feb 13, 2021 at 12:50 AM Song Bao Hua (Barry Song)
> > <song.bao.hua@xxxxxxxxxxxxx> wrote:
> >
> > > So I was actually trying to warn this unusual case - interrupts
> > > get nested while both in_hardirq() and irqs_disabled() are true.
> > >
> > > diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
> > > index 7c9d6a2d7e90..b8ca27555c76 100644
> > > --- a/include/linux/hardirq.h
> > > +++ b/include/linux/hardirq.h
> > > @@ -32,6 +32,7 @@ static __always_inline void rcu_irq_enter_check_tick(void)
> > > */
> > > #define __irq_enter() \
> > > do { \
> > > + WARN_ONCE(in_hardirq() && irqs_disabled(), "nested interrupts\n"); \
> > > preempt_count_add(HARDIRQ_OFFSET); \
> >
> > That seems to be a rather heavyweight change in a critical path.
> >
> > A more useful change might be to implement lockdep support for m68k
> > and see if that warns about any actual problems. I'm not sure
> > what is actually missing for that, but these are the commits that
> > added it for other architectures in the past:
> >
> > 3c4697982982 ("riscv: Enable LOCKDEP_SUPPORT & fixup TRACE_IRQFLAGS_SUPPORT")
> > 000591f1ca33 ("csky: Enable LOCKDEP_SUPPORT")
> > 78cdfb5cf15e ("openrisc: enable LOCKDEP_SUPPORT and irqflags tracing")
> > 8f371c752154 ("xtensa: enable lockdep support")
> > bf2d80966890 ("microblaze: Lockdep support")
> >
>
> Yes. M68k lacks lockdep support which might be added.

BTW, m68k probably won't run into any problem with lockdep,
as it has been running this way for decades. Likewise,
interrupts were widely allowed to preempt irq handlers on all
platforms before IRQF_DISABLED was dropped and before commit
e58aa3d2d0cc ("genirq: Run irq handlers with interrupts
disabled"). We rarely ran into the stack overflow issue that
commit e58aa3d2d0cc mentioned at that time. Before those
commits, thousands of successful Linux products were already
running irq handlers with interrupts enabled.

So what is really confusing and painful to me is this:
for years, people like me have been writing device drivers
on the assumption that irq handlers run with interrupts
disabled, following those commits in genirq. So I have not
needed to care whether some other IRQ on the same CPU might
jump in and access the data the current IRQ handler
is accessing.

But it turns out that this assumption does not hold on some platforms.
So should I start writing device drivers with the new assumption that
interrupts can actually arrive while an irq handler is running?

That's the question which really bothers me.
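
To make the question concrete, here is a minimal userspace sketch of a
priority-mask CPU. It is only a model, not the kernel's code: every
model_* name, the 0x0700 mask layout, and the treatment of level 7 as
NMI are assumptions made for illustration. It shows how a predicate that
reports "disabled" as soon as any level is masked can be true while a
higher-priority request is still deliverable.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model (not kernel code): the status register holds an
 * interrupt priority level (IPL) from 0 to 7, and only requests above
 * that level are delivered. Level 7 is treated as non-maskable.
 */
#define MODEL_IPL_MASK  0x0700u
#define MODEL_IPL_SHIFT 8

static inline unsigned int model_ipl(unsigned int sr)
{
	return (sr & MODEL_IPL_MASK) >> MODEL_IPL_SHIFT;
}

/* Weaker predicate: "disabled" as soon as any level is masked. */
static inline bool model_irqs_disabled_weak(unsigned int sr)
{
	return model_ipl(sr) != 0;
}

/* Stricter predicate: "disabled" only when every maskable level is masked. */
static inline bool model_irqs_disabled_strict(unsigned int sr)
{
	return model_ipl(sr) == 7;
}

/* Can a request at the given level (1..7) interrupt the CPU right now? */
static inline bool model_irq_deliverable(unsigned int sr, unsigned int level)
{
	return level == 7 || level > model_ipl(sr);
}

/* Status register value while a handler for the given level is running. */
static inline unsigned int model_enter_handler(unsigned int sr, unsigned int level)
{
	assert(level >= 1 && level <= 7);
	return (sr & ~MODEL_IPL_MASK) | (level << MODEL_IPL_SHIFT);
}
```

With the model's IPL at 2, model_irqs_disabled_weak() is true, yet
model_irq_deliverable(sr, 4) is also true, which is the nesting case
described above. Only the strict predicate matches the assumption that
driver authors have been making.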

>
> > > And I also think it is better for m68k's arch_irqs_disabled() to
> > > return true only when both low and high priority interrupts are
> > > disabled rather than try to mute this warn in genirq by a weaker
> > > condition:
> > > if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled interrupts\n",
> > > irq, action->handler))
> > > local_irq_disable();
> > > }
> > >
> > > This warning is not triggered on m68k because its arch_irqs_disabled()
> > > returns true even though its high-priority interrupts are still enabled.
> >
> > Then it would just end up always warning when a nested hardirq happens,
> > right? That seems no different to dropping support for nested hardirqs
> > on m68k altogether, which of course is what you suggested already.
>
> This won't end up as a warning on other architectures like arm, arm64, x86, etc.,
> as interrupts won't arrive while arch_irqs_disabled() is true in hardIRQ.
> For example, I_BIT of CPSR of ARM is set:
> static inline int arch_irqs_disabled_flags(unsigned long flags)
> {
> return flags & IRQMASK_I_BIT;
> }
>
> So it would only give a backtrace on platforms whose arch_irqs_disabled()
> returns true while only some interrupts are disabled and others are still
> open, so nested interrupts can arrive without any explicit code to enable
> interrupts.
>
> This warning seems to give a consistent interpretation of what "Run irq
> handlers with interrupts disabled" means in commit e58aa3d2d0cc ("genirq:
> Run irq handlers with interrupts disabled").
>
> >
> > Arnd
>

Thanks
Barry