Re: [PATCH] serial: 8250: Avoid "too much work" from bogus rx timeout interrupt

From: Andy Shevchenko
Date: Mon Dec 19 2016 - 12:33:47 EST


On Mon, 2016-12-19 at 09:12 -0800, Doug Anderson wrote:
> Hi,
>
> On Mon, Dec 19, 2016 at 4:59 AM, Andy Shevchenko
> <andriy.shevchenko@xxxxxxxxxxxxxxx> wrote:
> > On Sun, 2016-12-18 at 17:14 -0800, Douglas Anderson wrote:
> > > On a Rockchip rk3399-based board during suspend/resume testing, we
> > > found that we could get the console UART into a state where it
> > > would
> > > print this to the console a lot:
> > > Â serial8250: too much work for irq42
> >
> > Have you read the following discussion
> > https://www.spinics.net/lists/kernel/msg2059543.html
>
> No, I wasn't aware of that discussion.ÂÂYup, basically the exact same
> thing is happening here.ÂÂGood to know I'm not alone.ÂÂAny idea if the
> Baytrail UART is also based on DesignWare IP?

Yes. Almost all Intel HW is using DesignWare IP for HS UARTs.

> In that thread, Peter said:
>
> > I think there is every likelihood of spurious RX timeout interrupts
> > tripping this patch, sorry.
> >
> > Unfortunately, I think UART_BUG_ is the only viable possibility.
> > Or perhaps fixing the port type as PORT_8250 (thus disabling the
> > fifos).
>
> My change is slightly different than California's in that I'm actually
> throwing away the bogus byte and his patch was treating it as a valid
> byte.ÂÂI don't know if that makes the patch more or less palatable.

We need to test, especially in DMA case.

> I would hate to lose access to the FIFOs just due to this weird corner
> case.
>
> Do we really think there's a case where there's an RX Timeout
> interrupt w/ no "data ready" but that later the data ready will show
> up?ÂÂCan you quantify how much later you think it will show up?ÂÂIf we
> can quantify how much longer the data will show up in then we should
> probably just do a timeout loop right where I added my patch.
>
> Specifically, here's what's happening today with RX Timeout interrupt
> without "data ready":
>
> 1. We'll get the interrupt
> 2. We won't do _anything_ to service the interrupt.
> 3. We'll return back to serial8250_interrupt(), where we'll keep
> looping until we get "too much work"
> 4. We'll break out, but the interrupt will still be active.
> 5. Go back to #1
>
> ...and since this interrupt will keep firing and firing and firing
> with no delay in-between, we'll effectively lock the CPU up.

And the root cause of that is... ?

> If there are some UARTs that eventually get themselves out of this
> state by asserting "data ready" then the above won't be an "infinite"
> loop but it will effectively be a tight loop where we won't let
> userspace run and won't service other interrupts until we actually get
> the data ready.ÂÂSince we're already blocking everything else, it
> seems like it might be better to directly loop in
> serial8250_handle_irq() with a timeout of some sort (how long?ÂÂ100
> us?ÂÂ1 ms?).ÂÂThen we if we get the timeout then we can do the read
> and safely work ourselves free.

What I think is that the root cause of this is still unknown and either
above looks like a hack.

--
Andy Shevchenko <andriy.shevchenko@xxxxxxxxxxxxxxx>
Intel Finland Oy