Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

From: Jerry Snitselaar
Date: Tue May 23 2023 - 18:33:49 EST


On Tue, May 23, 2023 at 10:12:49PM +0300, Jarkko Sakkinen wrote:
> On Tue May 23, 2023 at 9:53 PM EEST, Jarkko Sakkinen wrote:
> > On Mon May 22, 2023 at 5:31 PM EEST, Lino Sanfilippo wrote:
> > > From: Lino Sanfilippo <l.sanfilippo@xxxxxxxxxx>
> > >
> > > Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled
> > > interrupts instead of polling on all capable TPMs. Unfortunately, on some
> > > products the interrupt line is either never asserted or never deasserted.
> > >
> > > The former causes interrupt timeouts and is detected by
> > > tpm_tis_core_init(). The latter results in interrupt storms.
> > >
> > > Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad
> > > L490 and Inspur NF5180M6:
> > >
> > > https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@xxxxxxxxxx/
> > > https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
> > >
> > > The current approach to avoid those storms is to disable interrupts by
> > > adding a DMI quirk for the affected device.
> > >
> > > However, this is a maintenance burden in the long run, so use a generic
> > > approach:
> >
> > I'm trying to understand how you evaluate how big a maintenance burden
> > this would be. Adding even a few dozen table entries is not a
> > maintenance burden.

I do worry about how many cases will be reported once 6.4 is released
and this eventually makes its way into distributions. In either case
the DMI table will need to be maintained. The UPX-11i case is a
different issue, and IIRC the L490 needed a DMI entry because trying
to catch the IRQ storm wasn't solving the issue there. I imagine other
odd cases will be popping up as well.

So far we have two IRQ storm reports: peterz's P360 Tiny and, I guess,
that Inspur system reported by the kernel test robot. Then there is
whatever is going on with Peter Ujfalusi's UPX-11i.

> >
> > On the other hand, any new functionality is objectively a maintenance
> > burden of some measure (this applies to any functionality). So how do we
> > know that taking this change is less of a maintenance burden than just
> > adding new table entries as they come up?
> >
> > > Detect an interrupt storm by counting the number of unhandled interrupts
> > > within a 10 ms time interval. If more than 1000 were unhandled,
> > > deactivate interrupts, deregister the handler and fall back to polling.
> >
> > I know it can sometimes be hard to evaluate, but can you try to explain
> > how you came up with the 10 ms sampling period and 1000 interrupt
> > threshold? I just don't like arbitrary numbers.
>
> Also here I wonder how you came up with this computational model. This
> is not the same as saying it is wrong. There's just a whole stack of options.
>
> Off the top of my head you could e.g. window-average the duration between
> IRQs. When the average goes beyond threshold, then you shut down
> interrupts.

Just to make sure I have it clear in my head, you mean when the
average is shorter than the threshold duration between interrupts,
yes? My brain wants to read 'When the average goes beyond threshold'
as 'threshold < average'.
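
Assuming I'm reading you right, something like the rough sketch below is
what I picture. None of this is from 1/2; the struct, helper name, and
the gap threshold and sample count are made up purely for illustration.

#include <linux/ktime.h>
#include <linux/types.h>

/*
 * Made-up threshold: declare a storm if IRQs arrive less than ~10 us
 * apart on average. Not a number from the patch.
 */
#define STORM_AVG_GAP_NS        (10 * NSEC_PER_USEC)
#define STORM_AVG_SAMPLES       64

struct tis_irq_stats {
        ktime_t         last_seen;      /* timestamp of the previous IRQ */
        s64             avg_gap_ns;     /* running average gap between IRQs */
        unsigned int    samples;
};

/*
 * Called from the interrupt handler; returns true once the averaged
 * inter-IRQ gap has dropped below the threshold, i.e. average < threshold.
 */
static bool tis_irq_storm(struct tis_irq_stats *st)
{
        ktime_t now = ktime_get();
        s64 gap;

        if (!st->samples++) {
                st->last_seen = now;
                return false;
        }

        gap = ktime_to_ns(ktime_sub(now, st->last_seen));
        st->last_seen = now;

        /* simple exponential moving average over ~STORM_AVG_SAMPLES IRQs */
        st->avg_gap_ns -= st->avg_gap_ns / STORM_AVG_SAMPLES;
        st->avg_gap_ns += gap / STORM_AVG_SAMPLES;

        if (st->samples < STORM_AVG_SAMPLES)
                return false;   /* let the average settle first */

        return st->avg_gap_ns < STORM_AVG_GAP_NS;
}

Of course the average gap threshold would still be a number we have to
pick, so the arbitrariness question doesn't go away, it just gets phrased
in time-between-interrupts terms.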

Does the check need to be a rolling window like 1/2 currently has? I
expect that if the problem exists it will be noticed in the first
window checked. I think what I originally tried was to check over some
interval from when the handler first ran, disable interrupts if
needed, and then skip the check on subsequent handler runs.
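
Roughly along these lines, i.e. count during one initial interval and
never look again afterwards. Again just a sketch with invented names; I
reuse the 10 ms / 1000 numbers from the 1/2 description purely for
illustration.

#include <linux/ktime.h>
#include <linux/types.h>

#define STORM_CHECK_MS          10      /* from the 1/2 description */
#define STORM_IRQ_LIMIT         1000    /* from the 1/2 description */

struct tis_storm_check {
        ktime_t         start;  /* first time the handler ran */
        unsigned int    count;  /* IRQs seen within the check interval */
        bool            done;   /* interval elapsed quietly, stop checking */
};

/*
 * Called from the interrupt handler; returns true if we should disable
 * the interrupt and fall back to polling. Only ever trips during the
 * first STORM_CHECK_MS after the handler first runs.
 */
static bool tis_storm_check_once(struct tis_storm_check *sc)
{
        if (sc->done)
                return false;

        if (!sc->count++) {
                sc->start = ktime_get();
                return false;
        }

        if (ktime_ms_delta(ktime_get(), sc->start) < STORM_CHECK_MS)
                return sc->count > STORM_IRQ_LIMIT;

        /* the interval passed without hitting the limit: never check again */
        sc->done = true;
        return false;
}

The obvious downside is that a storm which only starts later would be
missed, which may be why 1/2 keeps checking; I mainly want to understand
whether the rolling window is actually needed.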

Regards,
Jerry

>
> The pro I would see in this is that it is much easier to intuitively
> discuss how much time there should be between interrupts for the kernel
> to handle them, than how many IRQs you can stack into a time interval,
> which blows my head tbh.
>
> BR, Jarkko