Re: [PATCH 1/2] tpm, tpm_tis: Handle interrupt storm

From: Lino Sanfilippo
Date: Thu May 25 2023 - 19:45:38 EST


Hi,

On 24.05.23 05:58, Jarkko Sakkinen wrote:

>> Commit e644b2f498d2 ("tpm, tpm_tis: Enable interrupt test") enabled
>> interrupts instead of polling on all capable TPMs. Unfortunately, on some
>> products the interrupt line is either never asserted or never deasserted.
>
> Use Reported-by and Closes tag and remove this paragraph.
>
> In Closes link instead from lore the email where the bug was reported.

Ok

>
>> The former causes interrupt timeouts and is detected by
>> tpm_tis_core_init(). The latter results in interrupt storms.
>
> Please describe instead the system where the bug was realized. Don't
> worry about speculative descriptions. We only deal with ones actually
> realized.
>
>> Recent reports concern the Lenovo ThinkStation P360 Tiny, Lenovo ThinkPad
>> L490 and Inspur NF5180M6:
>>
>> https://lore.kernel.org/linux-integrity/20230511005403.24689-1-jsnitsel@xxxxxxxxxx/
>> https://lore.kernel.org/linux-integrity/d80b180a569a9f068d3a2614f062cfa3a78af5a6.camel@xxxxxxxxxx/
>
> Please remove all of this, as the fixes have been handled. Let's keep
> the commit message focused.
>

Ok.

>> The current approach to avoid those storms is to disable interrupts by
>> adding a DMI quirk for the concerned device.
>>
>> However this is a maintenance burden in the long run, so use a generic
>> approach:
>>
>> Detect an interrupt storm by counting the number of unhandled interrupts
>> within a 10 ms time interval. In case that more than 1000 were unhandled
>> deactivate interrupts, deregister the handler and fall back to polling.
>>
>> This equals the implementation that handles interrupt storms in
>> note_interrupt() by means of timestamps and counters in struct irq_desc.
>> However the function to access this structure is private so the logic has
>> to be reimplemented in the TPM TIS core.
>
> I only now found out that this is based on kernel/irq/spurious.c code
> partly. Why this was unmentioned?
>
> That would make this already more legitimate because it is based
> on field tested metrics.
>> Then we only have to discuss about counter.

I mentioned this in one of my responses:
https://lore.kernel.org/all/da435e0d-5f22-fac7-bc10-96a0fd4c6d54@xxxxxxxxxx/

>
>> routine trigger a worker thread that executes the unregistration.
>>
>> Suggested-by: Lukas Wunner <lukas@xxxxxxxxx>
>> Signed-off-by: Lino Sanfilippo <l.sanfilippo@xxxxxxxxxx>
>> ---
>> drivers/char/tpm/tpm_tis_core.c | 71 +++++++++++++++++++++++++++++++--
>> drivers/char/tpm/tpm_tis_core.h | 6 +++
>> 2 files changed, 74 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/char/tpm/tpm_tis_core.c b/drivers/char/tpm/tpm_tis_core.c
>> index 558144fa707a..458ebf8c2f16 100644
>> --- a/drivers/char/tpm/tpm_tis_core.c
>> +++ b/drivers/char/tpm/tpm_tis_core.c
>> @@ -752,6 +752,55 @@ static bool tpm_tis_req_canceled(struct tpm_chip *chip, u8 status)
>> return status == TPM_STS_COMMAND_READY;
>> }
>>
>> +static void tpm_tis_handle_irq_storm(struct tpm_chip *chip)
>> +{
>> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
>> + int intmask = 0;
>> +
>> + dev_err(&chip->dev, HW_ERR
>> + "TPM interrupt storm detected, polling instead\n");
>
> Degrading this to warn is fine because it is legit behaviour in a
> sense.
>

Agreed.

>> +
>> + tpm_tis_read32(priv, TPM_INT_ENABLE(priv->locality), &intmask);
>> +
>> + intmask &= ~TPM_GLOBAL_INT_ENABLE;
>> +
>> + tpm_tis_request_locality(chip, 0);
>> + tpm_tis_write32(priv, TPM_INT_ENABLE(priv->locality), intmask);
>> + tpm_tis_relinquish_locality(chip, 0);
>> +
>> + chip->flags &= ~TPM_CHIP_FLAG_IRQ;
>> +
>> + /*
>> + * We must not call devm_free_irq() from within the interrupt handler,
>
> Never use "we" form. Always use either:
>
> 1. Imperative
> 2. Passive
>
> I.e. to address this, you would write instead "devm_free_irq() must not
> be called within the interrupt handler because ...".
>

Ok.

>> + * since this function waits for running interrupt handlers to finish
>> + * and thus it would deadlock. Instead trigger a worker that does the
>> + * unregistration.
>> + */
>> + schedule_work(&priv->free_irq_work);
>> +}
>> +
>> +static void tpm_tis_process_unhandled_interrupt(struct tpm_chip *chip)
>> +{
>> + const unsigned int MAX_UNHANDLED_IRQS = 1000;
>> + struct tpm_tis_data *priv = dev_get_drvdata(&chip->dev);
>
> Reverse order and add empty line.
>

Ok.

>> + /*
>> + * The worker to free the TPM interrupt (free_irq_work) may already
>> + * be scheduled, so make sure it is not scheduled again.
>> + */
>> + if (!(chip->flags & TPM_CHIP_FLAG_IRQ))
>> + return;
>> +
>> + if (time_after(jiffies, priv->last_unhandled_irq + HZ/10))
>> + priv->unhandled_irqs = 1;
>> + else
>> + priv->unhandled_irqs++;
>> +
>> + priv->last_unhandled_irq = jiffies;
>> +
>> + if (priv->unhandled_irqs > MAX_UNHANDLED_IRQS)
>> + tpm_tis_handle_irq_storm(chip);
>
> Why wouldn't we switch to polling mode even when there is a single
> unhandled IRQ?
>

As Lukas wrote, not handling an interrupt may be legit if it was raised
by a device that shares the interrupt line with the TPM.

Regards,
Lino