Problems with TPM timeouts

From: Jonathan McDowell
Date: Wed Oct 02 2024 - 13:07:09 EST


We have been seeing a large number of TPM transmit problems across our
fleet, with frequent

tpm tpm0: tpm_try_transmit: send(): error -62

errors being logged. I don't have an on-demand reproducer, which makes
diagnosis difficult. In almost all cases it's a transient issue, and a
subsequent attempt to execute a command succeeds, but especially when
the kernel resource broker is involved that can still cause problems, as
the kernel is not doing retries here. Uptime does not seem to be a
factor.

This is not yet using the new HMAC session bits; kernels affected range
from at least 6.9 back to 5.12. Historically we've not paid attention to
TPMs long after initial boot; these days we're looking at them
throughout the uptime of the machine, so perhaps we're discovering
something that's been latent for a while.

I have a few things to try, which I'll describe below, but running
through them will take several months due to the difficulties in trying
to track the issue down over a production fleet. I'm posting here in
case anyone has any insight or ideas I might have missed.

First, I've seen James' post extending the TPM timeouts back in 2018
(https://lore.kernel.org/linux-integrity/1531329074.3260.9.camel@xxxxxxxxxxxxxxxxxxxxx/),
which doesn't seem to have been picked up. Was an alternative resolution
found, or are you still using this, James?

That was for a Nuvoton device; ours are Infineon devices. The behaviour
is not firmware-specific; we see the problem with the latest 7.85
firmware as well as the older 7.62.

Things we are going to try:

* Direct usage of /dev/tpm0 rather than /dev/tpmrm0. This is not a long
term solution as we want multiple processes to be able to access the
TPM, but is easier to deploy. The expectation is this will lower the
number of issues due to fewer TPM commands being executed, but that
this is not the root cause.

* Retrying command submission on status timeout. We've had details of
an erratum where the status register can become stuck, with the
workaround being command resubmission. I've got a patch for this ready
to test - I'll follow up to this mail with it, but need to actually roll
it out and test before I'll submit it for inclusion.

* Instrumenting other timeout points to see if we're hitting a
different timeout.
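
To illustrate the resubmission idea only, here is a minimal userspace
sketch in Python. It is not the kernel patch mentioned above, and it
assumes the timeout surfaces to the caller as an OSError carrying ETIME
(errno 62 on Linux, matching the -62 in the log line). The names
transmit_with_retry, transmit, retries and delay_s are hypothetical; the
transmit callable stands in for whatever actually submits the command.

```python
import errno
import time


def transmit_with_retry(transmit, command, retries=3, delay_s=0.05):
    """Call transmit(command), resubmitting if it fails with ETIME.

    `transmit` is a hypothetical callable supplied by the caller; it is
    assumed to raise OSError with errno set to ETIME on a TPM timeout.
    """
    for attempt in range(retries + 1):
        try:
            return transmit(command)
        except OSError as exc:
            # Retry only the timeout case; re-raise any other error,
            # and give up once the retry budget is exhausted.
            if exc.errno != errno.ETIME or attempt == retries:
                raise
            time.sleep(delay_s)
```

Limiting the retry to ETIME means any other failure is still reported
straight away, so the retry does not mask unrelated problems.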

J.

--
101 things you can't have too much of : 8 - Hard drive space.