Re: [REGRESSION] suspend to ram fails in 6.2-rc1 due to tpm errors
From: Jason A. Donenfeld
Date: Wed Dec 28 2022 - 23:04:05 EST
On Wed, Dec 28, 2022 at 06:07:25PM -0500, James Bottomley wrote:
> On Wed, 2022-12-28 at 21:22 +0100, Vlastimil Babka wrote:
> > Ugh, while the problem [1] was fixed in 6.1, it's now happening again
> > on the T460 with 6.2-rc1. Except I didn't see any oops message or
> > "tpm_try_transmit" error this time. The first indication of a problem
> > is this during a resume from suspend to ram:
> >
> > tpm tpm0: A TPM error (28) occurred continue selftest
> >
> > and then periodically
> >
> > tpm tpm0: A TPM error (28) occurred attempting get random
>
> That's a TPM 1.2 error which means the TPM failed the selftest. The
> original problem was reported against TPM 2.0 because of a missing
> try_get_ops().
No, I'm pretty sure the original bug, which was fixed by "char: tpm:
Protect tpm_pm_suspend with locks" regards 1.2 as well, especially
considering it's the same hardware from Vlastimil causing this. I also
recall seeing this in 1.2 when I ran this with the TPM emulator. So
that's not correct.
> The tpm 1.2 command path was never changed to require
> this (and in fact hasn't changed for ages, TPM 1.2 being a bit
> obsolete).
False. I'll just copy and paste the diff of that commit:
diff --git a/drivers/char/tpm/tpm-interface.c b/drivers/char/tpm/tpm-interface.c
index 1621ce818705..d69905233aff 100644
--- a/drivers/char/tpm/tpm-interface.c
+++ b/drivers/char/tpm/tpm-interface.c
@@ -401,13 +401,14 @@ int tpm_pm_suspend(struct device *dev)
!pm_suspend_via_firmware())
goto suspended;
- if (!tpm_chip_start(chip)) {
+ rc = tpm_try_get_ops(chip);
+ if (!rc) {
if (chip->flags & TPM_CHIP_FLAG_TPM2)
tpm2_shutdown(chip, TPM2_SU_STATE);
else
rc = tpm1_pm_suspend(chip, tpm_suspend_pcr);
- tpm_chip_stop(chip);
+ tpm_put_ops(chip);
}
suspended:
So, as you can see, this affects the call to tpm1_pm_suspend.
> So this looks like a new problem with TPM 1.2 and
> suspend/resume, likely in the BIOS of your system.
No, this is not a BIOS problem. This is a kernel bug in the TPM 1.2
driver. Yes, it'd be very nice to wave this all away and blame it on the
BIOS, but I really don't think that's the case, especially considering
this is all reproducible in the emulator.
I recall seeing something pretty similar to this report with the
selftest as well. Basically, the call to tpm1_do_selftest can race with
the call to tpm1_get_random, presumably because tpm1_get_random doesn't
do any locking on its own. So it might be a good idea to make sure that
tpm1_get_random() isn't running before tpm1_do_selftest() or any other
TPM commands run.
I spent a long time working through the TPM code when this came up
during 6.1. I set up the TPM emulator with QEMU and reproduced this and
had a whole test setup and S3 fuzzer. It took a long time, and when I was
done, I paged it all out of my brain. When I found that patch from Jan
that fixed the problem most of the time (but not all the time), I wasted
tons of time trying to get the TPM maintainers to take the patch and
send it to Linus as part of rc7 or rc8. But they all ignored me, and
eventually Linus just took that patch directly.
So I don't think I want to go down another rabbit hole here, having
experienced the TPM maintainers not really caring much, and that sucking
away the remaining energy I had before to keep looking at the issue and
its edge cases not handled by Jan's patch.
On the contrary, it'd make a big difference if the TPM maintainers could
actually help analyze the code that they're most familiar with, so that
we can get to the bottom of this. That's a lot better than some random
drive-by patches from a non-TPM person like me; before the 6.1 bug, I'd
never even looked at these drivers.
My plan is to therefore be available to help and analyze and test and
maybe even write some code, if the TPM maintainers take the lead on
getting to the bottom of this. But if this hits neglect again like last
time, I'll just send a `depends on BROKEN if PM` patch to the TPM
hw_random driver and see what happens... That's a really awful solution
though, so I hope the maintainers will wake up this cycle.
Regards,
Jason