Re: [PATCH] tpm: Make timeout logic simpler and more robust
From: Mimi Zohar
Date: Tue Mar 12 2019 - 16:56:37 EST
On Tue, 2019-03-12 at 20:08 +0000, Calvin Owens wrote:
> On Tuesday 03/12 at 13:04 -0400, Mimi Zohar wrote:
> > On Mon, 2019-03-11 at 16:54 -0700, Calvin Owens wrote:
> > > We're having lots of problems with TPM commands timing out, and we're
> > > seeing these problems across lots of different hardware (both v1/v2).
> > >
> > > I instrumented the driver to collect latency data, but I wasn't able to
> > > find any specific timeout to fix: it seems like many of them are too
> > > aggressive. So I tried replacing all the timeout logic with a single
> > > universal long timeout, and found that makes our TPMs 100% reliable.
> > >
> > > Given that this timeout logic is very complex, problematic, and appears
> > > to serve no real purpose, I propose simply deleting all of it.
> >
> > Normally before sending such a massive change like this, included in
> > the bug report or patch description, there would be some indication as
> > to which kernel introduced a regression. Has this always been a
> > problem? Is this something new? How new?
>
> Honestly we've always had problems with flakiness from these devices,
> but it seems to have regressed sometime between 4.11 and 4.16.
Well, that's a start. ÂAround 4.10 is when we started noticing TPM
performance issues due to the change in the kernel timer scheduling.
ÂThis resulted in commitÂa233a0289cf9 ("tpm: msleep() delays - replace
with usleep_range() in i2c nuvoton driver"), which was upstreamed in
4.12.
At the other end, James was referring to commit "424eaf910c32 tpm:
reduce polling time to usecs for even finer granularity", which was
introduced in 4.18.
>
> I wish a had a better answer for you: we need on the order of a hundred
> machines to see the difference, and setting up these 100+ machine tests
> is unfortunately involved enough that e.g. bisecting it just isn't
> feasible :/
> What I can say for sure is that this patch makes everything much better
> for us. If there's anything in particular you'd like me to test, I have
> an army of machines I'm happy to put to use, let me know :)
I would assume not all of your machines are the same nor have the same
TPM. ÂCould you verify that this problem is across the board, not
limited to a particular TPM.
BTW, are you seeing this problem with both TPM 1.2 or 2.0?
thanks!
Mimi