Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
From: Doug Anderson
Date: Tue Oct 10 2017 - 12:53:07 EST
Hi,
On Tue, Oct 10, 2017 at 8:33 AM, Shawn N <shawnn@xxxxxxxxxxxx> wrote:
>> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> index 5cf987b5401e..0baa6bfc0f36 100644
>> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi
>> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
>> @@ -317,6 +317,7 @@
>> interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
>> reg = <0>;
>>
>> + google,cros-ec-spi-pre-delay = <10>;
>> google,cros-ec-spi-msg-delay = <2000>;
>>
>> i2c-tunnel {
>>
>> I have tried 50 boots with the above and I have seen no SPI failures
>> on boot.
>>
>> I did look to see if it is possible to probe the SPI signals with a
>> scope but from the schematics I am not sure if they are accessible or
>> buried in the PCB.
>>
>> Is it possible that Tegra is sending the SPI message too soon for the
>> EC?
>
> I have not worked much with the cros_ec stm32 SPI host interface, but
> I think it's possible. Also note that we define
> "google,cros-ec-spi-pre-delay = <30>;" for the veyron family of
> devices, which also use an stm32-family embedded controller. Some of
> the folks on cc worked more on veyron and may have more insight.
Alex is the expert here, but basically we enabled hibernation on the
EC for veyron so it needed a bunch of extra time to wakeup. Before
hibernation there was no known case where the delay was needed as far
as I understand. I don't think we had hibernation on tegra.
It's _possible_ that we sometimes need a delay even on tegra, but I'm
not sure. It's also possible that the extra delay just shifts things
around a little bit and changes the timing. Possibly at boot the
system is always extra busy and adding the 10 us delay is enough to
"reliably" get your transfer to happen at a non-busy time. It would
be interesting to see of a 10us delay _before_ asserting chip select
fixed things just as well. If a delay before asserting chip select
fixes things as well as a delay after asserting chip select then
you're probably just pushing things around and not doing a real fix.
One issue I _know_ we have is that we sometimes drop EC packets on the
floor because we get interrupted midway through the transfer and the
EC gives up on us before we get context switched back in. I think
this is more common at suspend/resume and at boot when the system is
typically very busy. That case can actually be made _worse_ by the
"cros-ec-spi-pre-delay", as described in
<https://chromium-review.googlesource.com/439713>. Even when we
removed delay on rk3399-gru-* we could still sometimes see transfer
errors and the agreed upon "fix" for this is to do the transfer in a
higher priority context. Some details are in the (unfortunately
private) <https://issuetracker.google.com/35579351>, but to copy the
important bits from the bug::
>From Mark Brown (upstream SPI maintainer) as long as there is no
contention on this SPI bus (there isn't for us) the transfer will
happen in the context of the calling process. That means "bumping"
the priority would work. ...but as per Brian bumping the priority is
ugly and we need to make sure we don't conflict with userspace. One
solution (inspired by Guenter) is to just _always_ use a high priority
worker to do the transfer and just block waiting for the worker to
finish. This should be easy, reliable, and clean even if it is
slightly awkward. This also shouldn't be too hard to do, we think.
NOTE: a really truly long term fix for this is to change the protocol
to have reliably retries. That's planned but (as far as I can tell)
stalled. Details in the bug
<https://bugs.chromium.org/p/chromium/issues/detail?id=678675>
> I'm still not clear on why we see an error only on the first
> transaction after boot. In this case, the embedded controller
> previously handled host commands from firmware just fine, and the
> handoff between firmware and the kernel shouldn't be significant, from
> the EC point of view. If host command timing is consistent from the
> master, I would expect to see some constant error rate, eg. some
> chance any host command will fail, rather than the first host command
> always failing.
The AP itself is often quite busy at boot and so the timings for
everything change. That could easily explain the problems.
-Doug