Re: [PATCH v3] platform/chrome: Use proper protocol transfer function

From: Jon Hunter
Date: Tue Oct 10 2017 - 09:39:08 EST



On 26/09/17 00:15, Shawn N wrote:
> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@xxxxxxxxxx> wrote:
>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
>>> Hi,
>>>
>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>
>>>> ec_dev->proto_version =
>>>> min(EC_HOST_REQUEST_VERSION,
>>>> fls(proto_info->protocol_versions) - 1);
>>>>
>>
>> Checking this closer, the first host command we send after we boot the
>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>> this doesn't seem to happen on the Chromium OS nyan_big release
>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>> master is doing something bad.
>>
>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>> "the host command was received by the EC and is currently being
>> handled, poll status until completion". So the caller polls status
>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>> (which is interpreted to mean "the host command I sent previously has
>> now successfully completed"), and returns success. The problem here is
>> that the initial host command was never received at all, and no reply
>> was ever received, so our reply data is all zero.
>>
>> Two things need to be fixed here:
>>
>> 1) Find out why the first host command after boot is failing. Probe
>> SPI pins and see what's going on.
>> 2) Fix error handling so we properly return an error (or properly
>> retry the entire command) when a protocol error occurs (I made some
>> attempt in https://chromium-review.googlesource.com/385080/, probably
>> I should revisit that).
>
> The below patch will fix error handling and will make things mostly
> work on nyan_big, because we'll fall back to V2 protocol after the
> initial failure. But we should still investigate why we're getting
> errors on the first host command. We aren't seeing these errors when
> we send commands from firmware, so I suspect something is wrong in
> kernel SPI HW initialization that causes the first command to fail.
I have been looking into this a bit more and it appears to be timing
related. I found that enabling some debug in the Tegra SPI driver the
problem would go away and seems that adding a delay before sending the
SPI message would also workaround the problem.

Looking back at the Tegra Linux test history [0], it appears that after
v4.10-rc5 [1] I start to see the following message which is the first
indication of some SPI issues ...

cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4)

I attempted to bisect this, but I was not successful because each time
I would ended up somewhere different. I found that even with v4.9 I may
see the issue 1 in 20 times and so I realised that I am not even sure
when the problem really started or if has always been there. It seems
that for older kernels it is harder to reproduce. I am wondering now if
some timing has changed somewhere causing us to see the problem more
frequently with newer kernels?

I see the the cros-ec binding defines the following ...

"google,cros-ec-spi-pre-delay: Some implementations of the EC need a
little time to wake up from sleep before they can receive SPI transfers
at a high clock rate. This property specifies the delay, in usecs,
between the assertion of the CS to the start of the first clock pulse."

I found that adding the following also worked around the problem ...

diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
index 5cf987b5401e..0baa6bfc0f36 100644
--- a/arch/arm/boot/dts/tegra124-nyan.dtsi
+++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
@@ -317,6 +317,7 @@
interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
reg = <0>;

+ google,cros-ec-spi-pre-delay = <10>;
google,cros-ec-spi-msg-delay = <2000>;

i2c-tunnel {

I have tried 50 boots with the above and I have seen no SPI failures
on boot.

I did look to see if it is possible to probe the SPI signals with a
scope but from the schematics I am not sure if they are accessible or
buried in the PCB.

Is it possible that Tegra is sending the SPI message too soon for the
EC?

Cheers
Jon

[0] https://nvtb.github.io/linux/
[1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt

--
nvpublic