Re: [PATCH v3] platform/chrome: Use proper protocol transfer function
From: Shawn N
Date: Tue Oct 10 2017 - 11:33:10 EST
On Tue, Oct 10, 2017 at 6:35 AM, Jon Hunter <jonathanh@xxxxxxxxxx> wrote:
>
> On 26/09/17 00:15, Shawn N wrote:
>> On Wed, Sep 20, 2017 at 1:22 PM, Shawn N <shawnn@xxxxxxxxxx> wrote:
>>> On Tue, Sep 19, 2017 at 11:13 PM, Brian Norris <briannorris@xxxxxxxxxxxx> wrote:
>>>> Hi,
>>>>
>>>> On Tue, Sep 19, 2017 at 11:05:38PM -0700, Shawn N wrote:
>>>>> This is failing because our EC_CMD_GET_PROTOCOL_INFO host command is
>>>>> getting messed up, or the reply buffer is getting corrupted somehow.
>>>>>
>>>>> ec_dev->proto_version =
>>>>> min(EC_HOST_REQUEST_VERSION,
>>>>> fls(proto_info->protocol_versions) - 1);
>>>>>
>>>
>>> Checking this closer, the first host command we send after we boot the
>>> kernel (EC_CMD_GET_PROTOCOL_INFO) is failing due to protocol error
>>> (see 'SPI rx bad data' / 'SPI not ready' on the EC console). Since
>>> this doesn't seem to happen on the Chromium OS nyan_big release
>>> kernel, I suggest to hook up a logic analyzer and see if the SPI
>>> master is doing something bad.
>>>
>>> The error handling in cros_ec_cmd_xfer_spi() is completely wrong and
>>> we return -EAGAIN / EC_RES_IN_PROGRESS, which the caller interprets
>>> "the host command was received by the EC and is currently being
>>> handled, poll status until completion". So the caller polls status
>>> with EC_CMD_GET_COMMS_STATUS, sees no host command is in progress
>>> (which is interpreted to mean "the host command I sent previously has
>>> now successfully completed"), and returns success. The problem here is
>>> that the initial host command was never received at all, and no reply
>>> was ever received, so our reply data is all zero.
>>>
>>> Two things need to be fixed here:
>>>
>>> 1) Find out why the first host command after boot is failing. Probe
>>> SPI pins and see what's going on.
>>> 2) Fix error handling so we properly return an error (or properly
>>> retry the entire command) when a protocol error occurs (I made some
>>> attempt in https://chromium-review.googlesource.com/385080/, probably
>>> I should revisit that).
>>
>> The below patch will fix error handling and will make things mostly
>> work on nyan_big, because we'll fall back to V2 protocol after the
>> initial failure. But we should still investigate why we're getting
>> errors on the first host command. We aren't seeing these errors when
>> we send commands from firmware, so I suspect something is wrong in
>> kernel SPI HW initialization that causes the first command to fail.
> I have been looking into this a bit more and it appears to be timing
> related. I found that enabling some debug in the Tegra SPI driver the
> problem would go away and seems that adding a delay before sending the
> SPI message would also workaround the problem.
>
> Looking back at the Tegra Linux test history [0], it appears that after
> v4.10-rc5 [1] I start to see the following message which is the first
> indication of some SPI issues ...
>
> cros-ec-spi spi32766.0: packet too long (249 bytes, expected 4)
>
> I attempted to bisect this, but I was not successful because each time
> I would ended up somewhere different. I found that even with v4.9 I may
> see the issue 1 in 20 times and so I realised that I am not even sure
> when the problem really started or if has always been there. It seems
> that for older kernels it is harder to reproduce. I am wondering now if
> some timing has changed somewhere causing us to see the problem more
> frequently with newer kernels?
>
> I see the the cros-ec binding defines the following ...
>
> "google,cros-ec-spi-pre-delay: Some implementations of the EC need a
> little time to wake up from sleep before they can receive SPI transfers
> at a high clock rate. This property specifies the delay, in usecs,
> between the assertion of the CS to the start of the first clock pulse."
>
> I found that adding the following also worked around the problem ...
>
> diff --git a/arch/arm/boot/dts/tegra124-nyan.dtsi b/arch/arm/boot/dts/tegra124-nyan.dtsi
> index 5cf987b5401e..0baa6bfc0f36 100644
> --- a/arch/arm/boot/dts/tegra124-nyan.dtsi
> +++ b/arch/arm/boot/dts/tegra124-nyan.dtsi
> @@ -317,6 +317,7 @@
> interrupts = <TEGRA_GPIO(C, 7) IRQ_TYPE_LEVEL_LOW>;
> reg = <0>;
>
> + google,cros-ec-spi-pre-delay = <10>;
> google,cros-ec-spi-msg-delay = <2000>;
>
> i2c-tunnel {
>
> I have tried 50 boots with the above and I have seen no SPI failures
> on boot.
>
> I did look to see if it is possible to probe the SPI signals with a
> scope but from the schematics I am not sure if they are accessible or
> buried in the PCB.
>
> Is it possible that Tegra is sending the SPI message too soon for the
> EC?
I have not worked much with the cros_ec stm32 SPI host interface, but
I think it's possible. Also note that we define
"google,cros-ec-spi-pre-delay = <30>;" for the veyron family of
devices, which also use an stm32-family embedded controller. Some of
the folks on cc worked more on veyron and may have more insight.
I'm still not clear on why we see an error only on the first
transaction after boot. In this case, the embedded controller
previously handled host commands from firmware just fine, and the
handoff between firmware and the kernel shouldn't be significant, from
the EC point of view. If host command timing is consistent from the
master, I would expect to see some constant error rate, eg. some
chance any host command will fail, rather than the first host command
always failing.
>
> Cheers
> Jon
>
> [0] https://nvtb.github.io/linux/
> [1] https://nvtb.github.io/linux/test_v4.10-rc5/20170123023102/boot/tegra124-nyan-big/tegra124-nyan-big/tegra_defconfig_log.txt
>
> --
> nvpublic