Re: [PATCH v2 1/2] hwmon: add ChromeOS EC driver

From: Guenter Roeck
Date: Wed May 29 2024 - 13:00:49 EST


On Wed, May 29, 2024 at 12:40 AM Stephen Horvath
<s.horvath@xxxxxxxxxxxxxx> wrote:
>
> Hi Thomas,
>
> On 29/5/24 16:23, Thomas Weißschuh wrote:
> > On 2024-05-29 10:58:23+0000, Stephen Horvath wrote:
> >> On 29/5/24 09:29, Guenter Roeck wrote:
> >>> On 5/28/24 09:15, Thomas Weißschuh wrote:
> >>>> On 2024-05-28 08:50:49+0000, Guenter Roeck wrote:
> >>>>> On 5/27/24 17:15, Stephen Horvath wrote:
> >>>>>> On 28/5/24 05:24, Thomas Weißschuh wrote:
> >>>>>>> On 2024-05-25 09:13:09+0000, Stephen Horvath wrote:
> >>>>>>>> Don't forget it can also return `EC_FAN_SPEED_STALLED`.
> >
> > <snip>
> >
> >>>>>>>
> >>>>>>> Thanks for the hint. I'll need to think about how to
> >>>>>>> handle this better.
> >>>>>>>
> >>>>>>>> Like Guenter, I also don't like returning `-ENODEV`,
> >>>>>>>> but I don't have a
> >>>>>>>> problem with checking for `EC_FAN_SPEED_NOT_PRESENT`
> >>>>>>>> in case it was removed
> >>>>>>>> since init or something.
> >>>>>>>
> >>>>>
> >>>>> That won't happen. Chromebooks are not servers, where one might
> >>>>> be able to
> >>>>> replace a fan tray while the system is running.
> >>>>
> >>>> In one of my testruns this actually happened.
> >>>> When running on battery, one specific CPU sensor sporadically
> >>>> returned EC_FAN_SPEED_NOT_PRESENT.
> >>>>
> >>>
> >>> What Chromebook was that ? I can't see the code path in the EC source
> >>> that would get me there.
> >>>
> >>
> >> I believe Thomas and I both have the Framework 13 AMD, the source code is
> >> here:
> >> https://github.com/FrameworkComputer/EmbeddedController/tree/lotus-zephyr
> >
> > Correct.
> >
> >> The organisation confuses me a little, but Dustin has previously said on the
> >> framework forums (https://community.frame.work/t/what-ec-is-used/38574/2):
> >>
> >> "This one is based on the Zephyr port of the ChromeOS EC, and tracks
> >> mainline more closely. It is in the branch lotus-zephyr.
> >> All of the model-specific code lives in zephyr/program/lotus.
> >> The 13"-specific code lives in a few subdirectories off the main tree named
> >> azalea."
> >
> > The EC code is at [0]:
> >
> > $ ectool version
> > RO version: azalea_v3.4.113353-ec:b4c1fb,os
> > RW version: azalea_v3.4.113353-ec:b4c1fb,os
> > Firmware copy: RO
> > Build info: azalea_v3.4.113353-ec:b4c1fb,os:7b88e1,cmsis:4aa3ff 2024-03-26 07:10:22 lotus@ip-172-26-3-226
> > Tool version: 0.0.1-isolate May 6 2024 none
>
> I can confirm mine is the same build too.
>
> > From the build info I gather it should be commit b4c1fb, which is the
> > current HEAD of the lotus-zephyr branch.
> > Lotus is the Framework 16 AMD, which is very similar to Azalea, the
> > Framework 13 AMD, which I tested this against.
> > Both share the same codebase.
> >
> >> Also I just unplugged my fan and you are definitely correct, the EC only
> >> generates EC_FAN_SPEED_NOT_PRESENT for fans it does not have the capability
> >> to support. Even after a reboot it just returns 0 RPM for an unplugged fan.
> >> I thought about simulating a stall too, but I was mildly scared I was going
> >> to break one of the tiny blades.
> >
> > I get the error when unplugging *the charger*.
> >
> > To be more precise:
> >
> > It does not happen always.
> > It does not happen instantly on unplugging.
> > It goes away after a few seconds/minutes.
> > During the issue, one specific sensor reads 0xffff.
> >
>
> Oh I see, I haven't played around with the temp sensors until now, but I
> can confirm the last temp sensor (cpu@4c / temp4) will randomly (every
> ~2-15 seconds) return EC_TEMP_SENSOR_ERROR (0xfe).
> Unplugging the charger doesn't seem to have any impact for me.
> The related ACPI sensor also says 180.8°C.
> I'll probably create an issue or something shortly.
>
> I was mildly confused by 'CPU sensors' and 'EC_FAN_SPEED_NOT_PRESENT' in
> the same sentence, but I'm now assuming you mean the temp sensor?
>

Same here. It might not matter as much if the values were the same,
but EC_FAN_SPEED_NOT_PRESENT == 0xffff and
EC_TEMP_SENSOR_NOT_PRESENT == 0xff, so they must not be confused with
each other. EC_TEMP_SENSOR_NOT_PRESENT should be static as well,
though, and not be returned randomly.
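
A minimal sketch of why the two sentinels must be kept apart, assuming
16-bit fan readings and 8-bit temperature readings. The helper names are
hypothetical and not taken from the patch; the STALLED value is assumed
from ec_commands.h, the others are the ones quoted in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* NOT_PRESENT and ERROR values as quoted in this thread;
 * STALLED is an assumption based on ec_commands.h. */
#define EC_FAN_SPEED_NOT_PRESENT   0xffff
#define EC_FAN_SPEED_STALLED       0xfffe
#define EC_TEMP_SENSOR_NOT_PRESENT 0xff
#define EC_TEMP_SENSOR_ERROR       0xfe

/* Hypothetical helper: fan readings are 16 bits wide. */
static bool fan_is_not_present(uint16_t raw)
{
	return raw == EC_FAN_SPEED_NOT_PRESENT;
}

/* Hypothetical helper: temperature readings are 8 bits wide. */
static bool temp_is_not_present(uint8_t raw)
{
	return raw == EC_TEMP_SENSOR_NOT_PRESENT;
}
```

A fan reading of 0x00ff is a plausible RPM value, while the same byte
from a temperature sensor means "not present", so each check has to be
applied only to its own sensor type.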

Guenter

> >>>>>>> Ok.
> >>>>>>>
> >>>>>>>> My approach was to return the speed as `0`, since
> >>>>>>>> the fan probably isn't
> >>>>>>>> spinning, but set HWMON_F_FAULT for `EC_FAN_SPEED_NOT_PRESENT` and
> >>>>>>>> HWMON_F_ALARM for `EC_FAN_SPEED_STALLED`.
> >>>>>>>> No idea if this is correct though.
> >>>>>>>
> >>>>>>> I'm not a fan of returning a speed of 0 in case of errors.
> >>>>>>> Rather -EIO which can't be mistaken.
> >>>>>>> Maybe -EIO for both EC_FAN_SPEED_NOT_PRESENT (which
> >>>>>>> should never happen)
> >>>>>>> and also for EC_FAN_SPEED_STALLED.
> >>>>>>
> >>>>>> Yeah, that's pretty reasonable.
> >>>>>>
> >>>>>
> >>>>> -EIO is an i/o error. I have trouble reconciling that with
> >>>>> EC_FAN_SPEED_NOT_PRESENT or EC_FAN_SPEED_STALLED.
> >>>>>
> >>>>> Looking into the EC source code [1], I see:
> >>>>>
> >>>>> EC_FAN_SPEED_NOT_PRESENT means that the fan is not present.
> >>>>> That should return -ENODEV in the above code, but only for
> >>>>> the purpose of making the attribute invisible.
> >>>>>
> >>>>> EC_FAN_SPEED_STALLED means exactly that, i.e., that the fan
> >>>>> is present but not turning. The EC code does not expect that
> >>>>> to happen and generates a thermal event in case it does.
> >>>>> Given that, it does make sense to set the fault flag.
> >>>>> The actual fan speed value should then be reported as 0 or
> >>>>> possibly -ENODATA. It should _not_ generate any other error
> >>>>> because that would trip up the "sensors" command for no
> >>>>> good reason.
> >>>>
> >>>> Ack.
> >>>>
> >>>> Currently I have the following logic (for both fans and temp):
> >>>>
> >>>> if NOT_PRESENT during probing:
> >>>>     make the attribute invisible.
> >>>>
> >>>> if any error during runtime (including NOT_PRESENT):
> >>>>     return -ENODATA and a FAULT
> >>>>
> >>>> This should also handle the sporadic NOT_PRESENT failures.
> >>>>
> >>>> What do you think?
> >>>>
> >>>> Is there any other feedback to this revision or should I send the next?
> >>>>
> >>>
> >>> No, except I'd really like to know which Chromebook randomly generates
> >>> an EC_FAN_SPEED_NOT_PRESENT response because that really looks like a bug.
> >>> Also, can you reproduce the problem with the ectool command ?
> >
> > Yes, the ectool command reports the same issue at the same time.
> >
> > The sensor affected was always cpu@4c, which is
> > compatible = "amd,sb-tsi".
> >
> >> I have a feeling it was related to the concurrency problems between ACPI and
> >> the CrOS code that are being fixed in another patch by Ben Walsh, I was also
> >> seeing some weird behaviour sometimes but I *believe* it was fixed by that.
> >
> > I don't think it's this issue.
> > Ben's series at [1], is for MEC ECs which are the older Intel
> > Frameworks, not the Framework 13 AMD.
>
> Yeah sorry, I saw it mentioned AMD and threw it into my kernel, I also
> thought it stopped the 'packet too long' messages (for
> EC_CMD_CONSOLE_SNAPSHOT) but it did not.
>
> Thanks,
> Steve
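
For reference, the probe-time / runtime split Thomas describes above
(hide the attribute if NOT_PRESENT at probe, otherwise return -ENODATA
plus a fault on any error) could be sketched in plain C roughly as
follows. This is only an illustration under stated assumptions: the
function names are hypothetical, the STALLED value is assumed from
ec_commands.h, and the real driver would go through the hwmon
is_visible/read callbacks rather than these standalone helpers.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define EC_FAN_SPEED_NOT_PRESENT 0xffff
#define EC_FAN_SPEED_STALLED     0xfffe /* assumed from ec_commands.h */

/* Probe time: hide the attribute if the EC reports the fan as absent. */
static bool fan_attr_visible(uint16_t raw_at_probe)
{
	return raw_at_probe != EC_FAN_SPEED_NOT_PRESENT;
}

/*
 * Runtime: any sentinel value sets the fault flag and yields -ENODATA.
 * On success the RPM is stored in *rpm and 0 is returned.
 */
static int fan_read(uint16_t raw, long *rpm, bool *fault)
{
	*fault = raw == EC_FAN_SPEED_NOT_PRESENT ||
		 raw == EC_FAN_SPEED_STALLED;
	if (*fault)
		return -ENODATA;

	*rpm = raw;
	return 0;
}
```

With this split, a sporadic NOT_PRESENT at runtime surfaces as a fault
with no data instead of making the attribute disappear or reporting a
misleading speed.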