Re: [PATCH] thermal: sun8i: Be loud when probe fails

From: Maxime Ripard
Date: Wed Jul 08 2020 - 09:37:04 EST


On Wed, Jul 08, 2020 at 03:29:24PM +0200, Ondřej Jirman wrote:
> Hello Maxime,
>
> On Wed, Jul 08, 2020 at 02:25:42PM +0200, Maxime Ripard wrote:
> > Hi,
> >
> > On Wed, Jul 08, 2020 at 12:55:27PM +0200, Ondrej Jirman wrote:
> > > I noticed several mobile Linux distributions failing to enable
> > > thermal regulation correctly, because the kernel is silent
> > > when the thermal driver fails to probe. Add enough error reporting
> > > to debug issues and warn users in case the thermal sensor fails
> > > to probe.
> > >
> > > Failing to notify users means that the SoC can easily overheat under
> > > load.
> > >
> > > Signed-off-by: Ondrej Jirman <megous@xxxxxxxxxx>
> > > ---
> > > drivers/thermal/sun8i_thermal.c | 55 ++++++++++++++++++++++++++-------
> > > 1 file changed, 43 insertions(+), 12 deletions(-)
> > >
> > > diff --git a/drivers/thermal/sun8i_thermal.c b/drivers/thermal/sun8i_thermal.c
> > > index 74d73be16496..9065e79ae743 100644
> > > --- a/drivers/thermal/sun8i_thermal.c
> > > +++ b/drivers/thermal/sun8i_thermal.c
> > > @@ -287,8 +287,12 @@ static int sun8i_ths_calibrate(struct ths_device *tmdev)
> > >
> > > calcell = devm_nvmem_cell_get(dev, "calibration");
> > > if (IS_ERR(calcell)) {
> > > + dev_err(dev, "Failed to get calibration nvmem cell (%ld)\n",
> > > + PTR_ERR(calcell));
> > > +
> > > if (PTR_ERR(calcell) == -EPROBE_DEFER)
> > > return -EPROBE_DEFER;
> > > +
> >
> > The rest of the patch makes sense, but we should probably put the error
> > message after the EPROBE_DEFER return so that we don't print any extra
> > noise that isn't necessarily useful.
>
> I thought about that, but in this case it would have helped; see my other
> e-mail. Though the lack of a "probe success" message may be enough for me to
> debug the issue, I'm not sure users will notice that a message is missing,
> while they'll surely notice a flood of repeated EPROBE_DEFER messages.

Yeah, but on the other hand, we regularly have people who come and
ask whether a "legitimate" EPROBE_DEFER error message (as in: the driver
wasn't there on the first attempt but was there on the second) is a
cause for concern.
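
To make the ordering concrete, here is a minimal userspace sketch, not
the driver code. The helper name ths_report_calcell_error(), the dev_err()
stub and the logged_errors counter are made up for illustration, and the
err argument stands in for PTR_ERR(calcell). The deferral check comes
first, so a plain EPROBE_DEFER stays quiet and any other failure is
reported:

```c
#include <errno.h>
#include <stdio.h>

/* Userspace stand-in: EPROBE_DEFER is kernel-internal (value 517). */
#ifndef EPROBE_DEFER
#define EPROBE_DEFER 517
#endif

/* Counts emitted messages, only so the sketch can be checked. */
static int logged_errors;

/* Stub for the kernel's dev_err(); here "dev" is just a name string. */
#define dev_err(dev, fmt, ...) \
	do { \
		logged_errors++; \
		fprintf(stderr, "%s: " fmt, (dev), __VA_ARGS__); \
	} while (0)

/*
 * Hypothetical helper: returns err unchanged, only the logging differs.
 * What the caller does with a non-deferral error is outside this sketch.
 */
static int ths_report_calcell_error(const char *dev, long err)
{
	/* Deferral is expected when the nvmem provider probes later. */
	if (err == -EPROBE_DEFER)
		return -EPROBE_DEFER;

	dev_err(dev, "Failed to get calibration nvmem cell (%ld)\n", err);
	return (int)err;
}
```

With this ordering, a deferred probe prints nothing, while an error such
as -ENOENT prints exactly one line.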

> And people have been running several distros for 3-4 months without anyone
> noticing any issues or that thermal regulation doesn't work. So it seems
> that the lack of a success message is not enough.

I understand what the issue is, but do you really expect phone users to
monitor the kernel logs every time they boot their phone to see if the
thermal throttling is enabled?

If anything, it looks like a distro problem, and the notification /
policy to deal with that should be implemented in userspace.

> Another solution may be to select CONFIG_NVMEM_SUNXI_SID if this driver
> is enabled. That may get rid of this error scenario of waiting indefinitely
> for calibration data with EPROBE_DEFER. Other potential EPROBE_DEFER sources
> will probably be quite visible even without this driver telling the user,
> so this message may not be necessary in that case.

That would only partially solve your issue. If the nvmem driver doesn't
load for some reason, you would end up in a similar situation.

Maxime