Re: [PATCH V2 09/10] thermal: da9062/61: Thermal junction temperature monitoring driver

From: Eduardo Valentin
Date: Wed Nov 30 2016 - 01:10:16 EST


Hey Steve,

On Tue, Nov 29, 2016 at 11:11:59AM +0000, Steve Twiss wrote:
> Hi Eduardo,
>
> Thanks for your response.
>
> On 29 November 2016 01:24, Eduardo Valentin, wrote:
>
> > On Wed, Oct 26, 2016 at 05:56:39PM +0100, Steve Twiss wrote:
> > > +config DA9062_THERMAL
> > > + tristate "DA9062/DA9061 Dialog Semiconductor thermal driver"
> > > + depends on MFD_DA9062
> > > + depends on OF
> > > + help
> > > + Enable this for the Dialog Semiconductor thermal sensor driver.
> > > + This will report PMIC junction over-temperature for one thermal trip
> > > + zone.
> > > + Compatible with the DA9062 and DA9061 PMICs.
> >
> > Is there any hardware documentation available for this chip that can be
> > pointed out?
>
> As part of this patch set, I added a link to the the datasheet into the top-level MFD
> component of the device tree binding: https://patchwork.kernel.org/patch/9426791/
> You can get all the information for DA9062 and DA9061 from the patch update in that
> link.

Thanks for getting me up to speed.

>
> [...]

<cut>

> > Does this mean that the chip temperature is above or equal to 125C, is
> > this really a safe temperature to keep it running?
>
> There is more information in the datasheet under the section titles "Junction
> temperature supervision". The value of 125 degC comes from the "warning
> temperature threshold" and the event is triggered when "the junction temperature
> rises above the first threshold (TEMP_WARN), the event E_TEMP is asserted".
>
> This suggests strictly greater than 125. However, there is no way for the chip to
> know the exact temperature. The mechanism is interrupt based and triggering
> only happens when the temperature rises above the threshold level.

Understood. My original concern was if the driver would be too nice with
the event. But given that the hardware has its own shutdown protection,
looks like we are OK here.

>
> [...]
> > > +static irqreturn_t da9062_thermal_irq_handler(int irq, void *data)
> > > +{
> > > + struct da9062_thermal *thermal = data;
> > > +
> > > + disable_irq_nosync(thermal->irq);
> > > + schedule_delayed_work(&thermal->work, 0);
> >
> > Can you please elaborate a little on why you have an one shot threaded IRQ
> > handler that is only disabling the IRQ then, scheduling a work with delay of 0?
> >
> > To my understanding, this is exactly what you get when you run your
> > threaded IRQ handler, when you configure it as one shot.
> >
> > Have you tried simply handling the same work done in your workqueue
> > handler in your threaded IRQ? That should simplify your code and get the
> > same result you are expecting.
> >
> > If you are not getting the IRQ disabled on the threaded IRQ when
> > configured as one shot, something else seams to be broken.
>
> Over-temperature triggering is event based: an interrupt happens when the
> temperature rises above 125 degC. However, this event based system changes into
> a polling operation to detect when the temperature falls below the threshold
> level again. This asymmetry in the chip's behaviour is the reasoning behind
> why I am not using the thermal core's built-in polling functionality.
>
> When over-temperature is reached, the interrupt from the DA9061/2 will be
> repeatedly triggered. The IRQ is disabled when the first over-temperature event
> happens and the status bit is polled using the work-queue until it becomes false.
> After that, the IRQ is re-enabled again so a new critical temperature event can
> be detected.
>
> After the interrupt has happened, event bit for the interrupt switches into a status
> bit and is used for polling and detecting when the temperature drops below the
> threshold.

OK. got it. Then, I am assuming your strategy here is to keep periodically issuing uevents
(HOT trip point) to userland, hoping that something would change the
system power consumption, then, relying on the hardware shutdown protection
if nothing happens.

I would say, your above explanation, and the uevent based strategy,
deserves to be at least in the commit message, preferably in the driver
documentation, so people know what to expect from the driver.

Now, if my understand is correct, would it make sense to have still a
failsafe mechanism in the driver? Maybe having a max number of polling?

The data sheet does not mention anything, but does one have any silicon
lifetime implication if one leaves the PMIC running for very long time
in a temperature between Twarn and Tcrit?

>
> https://lkml.org/lkml/2016/10/20/372
> https://lkml.org/lkml/2016/10/20/433
>
> [...]
> > > + thermal->zone = thermal_zone_device_register(thermal->config->name,
> > > + 1, 0, thermal,
> > > + &da9062_thermal_ops, NULL, 0,
> > > + 0);
> >
> > Did you try using of-thermal?
> >
> > So you would avoid having specific DT properties for something that
> > there is already a defined property?
>
> In an earlier RFC, I examined a work-around by hijacking and toggling the
> thermal core's built-in polling function when I needed to poll the temperature
> through get_temp(). However I thought this was quite dangerous, since it would
> not be using a formal thermal core interface.
>
> https://patchwork.kernel.org/patch/9387241/
>

my point was more under the lines of having this driver using the
of-thermal DT support to get the polling value, instead of you having a
manufacturer specific property:
Documentation/devicetree/bindings/thermal/thermal.txt

But given that your case is more specific, I start to see why you
avoided using it. But still, you could probably get the polling
values from of-thermal, populated in the tz struct, then using them from
the tz, when handling the IRQ event.

Probably your regular polling should always be set to 0, and the passive
polling to the value you want.

then again, this comment is more from a DT perspective than from the
driver code itself. Just trying to avoid specific properties that may
get described by what is already defined.

> [...]
> > > + ret = request_threaded_irq(thermal->irq, NULL,
> > > + da9062_thermal_irq_handler,
> > > + IRQF_TRIGGER_LOW | IRQF_ONESHOT,
> > > + "THERMAL", thermal);
> >
> > How about using the devm_ version?
>
> I wanted to explicitly free_irq() before calling thermal_zone_device_unregister()
> in the remove function.
>

Ok

> Regards,
> Steve