Re: [RFC] ARM: dts: omap36xx: Enable thermal throttling

From: H. Nikolaus Schaller
Date: Fri Sep 13 2019 - 10:24:40 EST



> On 13.09.2019 at 16:05, Adam Ford <aford173@xxxxxxxxx> wrote:
>
> On Fri, Sep 13, 2019 at 8:32 AM H. Nikolaus Schaller <hns@xxxxxxxxxxxxx> wrote:
>>
>> Hi Adam,
>>
>>> On 13.09.2019 at 13:07, Adam Ford <aford173@xxxxxxxxx> wrote:
>>
>>>>> + cpu_cooling_maps: cooling-maps {
>>>>> + map0 {
>>>>> + trip = <&cpu_alert0>;
>>>>> + /* Only allow OPP50 and OPP100 */
>>>>> + cooling-device = <&cpu 0 1>;
>>>>
>>>> omap4-cpu-thermal.dtsi uses THERMAL_NO_LIMIT constants but I do not
>>>> understand their meaning (and how it relates to the opp list).
>>>
>>> I read through the documentation, but it wasn't completely clear to
>>> me. AFAICT, the numbers after &cpu represent the min and max index in
>>> the OPP table when the condition is hit.
>>
>> Ok. It seems to use "cooling state" for those and the first is minimum
>> and the last is maximum. Using THERMAL_NO_LIMIT (-1UL) means to have
>> no limits.
>>
>> Since here we use the &cpu node it is likely that the "cooling state"
>> is the same as the OPP index currently in use.
>>
>> I have looked through the .dts files which use cpu_crit and the picture is
>> not uniform...
>>
>> omap4 seems to only define it
>> am57xx has two different grade dtsi files
>> dra7 overwrites critical temperature value
>> am57xx-beagle defines a gpio to control a fan
>
> Check out rk3288-veyron-mickey.dts
>
> They have almost_warm, warm, almost_hot, hot, hotter, very_hot, and
> critical for trips, and they have as many corresponding cooling maps
> which appear to limit the CPU speeds, but their index references are
> still confusing to me.

Seems to be quite sophisticated.

The arch/arm/boot/dts/exynos5422-odroidxu3-common.dtsi also has a lot
of trip points. So there may be very different needs...

But it has potentially helpful comments...

	/*
	 * When reaching cpu0_alert3, reduce CPU
	 * by 2 steps. On Exynos5422/5800 that would
	 * be: 1600 MHz and 1100 MHz.
	 */
	map3 {
		trip = <&cpu0_alert3>;
		cooling-device = <&cpu0 0 2>;
	};
	map4 {
		trip = <&cpu0_alert3>;
		cooling-device = <&cpu4 0 2>;
	};
	/*
	 * When reaching cpu0_alert4, reduce CPU
	 * further, down to 600 MHz (12 steps for big,
	 * 7 steps for LITTLE).
	 */
	map5 {
		trip = <&cpu0_alert4>;
		cooling-device = <&cpu0 3 7>;
	};
	map6 {
		trip = <&cpu0_alert4>;
		cooling-device = <&cpu4 3 12>;
	};

That would mean the second integer is something about how
many steps to reduce.

But the first is not explained.

BTW: this also demonstrates how a single trip point can map to multiple
cooling-device actions (something we likely do not need).
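
One possible reading, going by the generic thermal binding (thermal.txt): the
two cells are the minimum and maximum cooling state requested for that trip,
and for a cpufreq cooling device state 0 is the unthrottled (highest) OPP, with
each higher state one frequency step lower. The fragment below is only an
annotated sketch; the label names are placeholders and it is not taken from any
board file.

```dts
#include <dt-bindings/thermal/thermal.h>

/* Hypothetical fragment: &cpu and &cpu_alert0 are placeholder labels. */
cooling-maps {
	map0 {
		trip = <&cpu_alert0>;
		/*
		 * <phandle min-state max-state>
		 *
		 * Assumed semantics for a cpufreq cooling device:
		 *   state 0 = no throttling (highest OPP)
		 *   state N = N steps down in the frequency table
		 * THERMAL_NO_LIMIT leaves that bound to the cooling
		 * device itself.
		 */
		cooling-device = <&cpu THERMAL_NO_LIMIT THERMAL_NO_LIMIT>;
	};
};
```

Read that way, <&cpu0 0 2> would allow throttling by up to 2 steps and
<&cpu0 3 7> by at least 3 and at most 7 steps, which would fit the Exynos
comments above.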

>
> For that device,
> Warm and no limit first, then 4: cooling-device = <&cpu0 THERMAL_NO_LIMIT 4>
> ...
> very_hot uses a number then no limit: cooling-device = <&cpu0 8
> THERMAL_NO_LIMIT>
>
> This makes me wonder if the min and max are switched or the index
> values go backwards.

It may depend on the specific cpu driver? Maybe omap, rk and exynos even
interpret the values differently in code?

>>
>> Then we can use the data sheet limits of 90°C and 105°C in the trip point
>> table (which should not be tweaked for sensor inaccuracy).
>
> I can see not compensating if it reads high, but if the temp reads
> low, shouldn't we compensate so we don't over temp the processor?

I just mean that we must ensure that the TJ is <= 90° if the bandgap
ever reports 90°. So it may report 10 or 20 or even 30 degrees more than the
real temperature but never less (so that we reach the critical temperature
too early rather than too late).

We can achieve that by adding bias or changing slope etc. in the bandgap sensor
driver.

If I find some time I am curious enough to look into the code and the data
sheets to understand why it is said to be inaccurate... Maybe there is
jitter from some LDO and it needs a median filter?

And why it seems to add a bias of ca. 10° as soon as I read it more than
once. And how well the reported temperature correlates with ambient temperature
(it definitely correlates with cpufreq-set -f).

But we should not modify the trip temperatures by 10 or 20 or 30 degrees.
IMHO they should have the values defined by the data sheet.
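
With data sheet values, the trip table could look roughly like the sketch
below. Which limit becomes the passive trip and which the critical one, the
2°C hysteresis and the node names are all assumptions made for illustration;
temperatures are in millicelsius, as the thermal binding requires.

```dts
/* Hypothetical trip table; only the 90°C and 105°C figures come from the thread. */
cpu_trips: trips {
	cpu_alert0: cpu_alert {
		temperature = <90000>;	/* millicelsius, assumed passive limit */
		hysteresis = <2000>;	/* millicelsius, assumed value */
		type = "passive";
	};
	cpu_crit: cpu_crit {
		temperature = <105000>;	/* millicelsius, assumed critical limit */
		hysteresis = <2000>;	/* millicelsius, assumed value */
		type = "critical";
	};
};
```

Any correction for sensor inaccuracy would then live in the bandgap driver and
not in these numbers.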

BR,
Nikolaus