Re: [PATCH v1 03/10] ARM: tegra: acer-a500: Bump thermal trips by 10C
From: Dmitry Osipenko
Date: Sat Jun 12 2021 - 20:26:54 EST
12.06.2021 17:24, Daniel Lezcano wrote:
> On 12/06/2021 12:40, Dmitry Osipenko wrote:
>> 11.06.2021 12:52, Daniel Lezcano wrote:
>>> On 14/05/2021 23:16, Michał Mirosław wrote:
>>>> On Mon, May 10, 2021 at 11:25:53PM +0300, Dmitry Osipenko wrote:
>>>>> It's possible to hit the temperature of the thermal zone in a very warm
>>>>> environment under a constant load, like watching a video using software
>>>>> decoding. It's even easier to hit the limit with a slightly overclocked
>>>>> CPU. Bump the temperature limit by 10C in order to improve user
>>>>> experience. Acer A500 has a large board and 10" display panel which are
>>>>> used for the heat dissipation, the SoC is placed far away from battery,
>>>>> hence we can safely bump the temperature limit.
>>>>
>>>> 60^C looks like a touch-safety limit (to avoid burns for users). Did you
>>>> verify the touchable parts' temperature somehow after the change?
>>>
>>> The skin temperature and the CPU/GPU etc ... temperatures are different
>>> things.
>>>
>>> For the embedded system there is the dissipation system and a
>>> temperature sensor on it which is the skin temp. This temperature is the
>>> result of the heat of all the thermal zones on the board and must be
>>> below 45°C. The temperature slowly changes.
>>>
>>> On the CPU, the temperature changes can be very fast and you have to
>>> take care of keeping it below the max temperature specified in the TRM
>>> by using different techniques (freq changes, idle injection, ...) but
>>> the temperature can be 75°C, 85°C or whatever the manual says.
>>>
>>> 50°C and 60°C are low temperatures for a CPU and will inevitably
>>> hurt performance, so setting the trip temperature close to the max
>>> temperature is what allows maximum performance.
>>>
>>> What matters is the skin temperature.
>>>
>>> The skin temperature must be monitored by other techniques, e.g. using
>>> the TDP of the system and throttling the different devices to keep them
>>> within this power budget. That is the role of a thermal daemon.
>>
>> Thank you for the clarification. Indeed, I wasn't sure how to make use
>> of the skin temperature properly.
>>
>> The skin temperature varies a lot depending on the thermal capabilities
>> of a particular device. It's about 15C below CPU core at a full load on
>> A500, while it's 2C below CPU core on Nexus 7. But this is expected
>> since Nexus 7 can't dissipate heat efficiently.
> Yeah, but it can not be directly related to the CPU because if the GPU
> is intensively used and the battery is charging at the same time, the
> skin temp will increase anyway.

Sure, we just added the memory devfreq throttling as a cooling device to
the Nexus 7 and Ouya DTs, in addition to the CPU throttling.

The GPU and other h/w units are on the pending list. To start with, we
need to add GENPD and runtime PM support to all drivers, which solves
the overheating problem of idling systems. We have the Tegra30 Ouya game
console, which gets hot while idle without runtime PM support.
Afterwards we can add devfreq support to improve the active cooling.
I'm already working on it.

> You should set the trip points close to the functioning boundary
> temperature given in the hardware specification whatever the resulting
> heating effect is on the device.
>
> The thermal zone is there to protect the silicon and the system from a
> wild reboot.
>
> If the Nexus 7 is too hot after the changes, then you may act on the
> sources of the heat. For instance, set the highest OPP to turbo or
> remove it, or, if there is one, change the thermal daemon to reduce the
> overall power consumption.
> In case you are interested: https://lwn.net/Articles/839318/

The DTPM is a very interesting approach. For now Tegra still lacks some
basics in the mainline kernel which have a higher priority, so I think
in-kernel thermal management should be good enough to start with. We may
consider more complex solutions later on if necessary.

What I'm currently thinking of doing is:

1. Set up the trips of the SoC/CPU core thermal zones in accordance with
   the silicon limits.
2. Set up the skin trips in accordance with the device limits.

Breached skin trips will cause mild throttling, while the SoC/CPU trips
will be allowed to cause severe throttling. Does this sound good to you?
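
To illustrate the two-tier idea, here is a minimal Python sketch of the
policy only, not kernel or DT code. The trip temperatures and cooling
states are assumptions picked for illustration (45°C skin and 75°C/85°C
CPU echo the figures mentioned earlier in the thread, not values from any
board's DT), and the names Trip, TRIPS and cooling_state are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Trip:
    zone: str           # thermal zone name: "skin" or "cpu"
    temp_c: float       # trip temperature in degrees Celsius
    cooling_state: int  # throttle level requested once breached


# Hypothetical trip table: the skin trip asks for mild throttling,
# the CPU trips (near the silicon limit) ask for severe throttling.
TRIPS = [
    Trip("skin", 45.0, 1),
    Trip("cpu", 75.0, 2),
    Trip("cpu", 85.0, 4),
]


def cooling_state(temps: dict[str, float]) -> int:
    """Return the highest cooling state requested by any breached trip.

    Zones missing from `temps` are treated as not breached; 0 means
    no throttling at all.
    """
    return max(
        (
            t.cooling_state
            for t in TRIPS
            if temps.get(t.zone, float("-inf")) >= t.temp_c
        ),
        default=0,
    )
```

With these assumed numbers, a warm case with a cool CPU only gets the mild
state, while a CPU near its limit gets the severe state regardless of the
skin reading.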
> Hope that helps
Helps a lot, thank you very much.