Re: [PATCH] arm64: dts: rockchip: Prevent thermal runaways in RK3308 SoC dtsi

From: Heiko Stübner
Date: Fri Oct 11 2024 - 05:57:46 EST


On Friday, 11 October 2024, 11:04:38 CEST, Dragan Simic wrote:
> Hello Jonas,
>
> On 2024-10-11 10:52, Jonas Karlman wrote:
> > On 2024-10-10 12:19, Dragan Simic wrote:
> >> Until the TSADC, thermal zones, thermal trips and cooling maps are
> >> defined in the RK3308 SoC dtsi, none of the CPU OPPs except the
> >> slowest one may be enabled under any circumstances. Allowing the
> >> DVFS to scale the CPU cores up without even just the critical CPU
> >> thermal trip in place can rather easily result in thermal runaways
> >> and damaged SoCs, which is bad.
> >>
> >> Thus, leave only the lowest available CPU OPP enabled for now.
> >
> > This feels like a very aggressive limitation, to only allow the
> > opp-suspend rate, which is not even used under normal load.
> >
> > I let my Rock Pi S board with a RK3308B variant run "stress -c 8" for
> > around 10 hours and the reported temp only reached around 50-55 deg C,
> > with ambient temp around 20 deg C and the board lying flat on a table
> > without any enclosure or heat sink.
> >
> > This was running with performance as scaling_governor and cpu running
> > the 1008000 opp.
>
> Thanks for testing all that! That's very low CPU temperature under
> stress testing indeed. Maybe the cooling gets worse and the CPU
> temperature goes higher if the board is installed into some small
> enclosure with no natural or forced airflow?
>
> > Most RK3308 variant datasheets list 1.3 GHz as the max rate for the
> > CPU, the K-variant lists 1.2 GHz, and the -S-variants seem to have
> > both reduced voltage and max rate.
> >
> > The OPPs for this SoC already limit the max rate to 1 GHz, which is
> > more than likely good enough to not reach the max temperature of
> > 115-125 deg C as rated in datasheets and vendor DTs.
> >
> > Adding the tsadc and trips (same/similar as px30) will probably allow
> > us to add/use the "missing" 1.2 and 1.3 GHz OPPs.
>
> With these insights, I agree that the patch might have been a bit
> too extreme, but it also promotes good practices when it comes to
> upstreaming. The general rule is not to add CPU or GPU OPPs with
> no proper thermal configuration already in place.
>
> The patch has already been merged, and as I already noted, [1] I'll
> try to implement, test and submit the proper thermal configuration
> ASAP. It's up to Heiko to decide whether to drop this patch or not.

Hmm, interesting question ;-) .

Dropping the patch is of course still possible and so far we haven't
actually seen anyone with real-world problems.

And with Jonas' stress test, it does look like nobody will in the
(hopefully short) time till we have thermal management.

@Dragan, if you're in favor of that I'll drop the patch.


Heiko


>
> [1]
> https://lore.kernel.org/linux-rockchip/df92710498f66bcb4580cb2cd1573fb2@xxxxxxxxxxx/
>
> >> Fixes: 6913c45239fd ("arm64: dts: rockchip: Add core dts for RK3308 SOC")
> >> Cc: stable@xxxxxxxxxxxxxxx
> >> Signed-off-by: Dragan Simic <dsimic@xxxxxxxxxxx>
> >> ---
> >> arch/arm64/boot/dts/rockchip/rk3308.dtsi | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/arch/arm64/boot/dts/rockchip/rk3308.dtsi b/arch/arm64/boot/dts/rockchip/rk3308.dtsi
> >> index 31c25de2d689..a7698e1f6b9e 100644
> >> --- a/arch/arm64/boot/dts/rockchip/rk3308.dtsi
> >> +++ b/arch/arm64/boot/dts/rockchip/rk3308.dtsi
> >> @@ -120,16 +120,19 @@ opp-600000000 {
> >>  			opp-hz = /bits/ 64 <600000000>;
> >>  			opp-microvolt = <950000 950000 1340000>;
> >>  			clock-latency-ns = <40000>;
> >> +			status = "disabled";
> >>  		};
> >>  		opp-816000000 {
> >>  			opp-hz = /bits/ 64 <816000000>;
> >>  			opp-microvolt = <1025000 1025000 1340000>;
> >>  			clock-latency-ns = <40000>;
> >> +			status = "disabled";
> >>  		};
> >>  		opp-1008000000 {
> >>  			opp-hz = /bits/ 64 <1008000000>;
> >>  			opp-microvolt = <1125000 1125000 1340000>;
> >>  			clock-latency-ns = <40000>;
> >> +			status = "disabled";
> >>  		};
> >>  	};
>