Re: [PATCH v5 6/6] arm64: dts: qcom: Enable cpu cooling devices for QCS9075 platforms
From: Dmitry Baryshkov
Date: Mon Jan 13 2025 - 03:44:29 EST
On Fri, Jan 10, 2025 at 01:31:00PM +0100, Konrad Dybcio wrote:
> On 10.01.2025 12:54 AM, Dmitry Baryshkov wrote:
> > On Wed, Jan 08, 2025 at 09:38:19PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>
> >> Hi Dmitry,
> >>
> >>
> >> On 1/8/2025 6:16 PM, Dmitry Baryshkov wrote:
> >>> On Wed, Jan 08, 2025 at 05:57:06PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>>> Hi Dmitry,
> >>>>
> >>>>
> >>>> On 1/3/2025 11:21 AM, Dmitry Baryshkov wrote:
> >>>>> On Tue, Dec 31, 2024 at 05:31:41PM +0530, Manaf Meethalavalappu Pallikunhi wrote:
> >>>>>> Hi Dmitry,
> >>>>>>
> >>>>>> On 12/30/2024 9:10 PM, Dmitry Baryshkov wrote:
> >>>>>>> On Sun, Dec 29, 2024 at 08:53:32PM +0530, Wasim Nazir wrote:
> >>>>>>>> From: Manaf Meethalavalappu Pallikunhi <quic_manafm@xxxxxxxxxxx>
> >>>>>>>>
> >>>>>>>> In QCS9100 SoC, the safety subsystem monitors all thermal sensors and
> >>>>>>>> does corrective action for each subsystem based on sensor violation
> >>>>>>>> to comply with safety standards. But as QCS9075 is a non-safe SoC it
> >>>>>>>> requires conventional thermal mitigation to control thermal for
> >>>>>>>> different subsystems.
> >>>>>>>>
> >>>>>>>> The cpu frequency throttling for different cpu tsens is enabled in
> >>>>>>>> hardware as first defense for cpu thermal control. But QCS9075 SoC
> >>>>>>>> has higher ambient specification. During high ambient condition, even
> >>>>>>>> lowest frequency with multi cores can slowly build heat over the time
> >>>>>>>> and it can lead to thermal run-away situations. Restricting
> >>>>>>>> cpu cores in this scenario helps further thermal control and
> >>>>>>>> avoids thermal critical violation.
> >>>>>>>>
> >>>>>>>> Add cpu idle injection cooling bindings for cpu tsens thermal zones
> >>>>>>>> as a mitigation for cpu subsystem prior to thermal shutdown.
> >>>>>>>>
> >>>>>>>> Add cpu frequency cooling devices that will be used by userspace
> >>>>>>>> thermal governor to mitigate skin thermal management.
> >>>>>>> Does anything prevent us from having this config as a part of the basic
> >>>>>>> sa8775p.dtsi setup? If HW is present in the base version but it is not
> >>>>>>> accessible for whatever reason, please move it to the base device config
> >>>>>>> and set status "disabled" or "reserved" in the respective board files.
> >>>>>> Sure, I will move idle injection node for each cpu to sa8775p.dtsi and keep
> >>>>>> it disabled state. #cooling cells property for CPU, still wanted to keep it
> >>>>>> in board files as we don't want to enable any cooling device in base DT.
> >>>>> "we don't want" is not a proper justification. So, no.
> >>>> As noted in the commit, thermal cooling mitigation is only necessary for
> >>>> non-safe SoCs. Adding this cooling cell property to the CPU node in the base
> >>>> DT (sa8775p.dtsi), which is shared by both safe and non-safe SoCs, would
> >>>> violate the requirements for safe SoCs. Therefore, we will include it only
> >>>> in non-safe SoC boards.
> >>> "is only necessary" is fine. It means that it is an optional part which
> >>> is going to be unused / ignored / duplicate functionality on the "safe"
> >>> SoCs. What kind of requirement is going to be violated in this way?
> >>
> >> From the perspective of a safe SoC, any software mitigation that compromises
> >> the safety subsystem’s compliance should not be allowed. Enabling the
> >> cooling device also opens up the sysfs interface for userspace, which we may
> >> not fully control.
> >
> > There are a lot of interfaces exported to the userspace.
> >
> >> Userspace apps or partner apps might inadvertently use
> >> it. Therefore, we believe it is better not to expose such an interface, as
> >> it is not required for that SoC and helps to avoid opening up an interface
> >> that could potentially lead to a safety failure.
> >
> > How can thermal mitigation interface lead to safety failure? Userspace
> > can possibly lower trip points, but it can not override existing
> > firmware-based mitigation.
> > And if there is a known problem with the interface, it should be fixed
> > instead.
>
> I think the intended case to avoid is where a malicious actor would set
> the trips too low, resulting in throttling down the CPU to FMIN / Linux
> throttling CPUs to try and escape what it believes to be possible thermal
> runaway / a system reboot. Not something desired in a car.
Being able to set trip points via sysfs means that the system is already
compromised. At this point the actor can do whatever they want, e.g.
display a malicious HUD or just a green bar or a black screen, scream
through the speakers, etc. That doesn't sound like the temperature trip
points being the only or the major problem of a car.
Anyway, if that's really the only problem, please use the framework to
make the temperature and hysteresis of the trip point R/O for sa8775p /
qcs9100. Other attributes might need to be made R/O too. It might well
be that I'm missing one of the automotive peculiarities here. In such a
case the commit message should be more explicit that it's AGL or some
other requirement and provide a link.
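
To check from userspace whether the trip point attributes really ended up
R/O, something like the sketch below could be used. It is only an
illustration: it assumes the usual sysfs layout
(/sys/class/thermal/thermal_zone*/trip_point_*_temp and
trip_point_*_hyst), and writable_trip_attrs() is a hypothetical helper
name, not an existing tool or kernel interface.

```python
import glob
import os
import stat


def writable_trip_attrs(root="/sys/class/thermal"):
    """Return trip point temp/hyst attribute files that have any write bit set.

    'root' defaults to the usual thermal sysfs directory; it is a parameter
    so that the check can also be pointed at a test directory.
    """
    writable = []
    pattern = os.path.join(root, "thermal_zone*", "trip_point_*_*")
    for path in sorted(glob.glob(pattern)):
        # Only temperature and hysteresis are of interest here; skip e.g. _type.
        if not path.endswith(("_temp", "_hyst")):
            continue
        mode = os.stat(path).st_mode
        if mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH):
            writable.append(path)
    return writable


if __name__ == "__main__":
    # Print every trip point attribute that userspace could still write to.
    for attr in writable_trip_attrs():
        print(attr)
```

An empty output on sa8775p / qcs9100 would mean that neither temperature
nor hysteresis can be changed through sysfs. It looks only at the file
mode bits, so it says nothing about other attributes or other interfaces.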
--
With best wishes
Dmitry