Re: [PATCH v1 1/6] thermal: sysfs: Trigger zone temperature updates on sysfs reads

From: Lukasz Luba
Date: Mon May 13 2024 - 03:11:28 EST


Hi Rafael,

On 5/10/24 15:13, Rafael J. Wysocki wrote:
From: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>

Reading the zone temperature via sysfs causes the driver callback to
be invoked, but it does not cause the thermal zone object to be updated.

This is problematic if the zone temperature read via sysfs differs from
the temperature value stored in the thermal zone object as it may cause
the kernel and user space to act against each other in some cases.

For this reason, make temp_show() trigger a zone temperature update if
the temperature returned by thermal_zone_get_temp() is different from
the temperature value stored in the thermal zone object.

Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@xxxxxxxxx>
---
drivers/thermal/thermal_core.c | 2 +-
drivers/thermal/thermal_sysfs.c | 3 +++
2 files changed, 4 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/thermal/thermal_sysfs.c
===================================================================
--- linux-pm.orig/drivers/thermal/thermal_sysfs.c
+++ linux-pm/drivers/thermal/thermal_sysfs.c
@@ -42,6 +42,9 @@ temp_show(struct device *dev, struct dev
if (ret)
return ret;
+ if (temperature != READ_ONCE(tz->temperature))
+ thermal_zone_device_update(tz, THERMAL_EVENT_TEMP_SAMPLE);

That's a bit problematic because it will trigger
governor->manage()

In case of IPA governor we relay on constant polling
period. We estimate the past power usage and current
thermal budget, to derive the next period power budget
for devices. I don't know if the internal PID algorithm
will be resilient enough to compensate this asynchronous
trigger caused from user-space.

We choose the period to be at least 1 frame (e.g. ~16ms)
to have good avg usage of CPUs and GPU. TBH I don't know
what would happen if someone reads the temp after e.g. 1ms
of last IPA trigger, but some devices (e.g. GPU) wasn't
loaded in that last 1ms delta...
I'm a bit more relaxed about CPUs because we use utilization
signal from runqueues (like the TEO util gov). That's a moving
avg signal which should keep some history, like low-pass
filter, so information is more resilient in that case.

Could we think about that IPA constant period usage?
I think I understand the proposal of your patch.
We might add a filter inside IPA to ignore such async
triggers in the .manage() callback.
What do you think?

Regards,
Lukasz