Re: [REGRESSION] 774ac8b7eff6 ("Thermal: initialize thermal zone device correctly") causes performance drop

From: Laura Abbott
Date: Fri Mar 18 2016 - 18:28:59 EST


(bringing this back to the main thread)

On 03/16/2016 05:20 PM, Pandruvada, Srinivas wrote:
On Wed, 2016-03-16 at 17:00 -0700, Laura Abbott wrote:
On 03/16/2016 03:46 PM, Greg Kroah-Hartman wrote:
On Wed, Mar 16, 2016 at 03:27:57PM -0700, Laura Abbott wrote:
Hi,

Fedora received a bug report (https://bugzilla.redhat.com/show_bu
g.cgi?id=1317190)
of a major performance drop on various bench marks and general
system
sluggishness with the 4.4.4 kernel update. The benchmarks were
showing
a reduction to about 18% performance (not minor).

Bisection showed the first bad commit was

commit 774ac8b7eff69e0786970157de2157e68b22f456
Author: Zhang Rui <rui.zhang@xxxxxxxxx>
Date: Fri Oct 30 16:31:47 2015 +0800

Thermal: initialize thermal zone device correctly
commit bb431ba26c5cd0a17c941ca6c3a195a3a6d5d461 upstream.
After thermal zone device registered, as we have not read
any
temperature before, thus tz->temperature should not be 0,
which actually means 0C, and thermal trend is not available.
In this case, we need specially handling for the first
thermal_zone_device_update().
Both thermal core framework and step_wise governor is
enhanced to handle this. And since the step_wise governor
is the only one that uses trends, so it's the only thermal
governor that needs to be updated.
Tested-by: Manuel Krause <manuelkrause@xxxxxxxxxxxx>
Tested-by: szegad <szegadlo@xxxxxxxxxxxxxx>
Tested-by: prash <prash.n.rao@xxxxxxxxx>
Tested-by: amish <ammdispose-arch@xxxxxxxxx>
Tested-by: Matthias <morpheusxyz123@xxxxxxxx>
Reviewed-by: Javi Merino <javi.merino@xxxxxxx>
Signed-off-by: Zhang Rui <rui.zhang@xxxxxxxxx>
Signed-off-by: Chen Yu <yu.c.chen@xxxxxxxxx>
Signed-off-by: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxx
g>



Reverting this plus to other commits in the series (a67208e94d94
"Thermal: handle thermal zone device properly during system
sleep"
and 27f356149d59 "Thermal: do thermal zone update after a cooling
device registered") confirmed the performance was back to normal.

Bugzilla has the full discussion but this comment from one of the
reporters sums it up:

"In 4.4.3 and prior, my 2.40 MHz processor would fluctuate
between
1000 and 3400 MHz. In 4.4.4, the processor would fluctuate
between
400 and 700 MHz, according to /proc/cpuinfo.

Setting /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
to
performance, instead of the default "powersave" forces the CPU to
2400 MHz, and improves performance greatly, but still not to the
same level as in 4.4.3."

Any ideas?

Is this same "slowdown" also seen in 4.5?

thanks,

greg k-h


Yes, the same issue is seen on 4.5 according to the reporter.
What does it show here when performance drops?
grep . /sys/devices/system/cpu/intel_pstate/*

Is the problem still occurs if you set
/sys/class/thermal/thermal_zone*/mode to "disabled"

Thanks,
Srinivas


A separate thread was started which gave this insight:

"I think
the problem is your device has a passive trip temp of 0
/sys/class/thermal/thermal_zone0/trip_point_2_temp:0
/sys/class/thermal/thermal_zone0/trip_point_2_type:passive

Which triggers a false throttle = true. I think we should this trip as
invalid in the case of
if (tz->temperature >= trip_temp) {} check
in thermal_zone_trip_update()."

So would something like the following work?

diff --git a/drivers/thermal/step_wise.c b/drivers/thermal/step_wise.c
index ea9366a..1228797 100644
--- a/drivers/thermal/step_wise.c
+++ b/drivers/thermal/step_wise.c
@@ -143,7 +143,7 @@ static void thermal_zone_trip_update(struct thermal_zone_device *tz, int trip)
trend = get_tz_trend(tz, trip);
- if (tz->temperature >= trip_temp) {
+ if (trip_type != THERMAL_TRIPS_NONE && tz->temperature >= trip_temp) {
throttle = true;
trace_thermal_zone_trip(tz, trip, trip_type);
}

(completely untested, no idea if I'm even close)

Thanks,
Laura