Re: Regression in 4.8 - CPU speed set very low
From: Rafael J. Wysocki
Date: Thu Sep 29 2016 - 08:13:41 EST
On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> > On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger
> > <Larry.Finger@xxxxxxxxxxxx> wrote:
> >> On 09/26/2016 10:12 PM, Doug Smythies wrote:
> >>>
> >>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> >>>>
> >>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> >>>>>
> >>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> >>>>>>
> >>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> >>>>>> But for both we need a reproducer anyway.
> >>>>>
> >>>>> I do not have a reliable reproducer. The condition has always happened
> >>>>> when running a high-compute job such as a 'make -j8' on the kernel, or
> >>>>> building the RPM for openSUSE's implementation of VirtualBox. The latter
> >>>>> is what I'm using for most of my testing.
> >>>
> >>>
> >>> Run some CPU stressor and get all your CPUs going at 100% load.
> >>> And watch your core temperatures while you do so.
> >>
> >>
> >> for i in 1 2 3 4; do while : ; do : ; done & done
> >>
> >> triggered the fault in a few minutes.
> >>>
> >>>
> >>>>
> >>>>>> It would also be good to rule out thermal throttling (as per
> >>>>>> Srinivas' comments).
> >>>
> >>>
> >>> It is almost certainly thermal throttling, or something similar, causing
> >>> clock modulation of, it seems, 50%.
> >>
> >>
> >> While the infinite loops were running, the temps were:
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-0000
> >> Adapter: ISA adapter
> >> Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)
> >
> > It looks like the trip point (high) temperature was exceeded causing
> > thermal throttling to kick in.
> >
> >> After the fault occurs, I get
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-0000
> >> Adapter: ISA adapter
> >> Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)
> >
> > So after that it stays at 400 MHz forever, right?
> >
> >>>>>>
> >>>>>> For now, please tell me what's in
> >>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> >>>>>
> >>>>> 800000
> >>>>
> >>>> Your effective freq is lower than 800MHz. One possible reason is
> >>>> thermal throttling.
> >>>>
> >>>> What distro are you using?
> >>>
> >>>
> >>> And what make and model of laptop?
> >>
> >>
> >> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
> >> 2.90GHz". That is a dual-core unit with hyperthreading.
> >>
> >> @Rafael: As I write this, the system has been running the infinite loop test
> >> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> >> gone, but I am certain that it is OK.
> >
> > OK, and what temperatures do you see while doing this?
>
> finger@linux-1t8h:~/linux-2.6> sensors
> coretemp-isa-0000
> Adapter: ISA adapter
> Physical id 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
> Core 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
> Core 1: +78.0°C (high = +84.0°C, crit = +100.0°C)
>
> Once again, the CPU temp is greater than the "high" value; however, the clock
> rate continues to hold near 3600 MHz.
>
> My laptop was inadvertently put to sleep while I was gone. I forgot to leave a
> note for my wife and she quieted the noisy cpu fan. :)
It looks like in 4.8-rc we made a change that caused the "high" trip point to
be acted on.
Srinivas, Rui, do you recall what that can be?
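
To make the comparison against the "high" trip point less error-prone than
reading the sensors output by eye, something like the following could be
used. It is only a sketch: the function name over_high() and the regular
expression are made up for illustration and assume the coretemp line format
shown in the quoted output, with nothing else about lm-sensors implied.

```python
import re

# Assumed line format (taken from the quoted `sensors` output):
#   Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
LINE_RE = re.compile(
    r"^(?P<label>[^:]+):\s*\+?(?P<temp>-?\d+(?:\.\d+)?)\s*°?C\s*"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)\s*°?C"
)


def over_high(sensors_text):
    """Return {label: (temp, high)} for readings strictly above 'high'.

    Hypothetical helper: lines that do not match the assumed format
    (adapter name, chip name, blank lines) are skipped.
    """
    flagged = {}
    for line in sensors_text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue
        temp = float(match.group("temp"))
        high = float(match.group("high"))
        if temp > high:
            flagged[match.group("label").strip()] = (temp, high)
    return flagged
```

Fed the 4.7 reading quoted above (+90.0°C on Physical id 0 and Core 0,
+78.0°C on Core 1, high = +84.0°C), this flags the first two labels and
not Core 1.
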
One more question (I think I asked it previously): In the failing case
(4.8-rc1 and later), when the frequency drops down to 400 MHz, does it
ever go back up, or is it stuck at that level forever?
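
One rough way to answer that is to sample the frequency for a while after
the drop and check whether it ever comes back. The sketch below assumes that
scaling_cur_freq reflects the effective frequency on this system, which may
not hold if clock modulation is involved. The helper names sample_khz() and
recovered() are hypothetical, and the 800000 kHz default floor is simply the
cpuinfo_min_freq value reported earlier in this thread.

```python
import time
from pathlib import Path

# Standard cpufreq sysfs attribute; the value is in kHz.
CUR_FREQ = Path("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")


def sample_khz(count, interval_s=1.0, path=CUR_FREQ):
    """Read the current frequency `count` times, `interval_s` seconds apart."""
    samples = []
    for _ in range(count):
        samples.append(int(path.read_text().strip()))
        time.sleep(interval_s)
    return samples


def recovered(samples_khz, floor_khz=800_000):
    """Return True if a sample at or above the floor follows a sample below it.

    False means either the frequency never dropped below the floor, or it
    dropped and stayed there for the rest of the samples.
    """
    dropped = False
    for khz in samples_khz:
        if khz < floor_khz:
            dropped = True
        elif dropped:
            return True
    return False
```

For example, recovered(sample_khz(600)) would cover roughly ten minutes at
the default one-second interval. A False result after an observed drop
suggests the frequency is stuck at the low level.
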
In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number. We'll need to
collect some tracepoint data to debug this, and a place to keep that data
for easy reference.
Thanks,
Rafael