Re: CONFIG_X86_INTEL_PSTATE disables CPU frequency transition stats, many governors and other standard features

From: Andy Lutomirski
Date: Mon Apr 29 2013 - 22:21:27 EST


On 04/27/2013 07:35 AM, Rafael J. Wysocki wrote:
> On Saturday, April 27, 2013 04:58:53 AM Artem S. Tashkinov wrote:
>> Hello,
>>
>> Just wanted to let everyone know that CONFIG_X86_INTEL_PSTATE wreaks
>> havoc with the CPU frequency subsystem in the Linux kernel.
>>
>> With this option enabled:
>>
>> 1) All governors except performance and powersave are gone: ondemand,
>> userspace, conservative
>>
>> 2) scaling_cur_freq is gone, thus user space utilities monitoring the CPU
>> frequency have stopped working
>>
>> 3) CPU frequency transition stats are gone, there's no "stats" directory
>> anywhere
>>
>> 4) scaling_available_frequencies is gone, so I cannot set the desired constant
>> CPU frequency (the userspace governor is not available anyway)
>>
>> Is this intended behavior? I shudder to think that's the case.
>>
>> The bug report is filed here: https://bugzilla.kernel.org/show_bug.cgi?id=57141
>
> intel_pstate is not a usual cpufreq driver and from the cpufreq's perspective
> it contains its own governor. That's the reason why the other scaling governors
> aren't available with it.
>
> The sysfs attributes mentioned above are missing simply because they don't make
> sense with intel_pstate.
>
> I'm only wondering which user space doesn't work correctly with intel_pstate as
> you said in the bug entry above.
>
> If you don't want to use intel_pstate (in which case the ACPI driver will be
> used instead), please append intel_pstate=disable to the kernel command line.

Out of curiosity, what is this driver doing?

It uses aperf/mperf magic to (I think) estimate how busy the CPU has
been recently. (This is clearly somewhat Intel-specific, but a similar
estimate could be made using knowledge of the programmed frequency and
the scheduler's idle time on any CPU.)
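
To make that concrete, here is a rough Python sketch of both estimates.
It is not the driver's code. It relies on the documented counter
semantics (APERF ticks at the actual clock, MPERF at a fixed reference
clock, both only while the core is in C0) and assumes MPERF and the TSC
tick at the same rate. The function and parameter names are mine, made
up for illustration.

```python
def busy_from_counters(d_aperf, d_mperf, d_tsc):
    """Estimate load, relative to the reference clock, from counter deltas.

    d_mperf / d_tsc   -> fraction of the interval spent in C0 (not idle)
    d_aperf / d_mperf -> average clock while in C0, relative to reference
    """
    if d_tsc <= 0:
        raise ValueError("d_tsc must be positive")
    if d_mperf == 0:
        return 0.0  # the core never left idle during the interval
    c0_fraction = d_mperf / d_tsc
    freq_ratio = d_aperf / d_mperf
    return c0_fraction * freq_ratio


def busy_from_idle_time(interval_us, idle_us, programmed_khz, reference_khz):
    """Vendor-neutral variant: scheduler idle time plus programmed frequency."""
    if interval_us <= 0 or reference_khz <= 0:
        raise ValueError("interval_us and reference_khz must be positive")
    c0_fraction = 1.0 - idle_us / interval_us
    return c0_fraction * programmed_khz / reference_khz
```

A core that spent half the interval in C0 while running at half the
reference clock comes out the same from either function.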

It samples that estimate every 10 ms (why is this even remotely
acceptable in a driver that's supposed to save power?).

Using that sample, it updates one of two PID controllers to bring the
busy or idle fraction (which one depends on the choice of controller) to
a target value of 109/256 or 75/256. In practice, it seems like once it
starts using the busy controller, it never goes back unless XPERF_FIX is
#defined, which it isn't.

It then adjusts the pstate as decreed by the PID controller.

At least this has the property that, the busier the CPU, the higher the
pstate.
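
As a toy model of the loop described above (sample, run a PID
controller toward a setpoint, step the pstate), something like the
following Python sketch captures the shape. Only the 109/256 setpoint
comes from my reading of the driver; the gains, the pstate range and
all names here are hypothetical, and the real code uses fixed-point
arithmetic that this does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    setpoint: float          # target busy fraction, e.g. 109 / 256
    kp: float                # proportional gain (hypothetical value)
    ki: float                # integral gain (hypothetical value)
    kd: float                # derivative gain (hypothetical value)
    integral: float = 0.0
    last_error: float = 0.0

    def step(self, measured):
        """Feed one sample (taken every 10 ms, say) and return the control value."""
        error = self.setpoint - measured
        self.integral += error
        derivative = error - self.last_error
        self.last_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def adjust_pstate(pstate, control, min_pstate, max_pstate):
    """Apply the control value: negative means busier than target, so go up."""
    new_pstate = pstate - round(control)
    return max(min_pstate, min(max_pstate, new_pstate))


# One iteration for a fully busy CPU, with made-up gains and limits.
pid = PIDController(setpoint=109 / 256, kp=16.0, ki=0.0, kd=0.0)
pstate = adjust_pstate(16, pid.step(1.0), min_pstate=8, max_pstate=34)
```

With a positive proportional gain, a sample above the setpoint always
pushes the pstate up and a sample below it pushes the pstate down,
which is the "busier means higher" property.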

Not to sidetrack the discussion, but (wearing my HFT hat for a moment)
has anyone else noticed that C1E is an absolute disaster for
performance? IMO the kernel should turn off C1E in case the BIOS is
malicious enough to turn it on. The kernel should then treat
all-cores-idle as an extra, somewhat strange idle state with very high
exit latency, and use it (and adjust frequency) accordingly.

--Andy