Re: [PATCH v3 6/7] thermal/drivers/cpu_cooling: Introduce the cpu idle cooling driver

From: Martin Kepplinger
Date: Mon Aug 05 2019 - 02:53:56 EST


On 05.08.19 07:11, Martin Kepplinger wrote:
> ---
>
> On 05-04-18, 18:16, Daniel Lezcano wrote:
>> The cpu idle cooling driver performs synchronized idle injection across all
>> cpus belonging to the same cluster and offers a new method to cool down a SoC.
>>
>> Each cluster has its own idle cooling device, each core has its own idle
>> injection thread, each idle injection thread uses play_idle to enter idle. In
>> order to reach the deepest idle state, each cooling device has the idle
>> injection threads synchronized together.
>>
>> It has some similarity with the Intel powerclamp driver, but it is designed
>> to work on the ARM architecture via the DT, with a mathematical proof based
>> on the power model described in the accompanying Documentation.
>>
>> The idle injection cycle is fixed while the running cycle is variable. That
>> allows control over the device's reactivity for the user experience. At the
>> mitigation point the idle threads are unparked, they play idle for the
>> specified amount of time and then schedule themselves out. The last thread
>> sets the next idle injection deadline and when the timer expires it wakes up
>> all the threads, which in turn play idle again. Meanwhile the running cycle
>> is changed by set_cur_state. When the mitigation ends, the threads are
>> parked. The algorithm is self-adaptive, so there is no need to handle
>> hotplugging.
>>
>> If we take an example of the balanced point, we can use the DT for the hi6220.
>>
>> The sustainable power for the SoC is 3326mW, to mitigate at 75°C. Eight cores
>> running at full blast at the maximum OPP consume 5280mW. The first value is
>> given in the DT, the second is calculated from the OPP with the formula:
>>
>> Pdyn = Cdyn x Voltage^2 x Frequency
>>
>> As the SoC vendors don't want to share the static leakage values, we assume
>> it is zero, so Prun = Pdyn + Pstatic = Pdyn + 0 = Pdyn.
>>
>> In order to reduce the power to 3326mW, we have to apply a ratio to the
>> running time.
>>
>> ratio = (Prun - Ptarget) / Ptarget = (5280 - 3326) / 3326 = 0.5874
>>
>> We know the idle cycle, which is fixed; let's assume 10ms. However, from this
>> duration we have to subtract the wake-up latency for the cluster idle state.
>> In our case, it is 1.5ms. So for a 10ms idle cycle, we are really idle for
>> 8.5ms.
>>
>> As we know the idle duration and the ratio, we can compute the running cycle.
>>
>> running_cycle = 8.5 / 0.5874 = 14.47ms
>>
>> So for 8.5ms of idle, we have 14.47ms of running cycle, and that brings the
>> SoC to the balanced trip point of 75°C.
>>
>> The driver has been tested on the hi6220 and it appears the temperature
>> stabilizes at 75°C with an idle injection time of 10ms (8.5ms real) and a
>> running cycle of 14ms, as expected from the theory above.
>>
>> Signed-off-by: Kevin Wangtao <kevin.wangtao@xxxxxxxxxx>
>> Signed-off-by: Daniel Lezcano <daniel.lezcano@xxxxxxxxxx>
>> ---
>> drivers/thermal/Kconfig | 10 +
>> drivers/thermal/cpu_cooling.c | 479 ++++++++++++++++++++++++++++++++++++++++++
>> include/linux/cpu_cooling.h | 6 +
>> 3 files changed, 495 insertions(+)
>>
>> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
>> index 5aaae1b..6c34117 100644
>> --- a/drivers/thermal/Kconfig
>> +++ b/drivers/thermal/Kconfig
>> @@ -166,6 +166,16 @@ config CPU_FREQ_THERMAL
>> This will be useful for platforms using the generic thermal interface
>> and not the ACPI interface.
>>
>> +config CPU_IDLE_THERMAL
>> + bool "CPU idle cooling strategy"
>> + depends on CPU_IDLE
>> + help
>> + This implements the generic CPU cooling mechanism through
>> + idle injection. This will throttle the CPU by injecting
>> + fixed idle cycles. All CPUs belonging to the same cluster
>> + will enter idle synchronously to reach the deepest idle
>> + state.
>> +
>> endchoice
>>
>> config CLOCK_THERMAL
>> diff --git a/drivers/thermal/cpu_cooling.c b/drivers/thermal/cpu_cooling.c
>> index 5c219dc..1eec8d6 100644
>> --- a/drivers/thermal/cpu_cooling.c
>> +++ b/drivers/thermal/cpu_cooling.c
>> @@ -10,18 +10,33 @@
>> * Viresh Kumar <viresh.kumar@xxxxxxxxxx>
>> *
>> */
>> +#define pr_fmt(fmt) "CPU cooling: " fmt
>> +
>> #include <linux/module.h>
>> #include <linux/thermal.h>
>> #include <linux/cpufreq.h>
>> +#include <linux/cpuidle.h>
>> #include <linux/err.h>
>> +#include <linux/freezer.h>
>> #include <linux/idr.h>
>> +#include <linux/kthread.h>
>> #include <linux/pm_opp.h>
>> #include <linux/slab.h>
>> +#include <linux/sched/prio.h>
>> +#include <linux/sched/rt.h>
>> +#include <linux/smpboot.h>
>> #include <linux/cpu.h>
>> #include <linux/cpu_cooling.h>
>>
>> +#include <linux/ratelimit.h>
>> +
>> +#include <linux/platform_device.h>
>> +#include <linux/of_platform.h>
>> +
>> #include <trace/events/thermal.h>
>>
>> +#include <uapi/linux/sched/types.h>
>> +
>> #ifdef CONFIG_CPU_FREQ_THERMAL
>> /*
>> * Cooling state <-> CPUFreq frequency
>> @@ -928,3 +943,467 @@ void cpufreq_cooling_unregister(struct thermal_cooling_device *cdev)
>> }
>> EXPORT_SYMBOL_GPL(cpufreq_cooling_unregister);
>> #endif /* CONFIG_CPU_FREQ_THERMAL */
>> +
>> +#ifdef CONFIG_CPU_IDLE_THERMAL
>> +/**
>> + * struct cpuidle_cooling_device - data for the idle cooling device
>> + * @cdev: a pointer to a struct thermal_cooling_device
>> + * @cpumask: a cpumask containing the CPUs managed by the cooling device
>> + * @timer: a hrtimer giving the tempo for the idle injection cycles
>> + * @kref: a kernel refcount on this structure
>> + * @count: an atomic to keep track of the last task exiting the idle cycle
>> + * @idle_cycle: an integer defining the duration of the idle injection
>> + * @state: a normalized integer giving the state of the cooling device
>> + */
>> +struct cpuidle_cooling_device {
>> + struct thermal_cooling_device *cdev;
>> + struct cpumask *cpumask;
>> + struct hrtimer timer;
>> + struct kref kref;
>> + atomic_t count;
>> + unsigned int idle_cycle;
>> + unsigned long state;
>> +};
>> +
>> +struct cpuidle_cooling_thread {
>> + struct task_struct *tsk;
>> + int should_run;
>> +};
>> +
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_thread, cpuidle_cooling_thread);
>> +static DEFINE_PER_CPU(struct cpuidle_cooling_device *, cpuidle_cooling_device);
>> +
>> +/**
>> + * cpuidle_cooling_wakeup - Wake up all idle injection threads
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * Every idle injection task belonging to the idle cooling device and
>> + * running on an online cpu will be woken up by this call.
>> + */
>> +static void cpuidle_cooling_wakeup(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + struct cpuidle_cooling_thread *cct;
>> + int cpu;
>> +
>> + for_each_cpu_and(cpu, idle_cdev->cpumask, cpu_online_mask) {
>> + cct = per_cpu_ptr(&cpuidle_cooling_thread, cpu);
>> + cct->should_run = 1;
>> + wake_up_process(cct->tsk);
>> + }
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_wakeup_fn - Running cycle timer callback
>> + * @timer: a hrtimer structure
>> + *
>> + * When the mitigation is active, the CPU is allowed to run for an
>> + * amount of time, then the idle injection happens for the specified
>> + * delay and each idle injection task schedules itself out until the
>> + * timer event wakes the idle injection tasks again for a new idle
>> + * injection cycle. The time between the end of the idle injection and
>> + * the timer expiration is the allocated running time for the CPU.
>> + *
>> + * Always returns HRTIMER_NORESTART
>> + */
>> +static enum hrtimer_restart cpuidle_cooling_wakeup_fn(struct hrtimer *timer)
>> +{
>> + struct cpuidle_cooling_device *idle_cdev =
>> + container_of(timer, struct cpuidle_cooling_device, timer);
>> +
>> + cpuidle_cooling_wakeup(idle_cdev);
>> +
>> + return HRTIMER_NORESTART;
>> +}
>> +
>> +/**
>> + * cpuidle_cooling_runtime - Running time computation
>> + * @idle_cdev: the idle cooling device
>> + *
>> + * The running duration is computed from the idle injection duration,
>> + * which is fixed. If we reach a 100% idle injection ratio, that
>> + * means the running duration is zero. If we have a 50% injection
>> + * ratio, that means the idle and running durations are equal.
>> + *
>> + * The formula is deduced as the following:
>> + *
>> + * running = idle x ((100 / ratio) - 1)
>> + *
>> + * For precision purpose for integer math, we use the following:
>> + *
>> + * running = (idle x 100) / ratio - idle
>> + *
>> + * For example, with a 10ms idle cycle and a 50% injection ratio, we
>> + * end up with 10ms of idle injection and 10ms of running duration.
>> + *
>> + * Returns the running duration as an s64, in nanoseconds
>> + */
>> +static s64 cpuidle_cooling_runtime(struct cpuidle_cooling_device *idle_cdev)
>> +{
>> + s64 next_wakeup;
>> + unsigned long state = idle_cdev->state;
>> +
>> + /*
>> + * The function should not be called when there is no
>> + * mitigation because:
>> + * - that does not make sense
>> + * - we end up with a division by zero
>> + */
>> + if (!state)
>> + return 0;
>> +
>> + next_wakeup = (s64)((idle_cdev->idle_cycle * 100) / state) -
>> + idle_cdev->idle_cycle;
>> +
>> + return next_wakeup * NSEC_PER_USEC;
>> +}
>> +
>
> There is a bug in your calculation formula here when "state" becomes 100:
> you return 0 for the running time, which is the same value as for "state"
> being 0, and that is dangerous. You stop cooling when it's most necessary :)
>
> I'm not sure how much sense being 100% idle really makes, so when testing
> this I just do: if (state == 100) { state = 99; }. Anyway, just don't
> return 0.
>

Oh, and also: this breaks S3 suspend:

Aug 5 06:09:20 pureos kernel: [ 807.487887] PM: suspend entry (deep)
Aug 5 06:09:40 pureos kernel: [ 807.501148] Filesystems sync: 0.013 seconds
Aug 5 06:09:40 pureos kernel: [ 807.501591] Freezing user space processes ... (elapsed 0.003 seconds) done.
Aug 5 06:09:40 pureos kernel: [ 807.504741] OOM killer disabled.
Aug 5 06:09:40 pureos kernel: [ 807.504744] Freezing remaining freezable tasks ...
Aug 5 06:09:40 pureos kernel: [ 827.517712] Freezing of tasks failed after 20.002 seconds (4 tasks refusing to freeze, wq_busy=0):
Aug 5 06:09:40 pureos kernel: [ 827.527122] thermal-idle/0 S 0 161 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527131] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527148] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527156] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527162] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527169] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527176] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527182] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527186] thermal-idle/1 S 0 162 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527192] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527197] __switch_to+0x188/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527203] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527208] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527213] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527218] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527222] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527226] thermal-idle/2 S 0 163 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527231] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527237] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527242] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527247] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527259] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527264] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527268] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527272] thermal-idle/3 S 0 164 2 0x00000028
Aug 5 06:09:40 pureos kernel: [ 827.527278] Call trace:
Aug 5 06:09:40 pureos kernel: [ 827.527283] __switch_to+0xb4/0x200
Aug 5 06:09:40 pureos kernel: [ 827.527288] __schedule+0x1e0/0x488
Aug 5 06:09:40 pureos kernel: [ 827.527293] schedule+0x38/0xc8
Aug 5 06:09:40 pureos kernel: [ 827.527298] smpboot_thread_fn+0x250/0x2a8
Aug 5 06:09:40 pureos kernel: [ 827.527303] kthread+0xf4/0x120
Aug 5 06:09:40 pureos kernel: [ 827.527308] ret_from_fork+0x10/0x18
Aug 5 06:09:40 pureos kernel: [ 827.527375] Restarting kernel threads ... done.
Aug 5 06:09:40 pureos kernel: [ 827.527771] OOM killer enabled.
Aug 5 06:09:40 pureos kernel: [ 827.527772] Restarting tasks ... done.
Aug 5 06:09:40 pureos kernel: [ 827.528926] PM: suspend exit


Do you know where things might go wrong here?

thanks,

martin