Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism

From: Steven Noonan
Date: Mon Feb 09 2015 - 12:56:16 EST

Next message: Jim Davis: "randconfig build error with next-20150209, in fs/nfs/client.c"
Previous message: tip-bot for Jesse Brandeburg: "[tip:irq/core] genirq: Fix null pointer reference in irq_set_affinity_hint()"
In reply to: Preeti U Murthy: "Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism"
Next in thread: Steven Noonan: "Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, Feb 9, 2015 at 3:51 AM, Preeti U Murthy
<preeti@xxxxxxxxxxxxxxxxxx> wrote:
> Hi Steven,
>
> On 02/09/2015 01:02 PM, Steven Noonan wrote:
>> On Sun, Feb 8, 2015 at 8:49 PM, Preeti U Murthy
>> <preeti@xxxxxxxxxxxxxxxxxx> wrote:
>>> The powerclamp driver injects idle periods to stay within the thermal constraints.
>>> The driver does a fake idle by spawning per-cpu threads that call the mwait
>>> instruction. This behavior of fake idle can confuse the other kernel subsystems.
>>> For instance it calls into the nohz tick handlers, which are meant to be called
>>> only by the idle thread. It sets the state of the tick in each cpu to idle and
>>> stops the tick, when there are tasks on the runqueue. As a result the callers of
>>> idle_cpu()/ tick_nohz_tick_stopped() see different states of the cpu; while the
>>> former thinks that the cpu is busy, the latter thinks that it is idle. The outcome
>>> may be inconsistency in the scheduler/nohz states which can lead to serious
>>> consequences. One of them was reported on this thread:
>>> https://lkml.org/lkml/2014/12/11/365.
>>>
>>> Thomas posted out a patch to disable the powerclamp driver from calling into the
>>> tick nohz code which has taken care of the above regression for the moment. However
>>> powerclamp driver as a result, will not be able to inject idle periods due to the
>>> presence of periodic ticks. With the current design of fake idle, we cannot move
>>> towards a better solution.
>>> https://lkml.org/lkml/2014/12/18/169
>>>
>>> This patch aims at removing the concept of fake idle and instead makes the cpus
>>> truly idle by throttling the runqueues during the idle injection periods. The situation
>>> is in fact very similar to throttling of cfs_rqs when they exceed their bandwidths.
>>> The idle injection metrics can be mapped to the bandwidth control metrics 'quota' and
>>> 'period' to achieve the same result. When the powerclamping is begun or when the
>>> clamping controls have been modified, the bandwidth for the root task group is set.
>>> The 'quota' will be the amount of time that the system needs to be busy and 'period'
>>> will be the sum of this busy duration and the idle duration as calculated by the driver.
>>> This gets rid of per-cpu kthreads, control cpu, hotplug notifiers and clamping mask since
>>> the thread starting powerclamping will set the bandwidth and throttling of all cpus will
>>> automatically fall in place. None of the other cpus need be bothered about this. This
>>> simplifies the design of the driver.
>>>
>>> Of course this is only if the idle injection metrics can be conveniently transformed
>>> into bandwidth control metrics. There are a couple of other primary concerns around if
>>> doing the below two in this patch is valid.
>>> a. This patch exports the functions to set the quota and period of task groups.
>>> b. This patch removes the constraint of not being able to set the root task grp's bandwidth.
>>>
>>> Signed-off-by: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx>
>>
>> This doesn't compile.
>
> Thanks for reporting this! I realized that I had not compiled in the powerclamp driver
> as a module while compile testing it. I was focusing on the issues with the design and
> failed to cross verify this. Apologies for the inconvenience.
>
> Find the diff compile tested below.
>
> I also realized that clamp_cpus() that sets the bandwidth cannot be called from
> multiple places. Currently I am calling it from end_powerclamp(), when the user changes the
> idle clamping duration and from a queued timer. This will require synchronization between
> callers which is not really called for. The queued wakeup_timer alone can re-evaluate the
> clamping metrics after every throttle-unthrottle period and this should suffice as far
> as I can see. Thoughts ?

Hmm, I've had two system lockups so far while running a kernel with
intel_powerclamp loaded. Both times it slowly ground to a halt and
processes piled up...

> ----------------------------------------------------------------------------------
>
> V2 of intel_powerclamp driver
>
> From: Preeti U Murthy <preeti@xxxxxxxxxxxxxxxxxx>
>
>
> ---
> drivers/thermal/Kconfig | 1
> drivers/thermal/intel_powerclamp.c | 301 ++++++++++--------------------------
> include/linux/sched.h | 9 +
> kernel/sched/core.c | 6 -
> kernel/sched/sched.h | 5 -
> 5 files changed, 95 insertions(+), 227 deletions(-)
>
> diff --git a/drivers/thermal/Kconfig b/drivers/thermal/Kconfig
> index af40db0..4b7cd02 100644
> --- a/drivers/thermal/Kconfig
> +++ b/drivers/thermal/Kconfig
> @@ -233,6 +233,7 @@ config INTEL_POWERCLAMP
> depends on THERMAL
> depends on X86
> depends on CPU_SUP_INTEL
> + select CFS_BANDWIDTH
> help
> Enable this to enable Intel PowerClamp idle injection driver. This
> enforce idle time which results in more package C-state residency. The
> diff --git a/drivers/thermal/intel_powerclamp.c b/drivers/thermal/intel_powerclamp.c
> index 6ceebd6..4bd07bb 100644
> --- a/drivers/thermal/intel_powerclamp.c
> +++ b/drivers/thermal/intel_powerclamp.c
> @@ -51,6 +51,7 @@
> #include <linux/debugfs.h>
> #include <linux/seq_file.h>
> #include <linux/sched/rt.h>
> +#include <linux/sched.h>
>
> #include <asm/nmi.h>
> #include <asm/msr.h>
> @@ -78,20 +79,18 @@ static struct dentry *debug_dir;
> static unsigned int set_target_ratio;
> static unsigned int current_ratio;
> static bool should_skip;
> +static unsigned int count = 0;
> static bool reduce_irq;
> static atomic_t idle_wakeup_counter;
> -static unsigned int control_cpu; /* The cpu assigned to collect stat and update
> - * control parameters. default to BSP but BSP
> - * can be offlined.
> - */
> static bool clamping;
>
> +/* Timer that evaluates bandwidth reset if clamping knobs have changed */
> +static void clamp_timer_fn(unsigned long foo);
> +DEFINE_TIMER(wakeup_timer, clamp_timer_fn, 0, 0);
> +
> +static void clamp_cpus(void);
>
> -static struct task_struct * __percpu *powerclamp_thread;
> static struct thermal_cooling_device *cooling_dev;
> -static unsigned long *cpu_clamping_mask; /* bit map for tracking per cpu
> - * clamping thread
> - */
>
> static unsigned int duration;
> static unsigned int pkg_cstate_ratio_cur;
> @@ -163,7 +162,7 @@ static int window_size_set(const char *arg, const struct kernel_param *kp)
> smp_mb();
>
> exit_win:
> -
> + clamp_cpus();
> return ret;
> }
>
> @@ -256,10 +255,6 @@ static u64 pkg_state_counter(void)
> return count;
> }
>
> -static void noop_timer(unsigned long foo)
> -{
> - /* empty... just the fact that we get the interrupt wakes us up */
> -}
>
> static unsigned int get_compensation(int ratio)
> {
> @@ -362,100 +357,77 @@ static bool powerclamp_adjust_controls(unsigned int target_ratio,
> return set_target_ratio + guard <= current_ratio;
> }
>
> -static int clamp_thread(void *arg)
> +static void clamp_cpus(void)
> {
> - int cpunr = (unsigned long)arg;
> - DEFINE_TIMER(wakeup_timer, noop_timer, 0, 0);
> - static const struct sched_param param = {
> - .sched_priority = MAX_USER_RT_PRIO/2,
> - };
> - unsigned int count = 0;
> unsigned int target_ratio;
> + u64 quota, period;
> + int sleeptime;
> + unsigned long target_jiffies;
> + unsigned int guard;
> + unsigned int compensation;
> + int interval; /* jiffies to sleep for each attempt */
> + unsigned int duration_jiffies;
> + unsigned int window_size_now;
> +
> +again:
> + if (clamping == false)
> + goto out;
> +
> + quota = RUNTIME_INF;
> + period = NSEC_PER_SEC;
> + compensation = 0;
> + duration_jiffies = msecs_to_jiffies(duration);
> + /*
> + * make sure user selected ratio does not take effect until
> + * the next round. adjust target_ratio if user has changed
> + * target such that we can converge quickly.
> + */
> + target_ratio = set_target_ratio;
> + guard = 1 + target_ratio/20;
> + window_size_now = window_size;
>
> - set_bit(cpunr, cpu_clamping_mask);
> - set_freezable();
> - init_timer_on_stack(&wakeup_timer);
> - sched_setscheduler(current, SCHED_FIFO, &param);
> -
> - while (true == clamping && !kthread_should_stop() &&
> - cpu_online(cpunr)) {
> - int sleeptime;
> - unsigned long target_jiffies;
> - unsigned int guard;
> - unsigned int compensation = 0;
> - int interval; /* jiffies to sleep for each attempt */
> - unsigned int duration_jiffies = msecs_to_jiffies(duration);
> - unsigned int window_size_now;
> -
> - try_to_freeze();
> - /*
> - * make sure user selected ratio does not take effect until
> - * the next round. adjust target_ratio if user has changed
> - * target such that we can converge quickly.
> - */
> - target_ratio = set_target_ratio;
> - guard = 1 + target_ratio/20;
> - window_size_now = window_size;
> - count++;
> -
> - /*
> - * systems may have different ability to enter package level
> - * c-states, thus we need to compensate the injected idle ratio
> - * to achieve the actual target reported by the HW.
> - */
> - compensation = get_compensation(target_ratio);
> - interval = duration_jiffies*100/(target_ratio+compensation);
> -
> - /* align idle time */
> - target_jiffies = roundup(jiffies, interval);
> - sleeptime = target_jiffies - jiffies;
> - if (sleeptime <= 0)
> - sleeptime = 1;
> - schedule_timeout_interruptible(sleeptime);
> - /*
> - * only elected controlling cpu can collect stats and update
> - * control parameters.
> - */
> - if (cpunr == control_cpu && !(count%window_size_now)) {
> - should_skip =
> - powerclamp_adjust_controls(target_ratio,
> - guard, window_size_now);
> - smp_mb();
> - }
> -
> - if (should_skip)
> - continue;
> -
> - target_jiffies = jiffies + duration_jiffies;
> - mod_timer(&wakeup_timer, target_jiffies);
> - if (unlikely(local_softirq_pending()))
> - continue;
> - /*
> - * stop tick sched during idle time, interrupts are still
> - * allowed. thus jiffies are updated properly.
> - */
> - preempt_disable();
> - /* mwait until target jiffies is reached */
> - while (time_before(jiffies, target_jiffies)) {
> - unsigned long ecx = 1;
> - unsigned long eax = target_mwait;
> -
> - /*
> - * REVISIT: may call enter_idle() to notify drivers who
> - * can save power during cpu idle. same for exit_idle()
> - */
> - local_touch_nmi();
> - stop_critical_timings();
> - mwait_idle_with_hints(eax, ecx);
> - start_critical_timings();
> - atomic_inc(&idle_wakeup_counter);
> - }
> - preempt_enable();
> + /*
> + * systems may have different ability to enter package level
> + * c-states, thus we need to compensate the injected idle ratio
> + * to achieve the actual target reported by the HW.
> + */
> + compensation = get_compensation(target_ratio);
> + interval = duration_jiffies*100/(target_ratio+compensation);
> +
> + /* align idle time */
> + target_jiffies = roundup(jiffies, interval);
> + sleeptime = target_jiffies - jiffies;
> + if (sleeptime <= 0)
> + sleeptime = 1;
> +
> + if (!(count%window_size_now)) {
> + should_skip =
> + powerclamp_adjust_controls(target_ratio,
> + guard, window_size_now);
> + smp_mb();
> }
> - del_timer_sync(&wakeup_timer);
> - clear_bit(cpunr, cpu_clamping_mask);
>
> - return 0;
> + if (should_skip)
> + goto again;
> +
> + target_jiffies = jiffies + sleeptime + duration_jiffies;
> + mod_timer(&wakeup_timer, target_jiffies);
> + if (unlikely(local_softirq_pending()))
> + goto again;
> +
> + quota = jiffies_to_usecs(sleeptime);
> + period = jiffies_to_usecs(sleeptime + duration_jiffies);
> +
> +out:
> + tg_set_cfs_quota(&root_task_group, quota);
> + tg_set_cfs_period(&root_task_group, period);
> +}
> +
> +static void clamp_timer_fn(unsigned long foo)
> +{
> + /* Evaluate to see if clamping controls need to be adjusted */
> + count++;
> + clamp_cpus();
> }
>
> /*
> @@ -501,8 +473,7 @@ static void poll_pkg_cstate(struct work_struct *dummy)
>
> static int start_power_clamp(void)
> {
> - unsigned long cpu;
> - struct task_struct *thread;
> + clamping = true;
>
> /* check if pkg cstate counter is completely 0, abort in this case */
> if (!has_pkg_state_counter()) {
> @@ -511,108 +482,21 @@ static int start_power_clamp(void)
> }
>
> set_target_ratio = clamp(set_target_ratio, 0U, MAX_TARGET_RATIO - 1);
> - /* prevent cpu hotplug */
> - get_online_cpus();
> -
> - /* prefer BSP */
> - control_cpu = 0;
> - if (!cpu_online(control_cpu))
> - control_cpu = smp_processor_id();
>
> - clamping = true;
> schedule_delayed_work(&poll_pkg_cstate_work, 0);
> -
> - /* start one thread per online cpu */
> - for_each_online_cpu(cpu) {
> - struct task_struct **p =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%ld", cpu);
> - /* bind to cpu here */
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *p = thread;
> - }
> -
> - }
> - put_online_cpus();
> + clamp_cpus();
>
> return 0;
> }
>
> static void end_power_clamp(void)
> {
> - int i;
> - struct task_struct *thread;
> -
> clamping = false;
> - /*
> - * make clamping visible to other cpus and give per cpu clamping threads
> - * sometime to exit, or gets killed later.
> - */
> - smp_mb();
> - msleep(20);
> - if (bitmap_weight(cpu_clamping_mask, num_possible_cpus())) {
> - for_each_set_bit(i, cpu_clamping_mask, num_possible_cpus()) {
> - pr_debug("clamping thread for cpu %d alive, kill\n", i);
> - thread = *per_cpu_ptr(powerclamp_thread, i);
> - kthread_stop(thread);
> - }
> - }
> -}
>
> -static int powerclamp_cpu_callback(struct notifier_block *nfb,
> - unsigned long action, void *hcpu)
> -{
> - unsigned long cpu = (unsigned long)hcpu;
> - struct task_struct *thread;
> - struct task_struct **percpu_thread =
> - per_cpu_ptr(powerclamp_thread, cpu);
> -
> - if (false == clamping)
> - goto exit_ok;
> -
> - switch (action) {
> - case CPU_ONLINE:
> - thread = kthread_create_on_node(clamp_thread,
> - (void *) cpu,
> - cpu_to_node(cpu),
> - "kidle_inject/%lu", cpu);
> - if (likely(!IS_ERR(thread))) {
> - kthread_bind(thread, cpu);
> - wake_up_process(thread);
> - *percpu_thread = thread;
> - }
> - /* prefer BSP as controlling CPU */
> - if (cpu == 0) {
> - control_cpu = 0;
> - smp_mb();
> - }
> - break;
> - case CPU_DEAD:
> - if (test_bit(cpu, cpu_clamping_mask)) {
> - pr_err("cpu %lu dead but powerclamping thread is not\n",
> - cpu);
> - kthread_stop(*percpu_thread);
> - }
> - if (cpu == control_cpu) {
> - control_cpu = smp_processor_id();
> - smp_mb();
> - }
> - }
> -
> -exit_ok:
> - return NOTIFY_OK;
> + clamp_cpus();
> + del_timer_sync(&wakeup_timer);
> }
>
> -static struct notifier_block powerclamp_cpu_notifier = {
> - .notifier_call = powerclamp_cpu_callback,
> -};
> -
> static int powerclamp_get_max_state(struct thermal_cooling_device *cdev,
> unsigned long *state)
> {
> @@ -656,6 +540,7 @@ static int powerclamp_set_cur_state(struct thermal_cooling_device *cdev,
> }
>
> exit_set:
> + clamp_cpus();
> return ret;
> }
>
> @@ -716,7 +601,6 @@ static int powerclamp_debug_show(struct seq_file *m, void *unused)
> {
> int i = 0;
>
> - seq_printf(m, "controlling cpu: %d\n", control_cpu);
> seq_printf(m, "pct confidence steady dynamic (compensation)\n");
> for (i = 0; i < MAX_TARGET_RATIO; i++) {
> seq_printf(m, "%d\t%lu\t%lu\t%lu\n",
> @@ -762,33 +646,20 @@ file_error:
> static int powerclamp_init(void)
> {
> int retval;
> - int bitmap_size;
> -
> - bitmap_size = BITS_TO_LONGS(num_possible_cpus()) * sizeof(long);
> - cpu_clamping_mask = kzalloc(bitmap_size, GFP_KERNEL);
> - if (!cpu_clamping_mask)
> - return -ENOMEM;
>
> /* probe cpu features and ids here */
> retval = powerclamp_probe();
> if (retval)
> - goto exit_free;
> + goto exit;
>
> /* set default limit, maybe adjusted during runtime based on feedback */
> window_size = 2;
> - register_hotcpu_notifier(&powerclamp_cpu_notifier);
> -
> - powerclamp_thread = alloc_percpu(struct task_struct *);
> - if (!powerclamp_thread) {
> - retval = -ENOMEM;
> - goto exit_unregister;
> - }
>
> cooling_dev = thermal_cooling_device_register("intel_powerclamp", NULL,
> &powerclamp_cooling_ops);
> if (IS_ERR(cooling_dev)) {
> retval = -ENODEV;
> - goto exit_free_thread;
> + goto exit;
> }
>
> if (!duration)
> @@ -798,23 +669,15 @@ static int powerclamp_init(void)
>
> return 0;
>
> -exit_free_thread:
> - free_percpu(powerclamp_thread);
> -exit_unregister:
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> -exit_free:
> - kfree(cpu_clamping_mask);
> +exit:
> return retval;
> }
> module_init(powerclamp_init);
>
> static void powerclamp_exit(void)
> {
> - unregister_hotcpu_notifier(&powerclamp_cpu_notifier);
> end_power_clamp();
> - free_percpu(powerclamp_thread);
> thermal_cooling_device_unregister(cooling_dev);
> - kfree(cpu_clamping_mask);
>
> cancel_delayed_work_sync(&poll_pkg_cstate_work);
> debugfs_remove_recursive(debug_dir);
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 8db31ef..2493942 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -163,6 +163,11 @@ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
> load += n*(FIXED_1-exp); \
> load >>= FSHIFT;
>
> +/*
> + * single value that denotes runtime == period, ie unlimited time.
> + */
> +#define RUNTIME_INF ((u64)~0ULL)
> +
> extern unsigned long total_forks;
> extern int nr_threads;
> DECLARE_PER_CPU(unsigned long, process_counts);
> @@ -3002,6 +3007,10 @@ extern long sched_getaffinity(pid_t pid, struct cpumask *mask);
>
> #ifdef CONFIG_CGROUP_SCHED
> extern struct task_group root_task_group;
> +#ifdef CONFIG_CFS_BANDWIDTH
> +extern int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us);
> +extern int tg_set_cfs_period(struct task_group *tg, long cfs_period_us);
> +#endif /* CONFIG_CFS_BANDWIDTH */
> #endif /* CONFIG_CGROUP_SCHED */
>
> extern int task_can_switch_user(struct user_struct *up,
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index e628cb1..7471b06 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7097,6 +7097,7 @@ int in_sched_functions(unsigned long addr)
> * Every task in system belongs to this group at bootup.
> */
> struct task_group root_task_group;
> +EXPORT_SYMBOL_GPL(root_task_group);
> LIST_HEAD(task_groups);
> #endif
>
> @@ -8059,9 +8060,6 @@ static int tg_set_cfs_bandwidth(struct task_group *tg, u64 period, u64 quota)
> int i, ret = 0, runtime_enabled, runtime_was_enabled;
> struct cfs_bandwidth *cfs_b = &tg->cfs_bandwidth;
>
> - if (tg == &root_task_group)
> - return -EINVAL;
> -
> /*
> * Ensure we have at some amount of bandwidth every period. This is
> * to prevent reaching a state of large arrears when throttled via
> @@ -8141,6 +8139,7 @@ int tg_set_cfs_quota(struct task_group *tg, long cfs_quota_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_quota);
>
> long tg_get_cfs_quota(struct task_group *tg)
> {
> @@ -8164,6 +8163,7 @@ int tg_set_cfs_period(struct task_group *tg, long cfs_period_us)
>
> return tg_set_cfs_bandwidth(tg, period, quota);
> }
> +EXPORT_SYMBOL_GPL(tg_set_cfs_period);
>
> long tg_get_cfs_period(struct task_group *tg)
> {
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 9a2a45c..20493e4 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -72,11 +72,6 @@ extern void update_cpu_load_active(struct rq *this_rq);
> * These are the 'tuning knobs' of the scheduler:
> */
>
> -/*
> - * single value that denotes runtime == period, ie unlimited time.
> - */
> -#define RUNTIME_INF ((u64)~0ULL)
> -
> static inline int fair_policy(int policy)
> {
> return policy == SCHED_NORMAL || policy == SCHED_BATCH;
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Jim Davis: "randconfig build error with next-20150209, in fs/nfs/client.c"
Previous message: tip-bot for Jesse Brandeburg: "[tip:irq/core] genirq: Fix null pointer reference in irq_set_affinity_hint()"
In reply to: Preeti U Murthy: "Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism"
Next in thread: Steven Noonan: "Re: [PATCH V2] idle/intel_powerclamp: Redesign idle injection to use bandwidth control mechanism"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]