Re: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling withdifferent CPUs

From: Venki Pallipadi
Date: Wed Aug 10 2011 - 13:06:08 EST

Next message: Peter Zijlstra: "Re: [PATCH 3/5] writeback: dirty rate control"
Previous message: Randy Dunlap: "Re: [PATCH 2/2] Enable 'make CONFIG_FOO=y oldconfig'"
In reply to: Venki Pallipadi: "Re: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling withdifferent CPUs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sun, Jul 31, 2011 at 6:08 PM, Yung, Winson W <winson.w.yung@xxxxxxxxx> wrote:
> Thanks Venki for the comment, here is my thoughts on the change:
>
> Correct me if I am wrong, it is probably true on PC that hw (bios) coordination mode for p-state is better over sw coordination mode, however on smartphone/tablet devices, there is no BIOS, so sw coordination mode is the only choice.

Hi Winson,

On all chips supporting hardware coordination that I know of, it is
done in hardware and not in BIOS. And there is a toggle switch that
lets BIOS/software choose either hardware or software coordination.

> I am not sure I can simplify the change inside do_dbs_timer per your comment, dbs_info is per cpu, how are other CPUs suppose to see cpu change without getting that CPU's dbs_info? That's why I store the cpu and dbs_timer interval info inside policy which shared by all CPUs.

They know about the leader (policy->cpu) and they can access that cpus
percpu data to find out. You can even store the variable in policy
itself.

Thanks,
Venki

>
> Having said that, how to handle cpu hotplug is something that my patch didn't do. If I offline one cpu and then online it again, it will no longer participate do_dbs_timer call. The code to add it back to dbs_timer cpus list and use the same shared policy is somewhat complicated. I am hoping to get some better idea on this from the community.
>
> BTW, this delayed work queue problem should exist in both ondemand and conservative governors.
>
> /Winson
>
>>-----Original Message-----
>>From: Venkatesh Pallipadi [mailto:venki@xxxxxxxxxx]
>>Sent: Friday, July 29, 2011 9:53 AM
>>To: Yung, Winson W
>>Cc: linux-kernel@xxxxxxxxxxxxxxx; Van De Ven, Arjan
>>Subject: Re: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling
>>with different CPUs
>>
>>On Tue, Jul 26, 2011 at 10:22 AM, Yung, Winson W
>><winson.w.yung@xxxxxxxxx> wrote:
>>>
>>> with using delayed work queue inside ondemand governor, I came to
>>> realize that there are certain scenarios the governor may not
>>> re-adjust cpu frequency swiftly for some duration. Here is a
>>> brief description of the problem:
>>>
>>> Currently when ondemand governor starts, it will create kthreads
>>> for each CPU, something like kondemand/0, kondemand/1 and so on.
>>> I found from ftrace that kondemand/0 is the only kthread that
>>> handles all the CPU frequency dynamic cpu frequency scaling, other
>>> kthread such as kondemand/1 is never used. I also confirmed this
>>> by debugging through the kernel code.
>>
>>This is not universally true. You are seeing this because the platform
>>is using SOFTWARE_COORDINATION mode for P-states. Most optimal in
>>terms of overhead is HARDWARE_COORDINATION mode, and this used to be
>>the recommended mode for platforms. And HW coordination is the
>>(atleast used to be) the more common mode across platforms.
>>Not sure whether this platform has SW coordination as a feature or bug?
>>
>>>
>>> So if there is no timer fired on CPU-0 (i.e. CPU-0 in C6), hence
>>> no deferrable timers will ever fired by kondemand/0 to sample all
>>> CPU workload. Since it is the only kthread doing the CPU freqency
>>> scaling for all CPUs, it is possible to enter in the situation
>>> where CPU frequency scaling will stop working for some duration
>>> until CPU-0 becomes not idle again.
>>
>>Yes. This will be a problem.
>>But, I am not sure whether the added complexity with the change below
>>to support the sub-optimal mode here. But, if we have a huge number of
>>platforms that are using software coordination, then there may not be
>>much of a choice here :(
>>
>>>
>>> Signed-off-by: Hari K Kanigeri <hari.kkanigeri@xxxxxxxxx>
>>> ---
>>> drivers/cpufreq/cpufreq_ondemand.c | 93
>>+++++++++++++++++++++++++++++++----
>>> include/linux/cpufreq.h | 3 +
>>> 2 files changed, 85 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/drivers/cpufreq/cpufreq_ondemand.c
>>b/drivers/cpufreq/cpufreq_ondemand.c
>>> index 891360e..b4f4f9d 100644
>>> --- a/drivers/cpufreq/cpufreq_ondemand.c
>>> +++ b/drivers/cpufreq/cpufreq_ondemand.c
>>> @@ -541,6 +541,9 @@ static void dbs_check_cpu(struct cpu_dbs_info_s
>>*this_dbs_info)
>>>
>>> static void do_dbs_timer(struct work_struct *work)
>>> {
>>> + u64 old_time, new_do_dbs_time;
>>> + u64 saved_last_time;
>>> +
>>> struct cpu_dbs_info_s *dbs_info =
>>> container_of(work, struct cpu_dbs_info_s, work.work);
>>> unsigned int cpu = dbs_info->cpu;
>>> @@ -548,6 +551,41 @@ static void do_dbs_timer(struct work_struct *work)
>>>
>>> int delay;
>>>
>>> + new_do_dbs_time = get_jiffies_64();
>>> +
>>> + /* Save a copy of last_do_dbs_time do_dbs_timer */
>>> +
>>> + saved_last_time = atomic64_read(
>>> + &dbs_info->cur_policy->last_do_dbs_time);
>>> +
>>> + /* If the last time do_dbs_timer ran is less than */
>>> + /* 'delay' delta from current time, one of other */
>>> + /* CPUs is still handling cpufreq governoring for */
>>> + /* for all CPUs. */
>>> +
>>> + if ((new_do_dbs_time - saved_last_time) <
>>> + dbs_info->cur_policy->last_delay) {
>>> +
>>> + /* If do_dbs_time ran not long ago */
>>> + /* (delta is less than last_delay) */
>>> + /* then bail out because there is */
>>> + /* no need to run so frequently. */
>>> + goto do_exit;
>>> + }
>>> +
>>> + old_time = atomic64_cmpxchg(&dbs_info->cur_policy-
>>>last_do_dbs_time,
>>> + saved_last_time,
>>new_do_dbs_time);
>>> +
>>> + /* If old_time is not equal to saved_last_time, */
>>> + /* it means another CPU is calling do_dbs_time. */
>>> +
>>> + if (old_time != saved_last_time)
>>> + goto do_exit;
>>> +
>>> + /* Switch cpu saved in the policy */
>>> + if (dbs_info->cur_policy->cpu != cpu)
>>> + dbs_info->cur_policy->cpu = cpu;
>>> +
>>
>>
>>This seems somewhat complicated way to do this.. One simpler way could
>>be to have a new atomic element in dbs_info that stores the current
>>CPU among policy->cpus doing dbs_timers. By default it will be
>>policy->cpu and when it goes idle it changes the variable to -1, so
>>that any other CPU can nominate itself (using atomic cmpxchg). If it
>>is still -1, then policy->cpu does the check anyway.
>>
>>> mutex_lock(&dbs_info->timer_mutex);
>>>
>>> /* Common NORMAL_SAMPLE setup */
>>> @@ -559,6 +597,7 @@ static void do_dbs_timer(struct work_struct *work)
>>> /* Setup timer for SUB_SAMPLE */
>>> dbs_info->sample_type = DBS_SUB_SAMPLE;
>>> delay = dbs_info->freq_hi_jiffies;
>>> + dbs_info->cur_policy->last_delay = delay;
>>> } else {
>>> /* We want all CPUs to do sampling nearly on
>>> * same jiffy
>>> @@ -566,14 +605,18 @@ static void do_dbs_timer(struct work_struct
>>*work)
>>> delay =
>>usecs_to_jiffies(dbs_tuners_ins.sampling_rate
>>> * dbs_info->rate_mult);
>>>
>>> - if (num_online_cpus() > 1)
>>> + if (num_online_cpus() > 1) {
>>> delay -= jiffies % delay;
>>> + dbs_info->cur_policy->last_delay =
>>delay;
>>> + }
>>> }
>>> } else {
>>> __cpufreq_driver_target(dbs_info->cur_policy,
>>> dbs_info->freq_lo, CPUFREQ_RELATION_H);
>>> delay = dbs_info->freq_lo_jiffies;
>>> + dbs_info->cur_policy->last_delay = delay;
>>> }
>>> +do_exit:
>>> schedule_delayed_work_on(cpu, &dbs_info->work, delay);
>>> mutex_unlock(&dbs_info->timer_mutex);
>>> }
>>> @@ -648,10 +691,12 @@ static int cpufreq_governor_dbs(struct
>>cpufreq_policy *policy,
>>> j_dbs_info->prev_cpu_nice =
>>> kstat_cpu(j).cpustat.ni
>>ce;
>>> }
>>> +
>>> + j_dbs_info->cpu = j;
>>> + j_dbs_info->rate_mult = 1;
>>> + ondemand_powersave_bias_init_cpu(j);
>>> }
>>> - this_dbs_info->cpu = cpu;
>>> - this_dbs_info->rate_mult = 1;
>>> - ondemand_powersave_bias_init_cpu(cpu);
>>> +
>>> /*
>>> * Start the timerschedule work, when this governor
>>> * is used for first time
>>> @@ -680,32 +725,58 @@ static int cpufreq_governor_dbs(struct
>>cpufreq_policy *policy,
>>> }
>>> mutex_unlock(&dbs_mutex);
>>>
>>> - mutex_init(&this_dbs_info->timer_mutex);
>>> - dbs_timer_init(this_dbs_info);
>>> + for_each_cpu(j, policy->cpus) {
>>> + struct cpu_dbs_info_s *j_dbs_info;
>>> + j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>>> + mutex_init(&j_dbs_info->timer_mutex);
>>> + dbs_timer_init(j_dbs_info);
>>> + }
>>> break;
>>>
>>> case CPUFREQ_GOV_STOP:
>>> - dbs_timer_exit(this_dbs_info);
>>> + for_each_cpu(j, policy->cpus) {
>>> + struct cpu_dbs_info_s *j_dbs_info;
>>> + j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>>> + dbs_timer_exit(j_dbs_info);
>>> + }
>>>
>>> mutex_lock(&dbs_mutex);
>>> - mutex_destroy(&this_dbs_info->timer_mutex);
>>> +
>>> + for_each_cpu(j, policy->cpus) {
>>> + struct cpu_dbs_info_s *j_dbs_info;
>>> + j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>>> + mutex_destroy(&j_dbs_info->timer_mutex);
>>> + }
>>> +
>>> dbs_enable--;
>>> mutex_unlock(&dbs_mutex);
>>> if (!dbs_enable)
>>> sysfs_remove_group(cpufreq_global_kobject,
>>> &dbs_attr_group);
>>> -
>>> break;
>>>
>>> case CPUFREQ_GOV_LIMITS:
>>> - mutex_lock(&this_dbs_info->timer_mutex);
>>> + /* Since ondemand governor is no longer running on the
>>*/
>>> + /* same CPU anymore,need to mutex_lock all timer_mutex
>>*/
>>> + /* before calling __cpufreq_driver_target. */
>>> + for_each_cpu(j, policy->cpus) {
>>> + struct cpu_dbs_info_s *j_dbs_info;
>>> + j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>>> + mutex_lock(&j_dbs_info->timer_mutex);
>>
>>Not sure why we need a lock on all policy->cpus mutexes here. Could as
>>well be all policy->cpus use policy->cpu's mutex. No?
>>
>>
>>> + }
>>> +
>>> if (policy->max < this_dbs_info->cur_policy->cur)
>>> __cpufreq_driver_target(this_dbs_info-
>>>cur_policy,
>>> policy->max, CPUFREQ_RELATION_H);
>>> else if (policy->min > this_dbs_info->cur_policy->cur)
>>> __cpufreq_driver_target(this_dbs_info-
>>>cur_policy,
>>> policy->min, CPUFREQ_RELATION_L);
>>> - mutex_unlock(&this_dbs_info->timer_mutex);
>>> +
>>> + for_each_cpu(j, policy->cpus) {
>>> + struct cpu_dbs_info_s *j_dbs_info;
>>> + j_dbs_info = &per_cpu(od_cpu_dbs_info, j);
>>> + mutex_unlock(&j_dbs_info->timer_mutex);
>>> + }
>>> break;
>>> }
>>> return 0;
>>> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
>>> index 11be48e..2295a35 100644
>>> --- a/include/linux/cpufreq.h
>>> +++ b/include/linux/cpufreq.h
>>> @@ -106,6 +106,9 @@ struct cpufreq_policy {
>>>
>>> struct kobject kobj;
>>> struct completion kobj_unregister;
>>> +
>>> + atomic64_t last_do_dbs_tim;
>>> + int last_delay;
>>> };
>>>
>>> #define CPUFREQ_ADJUST (0)
>>> --
>>> 1.7.1
>>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Peter Zijlstra: "Re: [PATCH 3/5] writeback: dirty rate control"
Previous message: Randy Dunlap: "Re: [PATCH 2/2] Enable 'make CONFIG_FOO=y oldconfig'"
In reply to: Venki Pallipadi: "Re: [PATCH] [RFC] ondemand governor: dynamic cpufreq scaling withdifferent CPUs"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]