Re: [RFCv5 PATCH 43/46] sched/{fair,cpufreq_sched}: add reset_capacity interface

From: Steve Muckle
Date: Thu Oct 08 2015 - 16:41:23 EST


Hi Juri,

On 07/07/2015 11:24 AM, Morten Rasmussen wrote:
> From: Juri Lelli <juri.lelli@xxxxxxx>
>
> When a CPU is going idle it is pointless to ask for an OPP update as we
> would wake up another task only to ask for the same capacity we are already
> running at (utilization gets moved to blocked_utilization). We thus add
> cpufreq_sched_reset_capacity() interface to just reset our current capacity
> request without triggering any real update. At wakeup we will use the
> decayed utilization to select an appropriate OPP.
...
> +void cpufreq_sched_reset_cap(int cpu)
> +{
> + per_cpu(pcpu_capacity, cpu) = 0;
> +}
> +
...
> @@ -4427,9 +4427,13 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> if (sched_energy_freq() && task_sleep) {
> unsigned long req_cap = get_cpu_usage(cpu_of(rq));
>
> - req_cap = req_cap * capacity_margin
> - >> SCHED_CAPACITY_SHIFT;
> - cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> + if (rq->cfs.nr_running) {
> + req_cap = req_cap * capacity_margin
> + >> SCHED_CAPACITY_SHIFT;
> + cpufreq_sched_set_cap(cpu_of(rq), req_cap);
> + } else {
> + cpufreq_sched_reset_cap(cpu_of(rq));
> + }
> }

Though I understand the initial stated motivation here (avoiding a
redundant capacity request upon idle entry), releasing the CPU's
capacity request altogether on idle seems like it could be a contentious
policy decision.

An example to illustrate my concern:
- 2 CPU single frequency domain topology
- task A is a small frequently-running task on CPU0
- task B is a heavier intermittent task running on CPU1

Task B is driving the frequency of the cluster high, but whenever it
sleeps CPU1 becomes idle and the capacity request is dropped. If there's
any activity on CPU0 that causes cpufreq_sched_set_cap() to be called
(which is likely, given task A runs often) the cluster frequency will be
lowered. Task B's performance will be impacted when it wakes up because
initially the OPP will be insufficient. Power may or may not be
impacted, depending on the characteristics of the workload and system,
and whether energy is saved with the additional frequency scaling.

The decision of when a CPU's vote should be decayed or removed is more
policy where I believe there's no single right answer and in the past,
has been solved with tunables. The interactive governor's slack timer
controls how long it will allow an idle CPU to request a frequency > fmin.

If the utilization of a long-idle CPU could be decayed by a different
CPU in the system, perhaps it could take care of updating that CPU's vote?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/