[RFC v2 8/8] sched/{fair,tune}: add support for negative boosting

From: Patrick Bellasi
Date: Thu Oct 27 2016 - 13:41:45 EST


Boosting support allows a signal to be inflated by a margin which is
defined to be proportional to its delta from its maximum possible
value. Such a mechanism allows a task to run at an OPP which is higher
than the minimum capacity which can satisfy its demands.
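
As a reference for the formulas below, here is a minimal userspace sketch of
the existing positive (SPC) margin, assuming the 0..1024 SCHED_CAPACITY_SCALE
range used by the fair class (illustrative only; the in-kernel code uses a
reciprocal divide instead of the plain division):

  /* M = B * (SCHED_CAPACITY_SCALE - S), with B in [0, 100] */
  static long spc_margin(long signal, int boost)
  {
          return boost * (1024 - signal) / 100;
  }

  /* e.g. spc_margin(512, 50) == 256: a 50% task boosted 50% runs at ~75% */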

In certain use-cases we could be interested in the opposite goal, i.e.
running a task at an OPP which is lower than the minimum required.
Currently the only way to achieve such a goal is to use the "powersave"
governor, thus forcing all tasks to run at the lowest OPP, or the
"userspace" governor, which still forces all tasks to run at a certain OPP.

With the availability of schedutil and the addition of SchedTune, we now
have the support to tune the way OPPs are selected depending on which
tasks are active on a CPU.

This patch extends SchedTune to introduce support for negative
boosting. While positive boosting inflates a signal, negative boosting
allows us to artificially reduce the value of a signal. The boosting
strategy used to reduce a signal is quite simple and extends the concept
of "margin" already used for positive boosting.

The Boost (B) value [%] is used to compute a Margin (M) which, in case
of negative boosting, is a fraction of the original Signal (S):

M = B * S, when B is in [-100%, 0%)

Such a value of M is defined to be a negative quantity which, once added
to the original signal S, reduces that signal by a fraction of its
original value.

With such a definition, a 50% utilization task will run at:
- 25% capacity OPP when boosted -50%
- minimum capacity OPP when boosted -100%
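
For instance, with S = 512 (a 50% utilization task on the 0..1024 scale) and
B = -50%, the margin is M = -0.5 * 512 = -256, so the boosted signal becomes
256, i.e. 25% of capacity, matching the first case above. A minimal sketch of
the negative margin, mirroring the hypothetical helper above (illustrative
only, not the in-kernel code):

  /* M = B * S, with B in [-100, 0); the result is always <= 0 */
  static long neg_margin(long signal, int boost)
  {
          return boost * signal / 100;
  }

  /* e.g. neg_margin(512, -50) == -256: boosted signal is 512 - 256 = 256 */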

It is worth noticing that the boost values of all tasks on a CPU are
aggregated to figure out the maximum boost value currently required.
Thus, for example, if we have two tasks:
T1 boosted @ -20%
T2 boosted @ +30%
when T2 is active the CPU is boosted by +30%, even if T1 is also active,
while the CPU is "slowed down" by 20% when T1 is the only task active on
that CPU.
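
A minimal sketch of this per-CPU aggregation, following the max() policy
implemented by schedtune_cpu_update() below (the names here are made up for
the example):

  int cpu_boost = -100;   /* lower bound of the boost range */
  bool active = false;
  int idx;

  for (idx = 0; idx < nr_groups; ++idx) {
          if (!group[idx].runnable_tasks)
                  continue;
          cpu_boost = max(cpu_boost, group[idx].boost);
          active = true;
  }
  /* With no RUNNABLE tasks, fall back to no boosting at all */
  if (!active)
          cpu_boost = 0;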

Cc: Jonathan Corbet <corbet@xxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Suggested-by: Srinath Sridharan <srinathsr@xxxxxxxxxx>
Signed-off-by: Patrick Bellasi <patrick.bellasi@xxxxxxx>
---
Documentation/scheduler/sched-tune.txt | 44 ++++++++++++++++++++++++++++++----
include/linux/sched/sysctl.h | 6 ++---
kernel/sched/fair.c | 38 +++++++++++++++++++++--------
kernel/sched/tune.c | 33 +++++++++++++++----------
kernel/sysctl.c | 3 ++-
5 files changed, 92 insertions(+), 32 deletions(-)

diff --git a/Documentation/scheduler/sched-tune.txt b/Documentation/scheduler/sched-tune.txt
index da7b3eb..5822f9f 100644
--- a/Documentation/scheduler/sched-tune.txt
+++ b/Documentation/scheduler/sched-tune.txt
@@ -100,12 +100,17 @@ This permits expressing a boost value as an integer in the range [0..100].

A value of 0 (default) for a CFS task means that schedutil will attempt
to match compute capacity of the CPU where the task is scheduled to
-match its current utilization with a few spare cycles left. A value of
-100 means that schedutil will select the highest available OPP.
+match its current utilization with a few spare cycles left.

-The range between 0 and 100 can be set to satisfy other scenarios suitably.
-For example to satisfy interactive response or depending on other system events
-(battery level, thermal status, etc).
+A value of 100 means that schedutil will select the highest available OPP,
+while a negative value means that schedutil will try to run tasks at lower
+OPPs. Together, positive and negative boost values allow schedutil to provide
+behaviors similar to those of the existing "performance" and "powersave"
+governors, but with a more fine-grained control.
+
+The range between -100 and 100 can be set to satisfy other scenarios suitably.
+For example to satisfy interactive response requirements or to react to other
+system events (battery level, thermal status, etc).

A CGroup based extension is also provided, which permits further user-space
defined task classification to tune the scheduler for different goals depending
@@ -227,6 +232,27 @@ corresponding to a 50% boost is midway from the original signal and the upper
bound. Boosting by 100% generates a boosted signal which is always saturated to
the upper bound.

+Negative boosting
+-----------------
+
+While positive boosting uses the SPC strategy to inflate a signal, negative
+boosting artificially reduces the value of a signal. The boosting strategy
+used to reduce a signal is quite simple and extends the concept of "margin"
+already used for positive boosting.
+
+When sched_cfs_boost is defined in [-100%, 0%), the boost value [%] is used to
+compute a margin which is a fraction of the original signal:
+
+ margin := sched_cfs_boost * signal
+
+Such a margin is defined to be a negative quantity which, once added to the
+original signal, reduces that signal by a fraction of its original
+value.
+
+With such a definition, for example a 50% utilization task will run at:
+ - 25% capacity OPP when boosted -50%
+ - minimum capacity OPP when boosted -100%
+

4. OPP selection using boosted CPU utilization
==============================================
@@ -304,6 +330,14 @@ main characteristics:
which has to compute the per CPU boosting once there are multiple
RUNNABLE tasks with different boost values.

+It is worth noticing that the boost values of all tasks on a CPU are aggregated
+to figure out the maximum boost value currently required. Thus, for example,
+if we have two tasks:
+ T1 boosted @ -20%
+ T2 boosted @ +30%
+when T2 is active the CPU is boosted by +30%, even if T1 is also active,
+while the CPU is "slowed down" by 20% when T1 is the only task active on that CPU.
+
Such a simple design should allow servicing the main utilization scenarios
identified so far. It provides a simple interface which can be used to manage
the power-performance of all tasks or only selected tasks. Moreover, this
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index 5bfbb14..fe878c9 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -56,16 +56,16 @@ extern unsigned int sysctl_sched_cfs_bandwidth_slice;
#endif

#ifdef CONFIG_SCHED_TUNE
-extern unsigned int sysctl_sched_cfs_boost;
+extern int sysctl_sched_cfs_boost;
int sysctl_sched_cfs_boost_handler(struct ctl_table *table, int write,
void __user *buffer, size_t *length,
loff_t *ppos);
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
{
return sysctl_sched_cfs_boost;
}
#else
-static inline unsigned int get_sysctl_sched_cfs_boost(void)
+static inline int get_sysctl_sched_cfs_boost(void)
{
return 0;
}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f56953b..43a4989 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5580,17 +5580,34 @@ struct reciprocal_value schedtune_spc_rdiv;
* schedtune_margin returns the "margin" to be added on top of
* the original value of a "signal".
*
- * The Boost (B) value [%] is used to compute a Margin (M) which
- * is proportional to the complement of the original Signal (S):
+ * The Boost (B) value [%] is used to compute a Margin (M) which, in case of
+ * positive boosting, is proportional to the complement of the original
+ * Signal (S):
*
- * M = B * (SCHED_CAPACITY_SCALE - S)
+ * M = B * (SCHED_CAPACITY_SCALE - S), when B is in (0%, 100%]
+ *
+ * In case of negative boosting, the computed margin is a fraction of the
+ * original S:
+ *
+ * M = B * S, when B is in [-100%, 0%)
*
* The obtained value M could be used by the caller to "boost" S.
*/
-static unsigned long
-schedtune_margin(unsigned long signal, unsigned int boost)
+static long
+schedtune_margin(unsigned long signal, int boost)
{
- unsigned long long margin = 0;
+ long long margin = 0;
+
+ /* A -100% boost nullifies the original signal */
+ if (unlikely(boost == -100))
+ return -signal;
+
+ /* A negative boost produces a proportional (negative) margin */
+ if (unlikely(boost < 0)) {
+ margin = -boost * signal;
+ margin = reciprocal_divide(margin, schedtune_spc_rdiv);
+ return -margin;
+ }

/* Do not boost saturated signals */
if (signal >= SCHED_CAPACITY_SCALE)
@@ -5606,10 +5623,10 @@ schedtune_margin(unsigned long signal, unsigned int boost)
return margin;
}

-static inline unsigned long
+static inline long
schedtune_cpu_margin(unsigned long util, int cpu)
{
- unsigned int boost = schedtune_cpu_boost(cpu);
+ int boost = schedtune_cpu_boost(cpu);

if (boost == 0)
return 0UL;
@@ -5619,7 +5636,7 @@ schedtune_cpu_margin(unsigned long util, int cpu)

#else /* CONFIG_SCHED_TUNE */

-static inline unsigned long
+static inline long
schedtune_cpu_margin(unsigned long util, int cpu)
{
return 0;
@@ -5665,9 +5682,10 @@ unsigned long boosted_cpu_util(int cpu)
{
unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
unsigned long capacity = capacity_orig_of(cpu);
+ int boost = schedtune_cpu_boost(cpu);

/* Do not boost saturated utilizations */
- if (util >= capacity)
+ if (boost >= 0 && util >= capacity)
return capacity;

/* Add margin to current CPU's capacity */
diff --git a/kernel/sched/tune.c b/kernel/sched/tune.c
index 965a3e1..ed90830 100644
--- a/kernel/sched/tune.c
+++ b/kernel/sched/tune.c
@@ -13,7 +13,7 @@
#include "sched.h"
#include "tune.h"

-unsigned int sysctl_sched_cfs_boost __read_mostly;
+int sysctl_sched_cfs_boost __read_mostly;

#ifdef CONFIG_CGROUP_SCHED_TUNE

@@ -32,7 +32,7 @@ struct schedtune {
int idx;

/* Boost value for tasks on that SchedTune CGroup */
- unsigned int boost;
+ int boost;

};

@@ -95,10 +95,10 @@ static struct schedtune *allocated_group[boostgroups_max] = {
*/
struct boost_groups {
/* Maximum boost value for all RUNNABLE tasks on a CPU */
- unsigned int boost_max;
+ int boost_max;
struct {
/* The boost for tasks on that boost group */
- unsigned int boost;
+ int boost;
/* Count of RUNNABLE tasks on that boost group */
unsigned int tasks;
} group[boostgroups_max];
@@ -112,15 +112,14 @@ DEFINE_PER_CPU(struct boost_groups, cpu_boost_groups);
static void
schedtune_cpu_update(int cpu)
{
+ bool active_tasks = false;
struct boost_groups *bg;
- unsigned int boost_max;
+ int boost_max = -100;
int idx;

bg = &per_cpu(cpu_boost_groups, cpu);

- /* The root boost group is always active */
- boost_max = bg->group[0].boost;
- for (idx = 1; idx < boostgroups_max; ++idx) {
+ for (idx = 0; idx < boostgroups_max; ++idx) {
/*
* A boost group affects a CPU only if it has
* RUNNABLE tasks on that CPU
@@ -128,8 +127,13 @@ schedtune_cpu_update(int cpu)
if (bg->group[idx].tasks == 0)
continue;
boost_max = max(boost_max, bg->group[idx].boost);
+ active_tasks = true;
}

+ /* Reset boosting when there are no RUNNABLE tasks on this CPU */
+ if (!active_tasks)
+ boost_max = 0;
+
bg->boost_max = boost_max;
}

@@ -383,7 +387,7 @@ void schedtune_exit_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &rq_flags);
}

-static u64
+static s64
boost_read(struct cgroup_subsys_state *css, struct cftype *cft)
{
struct schedtune *st = css_st(css);
@@ -393,15 +397,18 @@ boost_read(struct cgroup_subsys_state *css, struct cftype *cft)

static int
boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
- u64 boost)
+ s64 boost)
{
struct schedtune *st = css_st(css);

- if (boost > 100)
+ if (boost < -100 || boost > 100)
return -EINVAL;
+
+ /* Update boostgroup and global boosting (if required) */
st->boost = boost;
if (css == &root_schedtune.css)
sysctl_sched_cfs_boost = boost;
+
/* Update CPU boost */
schedtune_boostgroup_update(st->idx, st->boost);

@@ -411,8 +418,8 @@ boost_write(struct cgroup_subsys_state *css, struct cftype *cft,
static struct cftype files[] = {
{
.name = "boost",
- .read_u64 = boost_read,
- .write_u64 = boost_write,
+ .read_s64 = boost_read,
+ .write_s64 = boost_write,
},
{ } /* terminate */
};
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 12c3432..3b412fb 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -127,6 +127,7 @@ static int __maybe_unused four = 4;
static unsigned long one_ul = 1;
static int one_hundred = 100;
static int one_thousand = 1000;
+static int __maybe_unused one_hundred_neg = -100;
#ifdef CONFIG_PRINTK
static int ten_thousand = 10000;
#endif
@@ -453,7 +454,7 @@ static struct ctl_table kern_table[] = {
.mode = 0644,
#endif
.proc_handler = &sysctl_sched_cfs_boost_handler,
- .extra1 = &zero,
+ .extra1 = &one_hundred_neg,
.extra2 = &one_hundred,
},
#endif
--
2.10.1