Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller

From: Patrick Bellasi
Date: Mon Jul 15 2019 - 09:38:13 EST

Next message: Sasha Levin: "[PATCH AUTOSEL 5.1 001/219] ath10k: Check tx_stats before use it"
Previous message: Sasha Levin: "[PATCH AUTOSEL 5.2 024/249] selftests/bpf: adjust verifier scale test"
In reply to: Quentin Perret: "Re: [PATCH v11 1/5] sched/core: uclamp: Extend CPU's cgroup controller"
Next in thread: Patrick Bellasi: "[PATCH v11 2/5] sched/core: uclamp: Propagate parent clamps"

On 08-Jul 12:08, Quentin Perret wrote:
> Hi Patrick,

Hi Quentin!

> On Monday 08 Jul 2019 at 09:43:53 (+0100), Patrick Bellasi wrote:
> > +static inline int uclamp_scale_from_percent(char *buf, u64 *value)
> > +{
> > + *value = SCHED_CAPACITY_SCALE;
> > +
> > + buf = strim(buf);
> > + if (strncmp("max", buf, 4)) {
> > + s64 percent;
> > + int ret;
> > +
> > + ret = cgroup_parse_float(buf, 2, &percent);
> > + if (ret)
> > + return ret;
> > +
> > + percent <<= SCHED_CAPACITY_SHIFT;
> > + *value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
> > + }
> > +
> > + return 0;
> > +}
> > +
> > +static inline u64 uclamp_percent_from_scale(u64 value)
> > +{
> > + return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
> > +}
>
> FWIW, I tried the patches and realized these conversions result in a
> 'funny' behaviour from a user's perspective. Things like this happen:
>
> $ echo 20 > cpu.uclamp.min
> $ cat cpu.uclamp.min
> 20.2
> $ echo 20.2 > cpu.uclamp.min
> $ cat cpu.uclamp.min
> 20.21
>
> Having looked at the code, I get why this is happening, but I'm not sure
> if a random user will. It's not an issue per se, but it's just a bit
> weird.

Yes, that's what we get if we need to use a "two decimal digit
precision percentage" to represent a 1024 range in kernel space.

I don't think the "percent <=> utilization" conversion code can be
made more robust. The only alternative I see for reading back exactly
what was written is to store the actual request in kernel space,
alongside its conversion to the SCHED_CAPACITY_SCALE range required by
the actual scheduler code.

Something along these lines (on top of what we have in this series):

---8<---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ddc5fcd4b9cf..82b28cfa5c3f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7148,40 +7148,35 @@ static void cpu_util_update_eff(struct cgroup_subsys_state *css)
}
}

-static inline int uclamp_scale_from_percent(char *buf, u64 *value)
+static inline int uclamp_scale_from_percent(char *buf, s64 *percent, u64 *scale)
{
- *value = SCHED_CAPACITY_SCALE;
+ *scale = SCHED_CAPACITY_SCALE;

buf = strim(buf);
if (strncmp("max", buf, 4)) {
- s64 percent;
int ret;

- ret = cgroup_parse_float(buf, 2, &percent);
+ ret = cgroup_parse_float(buf, 2, percent);
if (ret)
return ret;

- percent <<= SCHED_CAPACITY_SHIFT;
- *value = DIV_ROUND_CLOSEST_ULL(percent, 10000);
+ *scale = *percent << SCHED_CAPACITY_SHIFT;
+ *scale = DIV_ROUND_CLOSEST_ULL(*scale, 10000);
}

return 0;
}

-static inline u64 uclamp_percent_from_scale(u64 value)
-{
- return DIV_ROUND_CLOSEST_ULL(value * 10000, SCHED_CAPACITY_SCALE);
-}
-
static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
char *buf, size_t nbytes,
loff_t off)
{
struct task_group *tg;
u64 min_value;
+ s64 percent;
int ret;

- ret = uclamp_scale_from_percent(buf, &min_value);
+ ret = uclamp_scale_from_percent(buf, &percent, &min_value);
if (ret)
return ret;
if (min_value > SCHED_CAPACITY_SCALE)
@@ -7197,6 +7192,9 @@ static ssize_t cpu_uclamp_min_write(struct kernfs_open_file *of,
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(of_css(of));

+ /* Keep track of the actual requested value */
+ tg->uclamp_pct[UCLAMP_MIN] = percent;
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

@@ -7209,9 +7207,10 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
{
struct task_group *tg;
u64 max_value;
+ s64 percent;
int ret;

- ret = uclamp_scale_from_percent(buf, &max_value);
+ ret = uclamp_scale_from_percent(buf, &percent, &max_value);
if (ret)
return ret;
if (max_value > SCHED_CAPACITY_SCALE)
@@ -7227,6 +7226,9 @@ static ssize_t cpu_uclamp_max_write(struct kernfs_open_file *of,
/* Update effective clamps to track the most restrictive value */
cpu_util_update_eff(of_css(of));

+ /* Keep track of the actual requested value */
+ tg->uclamp_pct[UCLAMP_MAX] = percent;
+
rcu_read_unlock();
mutex_unlock(&uclamp_mutex);

@@ -7251,7 +7253,7 @@ static inline void cpu_uclamp_print(struct seq_file *sf,
return;
}

- percent = uclamp_percent_from_scale(util_clamp);
+ percent = tg->uclamp_pct[clamp_id];
percent = div_u64_rem(percent, 100, &rem);
seq_printf(sf, "%llu.%u\n", percent, rem);
}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e37f4a4e536..4f9b0c660310 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -395,6 +395,8 @@ struct task_group {
struct cfs_bandwidth cfs_bandwidth;

#ifdef CONFIG_UCLAMP_TASK_GROUP
+ /* The two decimal precision [%] value requested from user-space */
+ unsigned int uclamp_pct[UCLAMP_CNT];
/* Clamp values requested for a task group */
struct uclamp_se uclamp_req[UCLAMP_CNT];
/* Effective clamp values used for a task group */
---8<---

> I guess one way to fix this would be to revert back to having a
> 1024-scale for the cgroup interface too ... Though I understand Tejun
> wanted % for consistency with other things.

Yes, that would be another option, which would also keep the CGroups
API aligned with the per-task and system-wide ones. Although, AFAIU,
having two different APIs is not considered a major issue.

> So, I'm not sure if this is still up for discussion, but in any case I
> wanted to say I support your original idea of using a 1024-scale for the
> cgroups interface, since that would solve the 'issue' above and keeps
> things consistent with the per-task API too.

Right, I'm personally leaning more toward either going back to using
SCHED_CAPACITY_SCALE or adding the small change I suggested above.

Tejun, Peter: any preference? Alternative suggestions?

> Thanks,
> Quentin

Cheers,
Patrick

--
#include <best/regards.h>

Patrick Bellasi
