Re: [PATCH] sched/uclamp: Fix iowait boost UCLAMP_MAX escape
From: Qais Yousef
Date: Thu Mar 28 2024 - 19:10:25 EST
On 03/26/24 18:00, Christian Loehle wrote:
> A task, regardless of UCLAMP_MAX value, was previously allowed to
> build up the sg_cpu->iowait boost up to SCHED_CAPACITY_SCALE when
> enqueued. Since the boost was only uclamped when applied this led
> to sugov iowait boosting the rq while the task is dequeued.
>
> The fix introduced by
> commit d37aee9018e6 ("sched/uclamp: Fix iowait boost escaping uclamp restriction")
> added the uclamp check before the boost is applied. Unfortunately
> that is insufficient, as the iowait_boost may be built up purely by
> a task with UCLAMP_MAX task, but since this task is in_iowait often,
> the clamps are no longer active during the in_iowait periods.
> So another task (let's say with low utilization) may immediately
> receive the iowait_boost value previously built up under UCLAMP_MAX
> restrictions.
This is the intended behavior. Like utilization value, it should build up
normally but at key decision points it gets clamped and the result of this
clamping operation is used to make the decision. The reason is that this
performance restriction could be left at any point of time and the system/task
should go to the original behavior when this constraint is left.
Beside the current design for iowait boost doesn't differentiate between who
added the boost. So we are in for unnecessary complexity for a mechanism that
needs to evolve anyway.
As an alternative We could say that tasks with uclamp_max shouldn't cause
iowait boost to increase, which can be a reasonable assumption for many cases.
But not a safe one in practice as it makes assumptions on who should use
uclamp_max and restrict the benefit for other use cases. And we don't want to
impose restrictions on who can benefit from it.
>
> The issue is less prevalent than the above might suggest, since if
> the dequeuing of the UCLAMP_MAX set task will turn the cpu idle the
> previous UCLAMP_MAX value is preserved by uclamp_idle_value().
> Nonetheless anything being enqueued on the rq during the in_iowait
> phase will falsely receive the iowait_boost.
>
> Can be observed with a basic single-threaded benchmark running with
> UCLAMP_MAX of 0, the iowait_boost is then triggered by the occasional
> kworker.
I think this a combination of two problems:
1. The max aggregation problem that have been discussed several times already.
2. It is a limitation of the current mechanism. Moving to per-task iowait boost
we can do smarter behavior to handle this.
I think we should focus on handling these two issues. For this particular use
case, the latter is the major problem. Once the iowait boosted task is
dequeued, the CPU shouldn't need to run at faster frequency. The iowait boosted
tasks are usually short running too. So the latter improvement should mean the
CPU can move to lower frequency during its waiting time. Though we'll have to
see if the boost need to extend until the softirq has finished.
So overall I don't think we have a problem on how the hint is applied,
but we need to fix problems elsewhere to make the overall behavior better.
Thanks!
--
Qais Yousef
>
> Fixes: 982d9cdc22c9 ("sched/cpufreq, sched/uclamp: Add clamps for FAIR and RT tasks")
> Signed-off-by: Christian Loehle <christian.loehle@xxxxxxx>
> ---
> kernel/sched/cpufreq_schedutil.c | 36 +++++++++++++++++++++++++-------
> 1 file changed, 28 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index eece6244f9d2..bfd79762b28d 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -205,6 +205,25 @@ static void sugov_get_util(struct sugov_cpu *sg_cpu, unsigned long boost)
> sg_cpu->util = sugov_effective_cpu_perf(sg_cpu->cpu, util, min, max);
> }
>
> +/**
> + * sugov_iowait_clamp() - Clamp the boost with UCLAMP_MAX
> + * @sg_cpu: the sugov data for the CPU
> + * @boost: the requested new boost
> + *
> + * Clamps the iowait boost according to the rq's UCLAMP_MAX restriction.
> + */
> +static void sugov_iowait_clamp(struct sugov_cpu *sg_cpu, unsigned int boost)
> +{
> +#if CONFIG_UCLAMP_TASK
> + unsigned int boost_scaled = (boost *
> + arch_scale_cpu_capacity(sg_cpu->cpu)) >> SCHED_CAPACITY_SHIFT;
> +
> + if (uclamp_rq_get(cpu_rq(sg_cpu->cpu), UCLAMP_MAX) < boost_scaled)
> + return;
> +#endif
> + sg_cpu->iowait_boost = boost;
> + sg_cpu->iowait_boost_pending = true;
> +}
> /**
> * sugov_iowait_reset() - Reset the IO boost status of a CPU.
> * @sg_cpu: the sugov data for the CPU to boost
> @@ -225,8 +244,8 @@ static bool sugov_iowait_reset(struct sugov_cpu *sg_cpu, u64 time,
> if (delta_ns <= TICK_NSEC)
> return false;
>
> - sg_cpu->iowait_boost = set_iowait_boost ? IOWAIT_BOOST_MIN : 0;
> - sg_cpu->iowait_boost_pending = set_iowait_boost;
> + if (set_iowait_boost)
> + sugov_iowait_clamp(sg_cpu, IOWAIT_BOOST_MIN);
>
> return true;
> }
> @@ -249,6 +268,7 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> unsigned int flags)
> {
> bool set_iowait_boost = flags & SCHED_CPUFREQ_IOWAIT;
> + unsigned int iowait_boost;
>
> /* Reset boost if the CPU appears to have been idle enough */
> if (sg_cpu->iowait_boost &&
> @@ -262,17 +282,17 @@ static void sugov_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
> /* Ensure boost doubles only one time at each request */
> if (sg_cpu->iowait_boost_pending)
> return;
> - sg_cpu->iowait_boost_pending = true;
>
> /* Double the boost at each request */
> if (sg_cpu->iowait_boost) {
> - sg_cpu->iowait_boost =
> - min_t(unsigned int, sg_cpu->iowait_boost << 1, SCHED_CAPACITY_SCALE);
> - return;
> + iowait_boost = min_t(unsigned int, sg_cpu->iowait_boost << 1,
> + SCHED_CAPACITY_SCALE);
> + } else {
> + /* First wakeup after IO: start with minimum boost */
> + iowait_boost = IOWAIT_BOOST_MIN;
> }
>
> - /* First wakeup after IO: start with minimum boost */
> - sg_cpu->iowait_boost = IOWAIT_BOOST_MIN;
> + sugov_iowait_clamp(sg_cpu, iowait_boost);
> }
>
> /**
> --
> 2.34.1
>