Re: [PATCH 2/2] sched_ext: Add cpuperf support

From: Tejun Heo
Date: Tue Jul 30 2024 - 21:06:08 EST


Hello, Qais.

On Thu, Jul 25, 2024 at 12:45:27AM +0100, Qais Yousef wrote:
> On 06/18/24 17:12, Tejun Heo wrote:
> > sched_ext currently does not integrate with schedutil. When schedutil is the
> > governor, frequencies are left unregulated and usually get stuck close to
> > the highest performance level from running RT tasks.
>
> Have you tried to investigate why is that? By default RT run at max frequency.
> Only way to prevent them from doing that is by using uclamp
>
> https://kernel.org/doc/html/latest/scheduler/sched-util-clamp.html#sched-util-clamp-min-rt-default
>
> If that's not the cause, then it's likely something else is broken.

Nothing is necessarily broken. SCX just wasn't providing any signal to
schedutil before this patch, so schedutil ends up just going by with the
occasional signals from RT.

...
> What is exactly the problem you're seeing? You shouldn't need to set
> performance directly. Are you trying to fix a problem, or add a new feature?

I'm having a hard time following the question. When active, the BPF
scheduler is the only one who can tell how much each CPU is being used, and
the straightforward to integrate with schedutil is by indicating what the
current CPU utilization level is which is the signal that schedutil takes
from each scheduler class.

> > This gives direct control over CPU performance setting to the BPF scheduler.
>
> Why would we need to do that? schedutil is supposed to operate in utilization

Because nobody else knows? No one *can* know. Schedutil is driven by the
utilization signal from the scheduler. Here, the scheduler is implemented in
BPF. Nothing else knows how much workload each CPU is going to get. Note
that the kernel side does not have any meaningful information re. task <->
CPU relationship. That can change on every dispatch. Only the BPF scheduler
itself knows how the CPUs are going to be used.

> signal. Overriding it with custom unknown changes makes it all random governor
> based on what's current bpf sched_ext is loaded? This make bug reports and
> debugging problems a lot harder.

If schedutil is misbehaving while an SCX scheduler is loaded, it's the BPF
scheduler's fault. There isn't much else to it.

> I do hope by the way that loading external scheduler does cause the kernel to
> be tainted. With these random changes, it's hard to know if it is a problem in
> the kernel or with external out of tree entity. Out of tree modules taint the
> kernel, so should loading sched_ext.

Out of tree modules can cause corruptions which can have lasting impacts on
the system even after the module is unloaded. That's why we keep the
persistent taint flags - to remember that the current misbehavior has a
reasonable chance of being impacted by what happened to the system
beforehand. Kernel bugs aside, SCX schedulers shouldn't leave persistent
impacts on the system. In those cases, we usually mark the dumps which SCX
already does.

> It should not cause spurious reports, nor prevent us from changing the code
> without worrying about breaking out of tree code.

Oh yeah, we can and will make compatibility breaking changes. Not
willy-nilly, hopefully.

...
> > + /*
> > + * The heuristics in this function is for the fair class. For SCX, the
> > + * performance target comes directly from the BPF scheduler. Let's just
> > + * follow it.
> > + */
> > + if (scx_switched_all())
> > + return false;
>
> Why do you need to totally override? What problems did you find in current util
> value and what have you done to try to fix it first rather than override it
> completely?

Because it's way cleaner this way. Otherwise, we need to keep calling into
update_blocked_averages() so that the fair class's util metrics can decay
(it often doesn't decay completely due to math inaccuracies but Vincent was
looking at it). When switched_all(), the fair class is not used at all and
keeping calling it to decay the utility metrics to zero seems kinda silly.

...
> > +/**
> > + * scx_bpf_cpuperf_cap - Query the maximum relative capacity of a CPU
> > + * @cpu: CPU of interest
> > + *
> > + * Return the maximum relative capacity of @cpu in relation to the most
> > + * performant CPU in the system. The return value is in the range [1,
> > + * %SCX_CPUPERF_ONE]. See scx_bpf_cpuperf_cur().
> > + */
> > +__bpf_kfunc u32 scx_bpf_cpuperf_cap(s32 cpu)
> > +{
> > + if (ops_cpu_valid(cpu, NULL))
> > + return arch_scale_cpu_capacity(cpu);
> > + else
> > + return SCX_CPUPERF_ONE;
> > +}
>
> Hmm. This is tricky. It looks fine, but I worry about changing how we want to
> handle capacities in the future and then being tied down forever with out of
> tree sched_ext not being able to load.
>
> How are we going to protect against such potential changes? Just make it a NOP?

So, if it's a necessary change, we just break the API and update the
out-of-tree schedulers accordingly. With BPF's CO-RE and supporting
features, we can handle quite a bit of backward compatibility - ie. it's
cumbersome but usually possible to provide BPF-side helpers that allow
updated BPF schedulers to be loaded in both old and new kernels, so it
usually isn't *that* painful.

> A bit hypothetical but so far these are considered internal scheduler details
> that could change anytime with no consequence. With this attaching to this info
> changing them will become a lot harder as there's external dependencies that
> will fail to load or work properly. And what is the regression rule in this
> case?

This is not different from any other BPF hooks and has been discussed many
times. As I wrote before, we don't want to change for no reason but we
definitely can change things if necessary.

> You should make all functions return an error to future proof them against
> suddenly disappearing.

Really, no need to be that draconian.

...
> > +__bpf_kfunc void scx_bpf_cpuperf_set(u32 cpu, u32 perf)
> > +{
> > + if (unlikely(perf > SCX_CPUPERF_ONE)) {
> > + scx_ops_error("Invalid cpuperf target %u for CPU %d", perf, cpu);
> > + return;
> > + }
> > +
> > + if (ops_cpu_valid(cpu, NULL)) {
> > + struct rq *rq = cpu_rq(cpu);
> > +
> > + rq->scx.cpuperf_target = perf;
> > +
> > + rcu_read_lock_sched_notrace();
> > + cpufreq_update_util(cpu_rq(cpu), 0);
> > + rcu_read_unlock_sched_notrace();
> > + }
> > +}
>
> Is the problem that you break how util signal works in sched_ext? Or you want
> the fine control? We expect user application to use uclamp to set their perf
> requirement. And sched_ext should not break util signal, no? If it does and
> there's a good reason for it, then it is not compatible with schedutil, as the
> name indicates it operates on util signal as defined in PELT.
>
> You can always use min_freq/max_freq in sysfs to force min and max frequencies
> without hacking the governor. I don't advise it though and I'd recommend trying
> to be compatible with schedutil as-is rather than modify it. Consistency is
> a key.

It's not hacking the governor and uclamp should work the same way on top.
The BPF scheduler is the only thing that can tell how and how much a given
CPU is going to be used. There is no way for the kernel to project per-CPU
utilization without asking the BPF scheduler.

Thanks.

--
tejun