Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

From: Qais Yousef
Date: Mon May 27 2024 - 16:29:36 EST


On 05/17/24 11:58, Peter Zijlstra wrote:

> > I really don't think the problems we have are because of EEVDF vs CFS vs
> > anything else. Other major OSes have one scheduler, but what they exceed on is
> > providing better QoS interfaces and mechanism to handle specific scenarios that
> > Linux lacks.
>
> Quite possibly. The immediate problem being that adding interfaces is
> terrifying. Linus has a rather strong opinion about breaking stuff, and
> getting this wrong will very quickly result in a paint-into-corner type
> problem.

We need to move forward though. Let us find an approach and agree with Linus on
what will constitute regressions when things have to disappear.

My general worry is more about default behavior not just interfaces. As pointed
out below, the default behavior favoured throughput for a long time because
folks who care about throughput were more vocal, but it seems we have a silent
majority problem of people who need latency by default but find no path
forward.

And of course it'll never be possible to make them both happy by default.
Regression reports in this area need to consider the wider impact on other
users when deciding whether it needs to be fixed or not.

We do seem to be held hostages sometimes by older systems/workloads who IMHO if
can't move to use new facilities provided, have no good reason to complain
about regressions as we need to look forward for what new workloads and system
need by default. The world is moving on too fast - but we can't catch up due to
these regression reports. We need a better balance IMHO.

>
> We can/could add fields to sched_attr under the understanding that
> they're purely optional and try thing, *however* too many such fields
> and we're up a creek again.

My personal vision on this is this

https://lore.kernel.org/lkml/20230916213316.p36nhgnibsidoggt@airbuntu/

We don't need to continue to add new fields as this is a problem actually when
it comes to integrating to libc (who yet to have proper wrappers in pthreads
for the things we added). A u32 should be virtually infinite number of hints.
We should be able to deprecate at ease by making a specific hint type return an
error when it longer supported (-ENOSYS). We can even create uclamp alias for
this (that is called performance_hint given how widely people interpret uclamp
as a bandwidth hint) and make it the soruce of QoS truth.

> > Similarly for latency vs throughput. What is the correct way to
> > write an application to provide this info? Then we can ask what is missing in
> > the scheduler to enable this.
>
> Right, so the EEVDF thing is a start here. By providing a per task
> request size, applications can indicate if they want frequent and short
> activations or more infrequent longer activations.
>
> An application can know it's (average) activation time, the kernel has
> no clue when work starts and is completed. Applications can fairly
> trivially measure this using CLOCK_THREAD_CPUTIME_ID reads before and
> after and communicate this (very much like SCHED_DEADLINE).

I fear the concept of time in userspace will be hard to get right without some
further help from us due to DVFS/HMP having a Black Hole effect and causing
extreme Time Dilution problem. It is actually a problem for schedutil that I am
trying to find a reasonable fix for as part of my magic margins series. On one
system I ran a test on, it took 30ms to take off from util 0. And the system
stayed running at the lowest frequency for 42ms! Our utilization invariance is
very good for estimating compute demand, but terrible for bursty tasks - which
I are very common on interactive systems. I think this is a cause of many
'latency' woos in general. Things can end up running slower for longer. But
this is a different problem for a different series/day.

FWIW even for userspace trying to create dynamic uclamp control are struggling
because they can't measure time reliable. A task can seem happy, but only
because something else on another CPU with shared policy had a 'heavy' tasks
running. As soon as this goes to sleep things look really different from the
tasks runtime perspective. It was running super fast by accident.

The average time will be hard to get right in general due to the interactive
nature for some workloads and things could have big variations in practice
based on my experience. Ie; the frame to frame variations could be larger than
expected.

I think it is a helpful interface, but won't address all workload demands. The
thing I care about for example they really want to run ASAP and that's it. They
could run for a longer period of time, or a shorter period of time. next_buddy
type of behavior will help these tasks. But might need to be stronger than
current implementation. I am trying to find out..

And oversubscribed scenarios are important. It is common to have a sudden surge
of activities that cause delays that requires load balancer's help to better
distribute as some CPUs can get less busy sooner but wakeup preemption won't
save those already enqueued tasks from getting CPU time ASAP without some
additional external trigger. In contrary, I do see problems today (older CFS
LTS kernel) where a surge of short running tasks can delay enqueued tasks
considerably. I have no clue what's going on yet. I don't have a reproducer but
creeps up often enough when I look at traces.

Generally I think latency is more important in majority of systems these days
and it might be better to default to more responsive system and let those who
want throughput to opt-in, rather than the other way around.

In theory, there should be (very) few tasks in the system that are actually
need the next_buddy type of behavior to skip the queue and run ASAP if the
average default latency is good (1-2ms).

I also think we need to enable HRTICK by default too. We will have more timely
preemption points then. I generally think sched_feat should be a good way to
give admins the power to control certain aspects of the scheduler. We can also
make uclamp a sched_feat and ensure it can be made available on any system
- unlike today where it's not enabled by default on Debian at least and this
hit me and looks like the Asahi folks who managed to get good power
improvements and thankfully has higher weight than me asking for this to be
enabled by default. Read Energy Aware Scheduling section here

https://asahilinux.org/2024/01/fedora-asahi-new/

As a general topic for discussion not just for scheduler, there are core
features that must always be there from programmer's perspective. We are
shooting ourselves in the foot here by being too flexible with our usage of
CONFIGs, IMHO :)

Too much random babbling from my side maybe, but I think there's a series of
seemingly independent issues that are actually interconnected and one is
leading to the other but people are trying to find the one root-cause which
I don't think exists.

>
> Anyway, yes, userspace needs to change and provide more information. The
> trick ofcourse is figuring out which bit of information is critical /
> useful etc.
>
> There is a definite limit on the amount of constraints you want to solve
> at runtime.

+1

>
> Everybody going off and hacking their own thing does not help, we need
> collaboration to figure out what it is that is needed.

+2 - I've been trying to snoop on many use cases to further understand what
truly goes wrong. Some of the issues I've seen were actually due to bugs in the
kernel. Other issues could already be fixed with existing facilities, but
users didn't know how to use them. So the task is not easy to untangle.

>
> > Note the original min/wakeup_granularity_ns, latency_ns etc were tuned by
> > default for throughput by the way (server market bias). You can manipulate
> > those and get better latencies.
>
> The immediate problem with those knobs is that they are system wide. But
> yes, everybody was randomly poking them knobs, sometimes in obviously
> insane ways.

Yes. I fear though that because they were system wide is that the
out-of-the-box experience for many (especially with CFS defaults) were bad
latencies. I like the new 3ms base_slice_ns, but I think for many who care
about 120Hz refresh for example this is too large. It's almost half of the
frame time.

The TICK value plays a big role too. On 4ms TICK, this 3ms will become 4ms if
wakeup preemption decided not to preempt immediately.

Is this 3ms a constant by the way? I see it still depends on NR_CPUS, but
I read it on different systems and I got 3ms. I think having a constant value
across all systems makes more sense. With EAS (which I think someone should put
effort to enable it for SMP systems) we tend to pack. And a lot of systems have
too few of CPUs and things being packed is common case - I think the rationale
in the past was that we distribute tasks to idle CPUs at wake up which is good
for latency, but I don't know if this is a good assumption to make still to
decide these values.

And looks like we have a bug. I didn't spend a lot of time on studying EEVDF
impact on latencies, but I had this simple run with my pi_test [1]. You need
sched-analyzer/sched-analyzer-pp somewhere in your path which you can download
from [2]. setup Perfetto traced [3]. Running on 6.8.8 M1 Mac Mini

./run.sh 0 0 0

=====================================
:: 2255 | pi_test | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.25
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.75

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.75

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1149.0 3.95 1.12 -0.0 4.0 4.0 6.0 6.01 7.0 7.01
Running 1149.0 3.95 1.11 0.0 4.0 4.0 6.0 6.00 7.0 7.00

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1149.0 3.95 1.11 0.0 4.0 4.0 6.0 6.0 7.0 7.0



=========================================
:: 2257 | pi_test_low | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.89
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.11

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4533.11

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1154.0 3.93 1.14 0.0 4.0 4.0 6.0 6.0 7.0 7.0
Running 1154.0 3.93 1.13 -0.0 4.0 4.0 6.0 6.0 7.0 7.0

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1154.0 3.93 1.13 -0.0 4.0 4.0 6.0 6.0 7.0 7.0

Note that the average RUNNING (R is for RUNNABLE) time is ~4ms instead of 3ms.
Oour P90 and max values are almost double the 3ms slice. I am running with 1ms
TICK, so I think there has to be a bug somewhere preventing timely preemption..

If I enable HRTICK it looks much better

=====================================
:: 2517 | pi_test | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.05
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.96

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.0

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4911.96

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1645.0 2.99 0.22 0.0 3.0 3.0 3.0 3.0 3.01 3.97
Running 1646.0 2.98 0.17 -0.0 3.0 3.0 3.0 3.0 3.00 3.01

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1646.0 2.98 0.17 -0.0 3.0 3.0 3.0 3.0 3.0 3.01



=========================================
:: 2519 | pi_test_low | ['./pi_test'] ::
====================================================================================================
----------------------------------------------------------------------------------------------------
───────────────────────────── Sum Time in State Exclude Sleeping (ms) ──────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4912.11
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4910.89

────────────────────────────── % Time in State Exclude Sleeping (ms) ───────────────────────────────
R ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 50.01
Running ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 49.99

─────────────────────────────────── Sum Time Running on CPU (ms) ───────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4910.89

──────────────────────────────────── % Time Running on CPU (ms) ────────────────────────────────────
CPU0.0 ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 100.0

Time in State (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
state
R 1644.0 2.99 0.19 -0.00 3.0 3.0 3.0 3.0 3.01 3.95
Running 1643.0 2.99 0.16 0.51 3.0 3.0 3.0 3.0 3.00 3.01

Time Running on CPU (ms):
----------------------------------------------------------------------------------------------------
count mean std min 50% 75% 90% 95% 99% max
cpu
0.0 1643.0 2.99 0.16 0.51 3.0 3.0 3.0 3.0 3.0 3.01

[1] https://github.com/qais-yousef/pi_test
[2] https://github.com/qais-yousef/sched-analyzer/releases
[3] https://github.com/qais-yousef/sched-analyzer?tab=readme-ov-file#perfetto-mode

>
> > FWIW IMO the biggest issues I see in the scheduler is that its testability and
> > debuggability is hard. I think BPF can be a good fit for that. For the latter
> > I started this project, yet I am still trying to figure out how to add tracer
> > for the difficult paths to help people more easily report when a bad decision
> > has happened to provide more info about the internal state of the scheduler, in
> > hope to accelerate the process of finding solutions.
>
> So the pitfalls here are that exposing that information for debug
> purposes can/will lead to people consuming this information for
> non-debug purposes and then when we want to change things we're stuck
> because suddenly someone relies something we believed was an
> implementation detail :/
>
> I've been bitten by this before and this is why I'm so very hesitant to
> put tracepoints in the scheduler.

I was hoping the 'bare' tracepoint approach I added is okay? I don't need more
than that. Function signature and structure internals can never be ABIs.
I already had to deal with util_est changes across kernel versions.

If our emperor penguin is reading, it'd be great if he has new thoughts on
debug features and userspace dependency. I think we really need to help people
to better debug and understand why things aren't behaving as they anticipate.
Or at least make it easier to provide info on the list to help us understand
what could have gone wrong.

Your concerns are real. These should not prevent code from moving on without
worrying about breakages. If anyone latched into those I hope we can tell them
sorry, but this one is expected breakage.. I think by design the bare
tracepoints can never be ABI though.

> > From what I see, I am hitting bugs here and there
> > all the time. But they are hard to debug to truly understand where things went
> > wrong. Like this one for example where PTHREAD_PRIO_PI is a NOP for fair tasks.
> > Many thought using this flag doesn't help (rather than buggy)..
>
> Yay for the terminal backlog :/ I'll try and have a look.

It seems hard to fix without Proxy Execution :( If you have ideas for
a temporary solution that'd be great. But looks like we just need to get PE
merged and available for users - I think John's series doesn't tie this to
futex_pi yet.