[PATCH 0/1] [RFC] Morphing CFS into FDL, The Fair Deadline Scheduling Class

From: Xi Wang
Date: Sat Apr 01 2023 - 19:06:40 EST


This is about putting every thread under some aggregated deadline.

Note: In draft status, one big patch, not always conforming to
coding style etc. Only hooked to cgroup v1 at the moment. Based on
6.1.21. The use case, scheduling model and sched policy matter more
here than the implementation details.


Motivation:

Shared servers run multiple, often unrelated jobs in cgroup based
containers. Ideally each job runs under the illusion that it has the
whole machine, or a slice of the machine, unaffected by other jobs.
One challenge is that we want to maintain this illusion while
overcommitting cpu bandwidth for efficiency. Most of the time this is
possible thanks to the statistical multiplexing effect, at least in
theory.

Another challenge is that some jobs on the shared machines are by
definition real time workloads, which often happens due to the
fan-out pattern. An upper layer service may need to query many lower
layer services before it can assemble the result and respond to a
query, resulting in amplified tail latency: the median latency of the
upper layer query is determined by the tail latency of the lower
layer queries. Bounded tail latency for the lower layer services
therefore becomes much more important. We can end up running
real-time servers without knowing it.
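
A back of the envelope calculation may help (purely illustrative, and
it assumes independent lower layer latencies, which real systems
rarely have): the upper layer query is only as fast as the slowest of
its N lower layer queries, so the lower layer percentile that bounds
the upper layer median is 0.5^(1/N).

/* Illustration only, assuming independent lower layer latencies. */
#include <math.h>
#include <stdio.h>

int main(void)
{
        int n;

        for (n = 1; n <= 1000; n *= 10) {
                /* percentile p such that p^n == 0.5 */
                double p = 100.0 * pow(0.5, 1.0 / n);

                printf("fan out %4d: lower layer p%.2f bounds "
                       "the upper layer median\n", n, p);
        }
        return 0;
}

With a fan out of 100, roughly the lower layer p99.3 already sets the
upper layer median.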


Main goals:

- Sched latency guarantees
- CPU bandwidth guarantees
- Real time support for all workloads; generic and scalable
- Per cgroup policies
- Hierarchical scheduling - typically real-time for first level
cgroups and best effort for lower cgroups
- Seamless transition for existing workloads


Scheduling model:

FDL is based on my draining buckets model, which still needs a more
detailed definition, but the core idea should be intuitive. The model
was inspired by media streaming applications. It might be considered
a type of deadline scheduling and/or equivalent to some other models.
If a device receives an audio stream, processes it and plays it
through a speaker, the rx buffer can overflow when the cpu is busy
with something else for too long, and then we get audio glitches. To
avoid buffer overflows, the receiving thread has to get enough cpu
cycles and its off-cpu gaps cannot be too large, so the requirements
are on both latency and cpu bandwidth. If we model every workload
this way we get the draining buckets model. The scheduler's job is
simply to prevent any bucket from overflowing. This is a continuous
model with no event arrival intervals, quota replenishment periods,
etc. The round robin deadline is simply the time it takes to fill up
an empty bucket.
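
As a toy userspace illustration of that last point (made-up units,
not code from the patch): if work flows into a bucket at a constant
rate, an empty bucket overflows after capacity / fill_rate, which
corresponds to the round robin deadline (the cpu.deadline_rr knob
described under Interface below).

#include <stdio.h>

/* Toy model of one draining bucket; the units are made up. */
struct bucket {
        double capacity;  /* work the bucket holds before overflowing */
        double fill_rate; /* work arriving per ns while the task waits */
        double level;     /* current fill level */
};

/* ns from now until this bucket overflows if the task stays off cpu */
static double time_to_overflow(const struct bucket *b)
{
        return (b->capacity - b->level) / b->fill_rate;
}

int main(void)
{
        /* 200000 units of capacity filling at 1 unit/ns: an empty
         * bucket overflows after 200000 ns, matching a hypothetical
         * cpu.deadline_rr of 200000. */
        struct bucket b = { .capacity = 200000, .fill_rate = 1.0 };

        printf("empty bucket overflows in %.0f ns\n", time_to_overflow(&b));
        return 0;
}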

The model was designed to be resilient, backward compatible and
natively support overcommit. It prefers graceful degradation over
admission control.


Interface:

The cpu cgroup controller gains 3 new per-cgroup parameters
(per-thread control is not supported at the moment).

cpu.deadline_rr: round robin latency in ns
cpu.deadline_wakeup: optional wake up boost, latency in ns
cpu.guaranteed_shares: cpu bandwidth allocation, 1024 == 1 cpu for
first level cgroups, relative for lower cgroups


Implementation:

Backward compatibility is a major challenge for a generic latency
aware scheduling policy. Many applications have adapted to various
CFS behaviors over the years, and moving all of them to a new sched
class without causing at least a few performance regressions is
rather difficult. Moving only the latency sensitive applications to a
new sched class does not make the problem much easier - two
scheduling classes can easily create priority inversion problems.
Existing high priority scheduling classes do have mechanisms to yield
to lower priority sched classes to prevent starvation, yet these
mechanisms are difficult to tune and they can hurt the tail latency
of the high priority classes. The dilemma is that new low latency
applications are much better supported if all threads are put in a
new sched class, but old applications have a much better chance of
still working if they stay in the old sched class.

This patch might be an unexpected take on the tradeoffs. It takes a
shortcut by directly modifying CFS to add latency support: individual
cgroups can be flipped to FDL while the remaining cgroups should see
minimal disruption. ("Real-time containers" might sound like a joke,
but FDL actually supports them.) It also reuses the recursive
mechanisms of fair group scheduling to support hierarchical
scheduling. I do not expect FDL to make a later move to a clean sheet
design more difficult.

CFS is already similar to a deadline scheduler in that it picks the
next task from a vruntime based rb tree. A deadline based rb tree is
added in parallel to the vruntime tree to support latency based
scheduling. The deadline tree is examined before the vruntime tree on
pick next.
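
In userspace terms the pick order looks roughly like the toy below
(linear scans stand in for the two rb trees, and the real pick logic
in the patch has more conditions; this is only a sketch of the
"deadline tree first, vruntime tree second" order):

#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for a sched entity; the patch extends CFS's own
 * structures and keeps the entities on rb trees instead. */
struct toy_entity {
        const char *name;
        unsigned long long vruntime;
        unsigned long long deadline; /* 0 == not on the deadline tree */
};

static struct toy_entity *toy_pick_next(struct toy_entity *e, size_t n)
{
        struct toy_entity *dl = NULL, *vr = NULL;
        size_t i;

        for (i = 0; i < n; i++) {
                /* leftmost of the deadline "tree" */
                if (e[i].deadline && (!dl || e[i].deadline < dl->deadline))
                        dl = &e[i];
                /* leftmost of the vruntime "tree" */
                if (!vr || e[i].vruntime < vr->vruntime)
                        vr = &e[i];
        }
        /* deadline tree is examined first, vruntime tree is the fallback */
        return dl ? dl : vr;
}

int main(void)
{
        struct toy_entity rq[] = {
                { "cfs-a", 1000, 0 },
                { "cfs-b",  500, 0 },      /* smallest vruntime */
                { "fdl-a", 9000, 200000 }, /* has a deadline, picked first */
        };

        printf("picked: %s\n", toy_pick_next(rq, 3)->name);
        return 0;
}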

CFS is not a global scheduler, so some compromises were made for
multi-core scheduling. The deadline based pick next is cpu local, but
deadline awareness was added to the wake up load balancing path to
even out the deadline pressure across cpus. A feedback loop is used
to dynamically adjust the cgroup cpu shares and provide the cpu
bandwidth guarantee.
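
One plausible shape for such a feedback loop, shown as a userspace
toy with made-up numbers (the controller in the patch is more
involved): after each accounting period, scale a cgroup's effective
cpu shares by the ratio of guaranteed to received bandwidth, clamped
so a single noisy period cannot swing the weight too far.

#include <stdio.h>

/* Toy proportional controller; guaranteed and received are in the
 * same units, e.g. cpu-ns consumed over the last period. */
static unsigned long adjust_shares(unsigned long shares,
                                   unsigned long long guaranteed,
                                   unsigned long long received)
{
        unsigned long next;

        if (!received)
                received = 1; /* avoid division by zero */
        next = (unsigned long)(shares * guaranteed / received);

        /* clamp the per-period change for stability */
        if (next > shares * 2)
                next = shares * 2;
        if (next < shares / 2)
                next = shares / 2;
        return next ? next : 1;
}

int main(void)
{
        /* guaranteed 1 cpu (1024 shares) but only received 0.6 cpu
         * worth of runtime last period, so the shares are bumped up */
        unsigned long shares = adjust_shares(1024, 1000000, 600000);

        printf("new shares: %lu\n", shares);
        return 0;
}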


Compared to EEVDF:

I only have a shallow understanding of the EEVDF scheduler, so the
comparison is also limited. The two have both similarities and
differences. The most prominent similarity might be that both are
latency aware, generic sched classes taking the place of CFS. Some
differences: FDL tries to be as real time as possible, supporting
configurable deadlines down to microseconds, and FDL aims to be
non-disruptive to existing user applications. Threads not configured
to run under FDL should still see CFS behavior with minimal change,
which is a reason for modifying CFS instead of creating a completely
new sched class.


Example:

The script below launches 4 workloads in 4 cgroups and runs them
concurrently: hackbench as the antagonist, plus 3 schbench instances
with identical command line parameters but different cgroup
configurations as the workloads. One schbench instance runs under cfs
and the other two run under fdl with different deadline parameters.

The schbench stats from one of the test runs (2 socket x 24 core / 96
hyperthread cascadelake):

==== cfs ====
warmup done, zeroing stats
Latency percentiles (usec) runtime 30 (s) (8496 total samples)
50.0th: 417 (4248 samples)
75.0th: 2268 (2129 samples)
90.0th: 5512 (1270 samples)
95.0th: 9168 (429 samples)
*99.0th: 19936 (337 samples)
99.5th: 25440 (42 samples)
99.9th: 37952 (35 samples)
min=7, max=59690

==== fdl loose ====
warmup done, zeroing stats
Latency percentiles (usec) runtime 30 (s) (8544 total samples)
50.0th: 52 (4304 samples)
75.0th: 68 (2167 samples)
90.0th: 84 (1240 samples)
95.0th: 97 (411 samples)
*99.0th: 1378 (338 samples)
99.5th: 2428 (44 samples)
99.9th: 5608 (32 samples)
min=6, max=23616

==== fdl ====
warmup done, zeroing stats
Latency percentiles (usec) runtime 30 (s) (8534 total samples)
50.0th: 48 (4325 samples)
75.0th: 63 (2210 samples)
90.0th: 76 (1204 samples)
95.0th: 85 (387 samples)
*99.0th: 115 (323 samples)
99.5th: 725 (43 samples)
99.9th: 5400 (35 samples)
min=1, max=12046

Data extracted from 3 test runs with the antagonist and 3 test runs
without the antagonist (latencies in usec):

                             p50 latency          p99 latency
cfs        with antagonist   300, 280, 417        18528, 20960, 19339
fdl-loose  with antagonist   52, 51, 52           1722, 1942, 1378
fdl        with antagonist   51, 49, 48           609, 234, 115

cfs        no antagonist     38, 85, 69           123, 129, 124
fdl-loose  no antagonist     85, 54, 84           128, 129, 124
fdl        no antagonist     53, 52, 63           129, 118, 128


The test script:

antagonist=true

CR=<cgroup root>

killall hackbench

if $antagonist; then
        mkdir -p $CR/antagonist
        echo $$ > $CR/antagonist/tasks
        hackbench -g 10 -f 10 -s 20000 --loops 1000000 > /dev/null &
        sleep 5
fi

mkdir -p $CR/cfs
echo $$ > $CR/cfs/tasks
echo 0 > $CR/cfs/cpu.deadline_rr
echo 0 > $CR/cfs/cpu.deadline_wakeup
echo 0 > $CR/cfs/cpu.guaranteed_shares
schbench -r 30 -i -1 &> log-cfs &

mkdir -p $CR/fdl-loose
echo $$ > $CR/fdl-loose/tasks
echo 5000000 > $CR/fdl-loose/cpu.deadline_rr
echo 2500000 > $CR/fdl-loose/cpu.deadline_wakeup
echo 7000 > $CR/fdl-loose/cpu.guaranteed_shares
schbench -r 30 -i -1 &> log-fdl-loose &

mkdir -p $CR/fdl
echo $$ > $CR/fdl/tasks
echo 200000 > $CR/fdl/cpu.deadline_rr
echo 100000 > $CR/fdl/cpu.deadline_wakeup
echo 7000 > $CR/fdl/cpu.guaranteed_shares
schbench -r 30 -i -1 &> log-fdl &


echo $$ > $CR/tasks

sleep 40

echo ==== cfs ====
cat log-cfs
echo
echo ==== fdl loose ====
cat log-fdl-loose
echo
echo ==== fdl ====
cat log-fdl
echo

killall hackbench




Xi Wang (1):
[RFC] sched: Morphing CFS into FDL, The Fair Deadline Scheduling Class

include/linux/sched.h | 7 +
kernel/sched/Makefile | 1 +
kernel/sched/core.c | 162 ++++++-
kernel/sched/cpuacct.c | 5 +
kernel/sched/debug.c | 10 +
kernel/sched/fair.c | 927 +++++++++++++++++++++++++++++++++++++-
kernel/sched/features.h | 2 +-
kernel/sched/qos.c | 305 +++++++++++++
kernel/sched/qos.h | 135 ++++++
kernel/sched/qos_params.h | 30 ++
kernel/sched/sched.h | 60 +++
kernel/sched/stats.h | 10 +
12 files changed, 1620 insertions(+), 34 deletions(-)
create mode 100644 kernel/sched/qos.c
create mode 100644 kernel/sched/qos.h
create mode 100644 kernel/sched/qos_params.h

--
2.40.0.348.gf938b09366-goog