[PATCH v2 27/35] sched: support preempt=none under PREEMPT_AUTO

From: Ankur Arora
Date: Mon May 27 2024 - 20:41:53 EST


The default preemption policy for the no forced preemption model under
PREEMPT_AUTO is to always schedule lazily for well-behaved, non-idle
tasks, preempting at exit-to-user.

We already have that, so enable it.

Comparing a scheduling/IPC workload:

# perf stat -a -e cs --repeat 10 -- perf bench sched messaging -g 20 -t -l 5000

PREEMPT_AUTO, preempt=none

3,173,961 context-switches ( +- 0.60% )

3.03058 +- 0.00621 seconds time elapsed ( +- 0.20% )

PREEMPT_DYNAMIC, preempt=none

2,942,664 context-switches ( +- 0.49% )

3.18924 +- 0.00483 seconds time elapsed ( +- 0.15% )

Both perform similarly, but we incur a slightly higher number of
context-switches with PREEMPT_AUTO.

Drilling down we see that both voluntary and involuntary
context-switches are higher for this test:

PREEMPT_AUTO, preempt=none

2286219.90 +- 39510.80 voluntary context-switches ( +- 1.72% )
887741.80 +- 20137.63 involuntary context-switches ( +- 2.26% )

PREEMPT_DYNAMIC, preempt=none

2125750.40 +- 29593.55 voluntary context-switches ( +- 1.39% )
816914.20 +- 13723.46 involuntary context-switches ( +- 1.67% )

Assuming voluntary context-switches due to explicit blocking are
similar, we expect that PREEMPT_AUTO will incur more context-switches
at exit-to-user (counted as voluntary) since that is its default
rescheduling point.

Involuntary context-switches under PREEMPT_AUTO are seen when a
task has exceeded its time quantum by a tick. Under PREEMPT_DYNAMIC,
these are incurred when a task needs to be rescheduled and then
encounters a cond_resched(). So, these two numbers aren't directly
comparable.

Comparing a kernbench workload:

# Half load (-j 32)

               PREEMPT_AUTO                              PREEMPT_DYNAMIC

wall          74.41 +-    0.45 sec ( +- 0.60% )         74.20 +-    0.33 sec ( +- 0.45% )
utime       1419.78 +-    2.04 sec ( +- 0.14% )       1416.40 +-    6.07 sec ( +- 0.42% )
stime        247.70 +-    0.88 sec ( +- 0.35% )        246.23 +-    1.20 sec ( +- 0.49% )
%cpu        2240.20 +-   16.03     ( +- 0.71% )       2240.20 +-   19.34     ( +- 0.86% )
inv-csw    13056.00 +-  427.58     ( +- 3.27% )      18750.60 +-  771.21     ( +- 4.11% )
vol-csw   191000.00 +- 1623.25     ( +- 0.84% )     182857.00 +- 2373.12     ( +- 1.29% )

The runtimes are basically identical for both. Voluntary
context-switches are higher under PREEMPT_AUTO, as above (and in the
optimal, maximal runs below), which is consistent with the
exit-to-user rescheduling described earlier.

However, unlike the sched-messaging workload, the involuntary
context-switches are generally lower (also true for the optimal,
maximal runs). One reason for that might be that kbuild spends
~20% of its time executing in the kernel, while sched-messaging
spends ~95% of its time in the kernel, which means sched-messaging
has a greater likelihood of being preempted for exceeding its time
quantum.

Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Juri Lelli <juri.lelli@xxxxxxxxxx>
Cc: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Originally-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/lkml/87jzshhexi.ffs@tglx/
Signed-off-by: Ankur Arora <ankur.a.arora@xxxxxxxxxx>
---
kernel/sched/core.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2bc7f636267d..c3ba33c77053 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8983,7 +8983,9 @@ static void __sched_dynamic_update(int mode)
 {
 	switch (mode) {
 	case preempt_dynamic_none:
-		preempt_dynamic_mode = preempt_dynamic_undefined;
+		if (mode != preempt_dynamic_mode)
+			pr_info("%s: none\n", PREEMPT_MODE);
+		preempt_dynamic_mode = mode;
 		break;

 	case preempt_dynamic_voluntary:
--
2.31.1