[PATCH rfc 4/6] sched: cfs: add bpf hooks to control wakeup and tick preemption

From: Roman Gushchin
Date: Thu Sep 16 2021 - 12:35:03 EST


This patch adds 3 hooks to control wakeup and tick preemption:
cfs_check_preempt_tick
cfs_check_preempt_wakeup
cfs_wakeup_preempt_entity

The first one allows forcing or suppressing preemption from the tick
context. An obvious use case is to minimize the number of involuntary
context switches, and reduce the associated latency penalty, by
(conditionally) giving tasks or task groups an extended execution
slice. It can be used instead of tweaking
sysctl_sched_min_granularity.
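
To illustrate the return-value convention this hook uses in the fair.c
hunk of this patch (negative suppresses the tick preemption, positive
forces it, zero falls through to the default logic), here is a minimal
userspace model. It is not kernel or bpf code: tick_should_resched()
and extended_slice_policy() are hypothetical names, and the default
path is reduced to the single delta_exec > ideal_runtime check.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical policy: keep the current task running until it has had at
 * least min_slice_ns of CPU time, then defer to the default logic.
 */
static int extended_slice_policy(uint64_t delta_exec_ns, uint64_t min_slice_ns)
{
	return delta_exec_ns < min_slice_ns ? -1 : 0;
}

/*
 * Simplified model of the tick decision with the hook in place:
 *   hook_ret < 0  -> suppress preemption
 *   hook_ret > 0  -> force preemption
 *   hook_ret == 0 -> default check (reduced to delta_exec > ideal_runtime)
 */
static bool tick_should_resched(int hook_ret, uint64_t delta_exec_ns,
				uint64_t ideal_runtime_ns)
{
	if (hook_ret < 0)
		return false;
	if (hook_ret > 0)
		return true;
	return delta_exec_ns > ideal_runtime_ns;
}
```

With such a policy a task that has run for 4000ns against an ideal
runtime of 3000ns is still not preempted as long as the policy's
minimum slice is 5000ns.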

The second one is called from the wakeup preemption code and allows
overriding the decision on whether a newly woken task should preempt
the current task. This is useful for minimizing the number of
preemptions of latency-sensitive tasks. To some extent it is a more
flexible analog of sysctl_sched_wakeup_granularity.
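
As a rough sketch of the kind of policy this hook could express, the
userspace model below protects a latency-sensitive current task from
being preempted by other wakeups. struct task_model, its
latency_sensitive flag and wakeup_policy() are hypothetical; how a real
bpf program would classify tasks is not covered by this patch. The
positive branch (preferring a latency-sensitive wakee) is an extra
assumption added for symmetry.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the per-task state a policy might consult. */
struct task_model {
	bool latency_sensitive;
};

/*
 * Hypothetical wakeup policy using the hook's convention:
 *   < 0  -> the woken task must not preempt the current task
 *   > 0  -> the woken task must preempt the current task
 *   == 0 -> no opinion, keep the default CFS decision
 */
static int wakeup_policy(const struct task_model *curr,
			 const struct task_model *woken)
{
	if (curr->latency_sensitive && !woken->latency_sensitive)
		return -1;
	if (!curr->latency_sensitive && woken->latency_sensitive)
		return 1;
	return 0;
}
```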

The third one is similar, but it hooks into the wakeup_preempt_entity()
function, which is called not only from the wakeup context but also
from pick_next_task(), so it can also influence the decision on which
task will run next.
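
The override semantics of this hook can be modeled in userspace as
follows. This is only a sketch: wakeup_preempt_entity_model() is a
hypothetical name, the granularity is passed in as a plain parameter
rather than computed from the woken entity, and the tail of the
function (the comparison against the granularity) lies outside the diff
context of this patch, so it is an assumption about the existing code.

```c
#include <stdint.h>

/*
 * Userspace model of wakeup_preempt_entity() with the hook in place.
 * gran is passed in here; the kernel derives it from the woken entity.
 * Returns -1 (no preemption), 0 (within the granularity) or 1 (preempt).
 */
static int wakeup_preempt_entity_model(int hook_ret, int64_t curr_vruntime,
				       int64_t se_vruntime, int64_t gran)
{
	int64_t vdiff = curr_vruntime - se_vruntime;

	/* Any non-zero hook result replaces the vruntime comparison. */
	if (hook_ret)
		return hook_ret;

	if (vdiff <= 0)
		return -1;
	if (vdiff > gran)
		return 1;
	return 0;
}
```

Unlike the other two hooks, the hook's non-zero return value is passed
straight back to the caller, so a bpf program is expected to stay
within the -1/0/1 range the callers already handle.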

It is open for discussion whether we need both of the last two hooks
or only one of them: the latter is more powerful, but depends more on
the current implementation. In any case, bpf hooks are not an ABI, so
it's not a deal breaker.

The idea of the wakeup_preempt_entity hook belongs to Rik van Riel. He
also contributed a lot to the whole patchset by providing his ideas,
recommendations and feedback on earlier (non-public) versions.

Signed-off-by: Roman Gushchin <guro@xxxxxx>
---
include/linux/bpf_sched.h | 1 +
include/linux/sched_hook_defs.h | 4 +++-
kernel/sched/fair.c | 27 +++++++++++++++++++++++++++
3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf_sched.h b/include/linux/bpf_sched.h
index 6e773aecdff7..5c238aeb853c 100644
--- a/include/linux/bpf_sched.h
+++ b/include/linux/bpf_sched.h
@@ -40,6 +40,7 @@ static inline RET bpf_sched_##NAME(__VA_ARGS__) \
{ \
return DEFAULT; \
}
+#include <linux/sched_hook_defs.h>
#undef BPF_SCHED_HOOK

static inline bool bpf_sched_enabled(void)
diff --git a/include/linux/sched_hook_defs.h b/include/linux/sched_hook_defs.h
index 14344004e335..f075b32698cd 100644
--- a/include/linux/sched_hook_defs.h
+++ b/include/linux/sched_hook_defs.h
@@ -1,2 +1,4 @@
/* SPDX-License-Identifier: GPL-2.0 */
-BPF_SCHED_HOOK(int, 0, dummy, void)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_tick, struct sched_entity *curr, unsigned long delta_exec)
+BPF_SCHED_HOOK(int, 0, cfs_check_preempt_wakeup, struct task_struct *curr, struct task_struct *p)
+BPF_SCHED_HOOK(int, 0, cfs_wakeup_preempt_entity, struct sched_entity *curr, struct sched_entity *se)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ff69f245b939..35ea8911b25c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -21,6 +21,7 @@
* Copyright (C) 2007 Red Hat, Inc., Peter Zijlstra
*/
#include "sched.h"
+#include <linux/bpf_sched.h>

/*
* Targeted preemption latency for CPU-bound tasks:
@@ -4447,6 +4448,16 @@ check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)

 	ideal_runtime = sched_slice(cfs_rq, curr);
 	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_tick(curr, delta_exec);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			resched_curr(rq_of(cfs_rq));
+	}
+
 	if (delta_exec > ideal_runtime) {
 		resched_curr(rq_of(cfs_rq));
 		/*
@@ -7083,6 +7094,13 @@ wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
 {
 	s64 gran, vdiff = curr->vruntime - se->vruntime;

+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_wakeup_preempt_entity(curr, se);
+
+		if (ret)
+			return ret;
+	}
+
 	if (vdiff <= 0)
 		return -1;

@@ -7168,6 +7186,15 @@ static void check_preempt_wakeup(struct rq *rq, struct task_struct *p, int wake_
 	    likely(!task_has_idle_policy(p)))
 		goto preempt;

+	if (bpf_sched_enabled()) {
+		int ret = bpf_sched_cfs_check_preempt_wakeup(current, p);
+
+		if (ret < 0)
+			return;
+		else if (ret > 0)
+			goto preempt;
+	}
+
 	/*
 	 * Batch and idle tasks do not preempt non-idle tasks (their preemption
 	 * is driven by the tick):
--
2.31.1