[PATCH] lib/cpumask: Boot option to disable tasks distribution within cpumask

From: Ankit Jain
Date: Tue Apr 30 2024 - 05:05:02 EST


Commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
introduced logic to distribute tasks within a cpumask upon initial
wakeup.

For Telco RAN deployments, isolcpus are a necessity to meet the
requirements of low-latency applications. These isolcpus are generally
tickless so that high-priority SCHED_FIFO tasks can execute without any
OS jitter. Since load balancing is disabled on isolcpus, a task that
gets placed on one of these CPUs is not migrated by the scheduler.

For RT applications to execute on isolcpus, a guaranteed Kubernetes pod
containing all isolcpus is required, and each RT application is affined
to a specific isolcpu within the pod. However, some non-RT tasks may
also be scheduled in the same pod without being affined to any specific
CPU (they inherit the pod cpuset affinity). When multiple containers are
spawned and run inside the pod, the container runtime spawns several
non-RT initialization tasks ("runc init") inside the pod. Due to the
above-mentioned commits, these non-RT tasks may get placed on any
isolcpu and may starve if they happen to wake up on the same CPU as a
SCHED_FIFO task, because RT throttling is also disabled in telco setups.
The RAN deployment then fails, which eventually leads to system hangs.

The kernel cmdline parameter 'sched_pick_firstcpu' introduced here gives
such use cases an option to disable the distribution of tasks within the
cpumask and fall back to the previous 'pick first cpu' approach for
initial placement of tasks. This helps because many telco vendors
configure the system so that the first CPU within a pod's cpuset does
not run any SCHED_FIFO or other high-priority tasks.

Co-developed-by: Alexey Makhalov <alexey.makhalov@xxxxxxxxxxxx>
Signed-off-by: Alexey Makhalov <alexey.makhalov@xxxxxxxxxxxx>
Signed-off-by: Ankit Jain <ankit-aj.jain@xxxxxxxxxxxx>
---
lib/cpumask.c | 24 ++++++++++++++++++++++++
1 file changed, 24 insertions(+)
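
Note for reviewers: the difference between the two placement policies
can be modelled in user space. What follows is only a rough sketch under
stated assumptions, not kernel code. The names pick_cpu(), next_bit(),
prev_cpu, pick_firstcpu and NR_CPUS are hypothetical, a 64-bit integer
stands in for struct cpumask, and a single global replaces the per-CPU
distribute_cpu_mask_prev variable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 64u

/* Hypothetical stand-ins for the boot flag and the per-CPU cursor. */
static bool pick_firstcpu;
static unsigned int prev_cpu;

/* Return the first set bit at or after 'start', or NR_CPUS if none. */
static unsigned int next_bit(uint64_t mask, unsigned int start)
{
	for (unsigned int cpu = start; cpu < NR_CPUS; cpu++)
		if (mask & (UINT64_C(1) << cpu))
			return cpu;
	return NR_CPUS;
}

/* Pick a CPU from the intersection of two masks. */
static unsigned int pick_cpu(uint64_t src1, uint64_t src2)
{
	uint64_t mask = src1 & src2;
	unsigned int next;

	/* Distribution disabled: always return the lowest set bit. */
	if (pick_firstcpu)
		return next_bit(mask, 0);

	/* Distribution enabled: continue after the previous pick, wrapping. */
	next = next_bit(mask, prev_cpu + 1);
	if (next >= NR_CPUS)
		next = next_bit(mask, 0);
	if (next < NR_CPUS)
		prev_cpu = next;
	return next;
}

/* Usage example: CPUs 2, 3 and 5 are allowed. */
void example(void)
{
	const uint64_t allowed = 0x2c;	/* bits 2, 3 and 5 */
	const uint64_t online = ~UINT64_C(0);

	pick_firstcpu = false;
	prev_cpu = 0;
	assert(pick_cpu(allowed, online) == 2);
	assert(pick_cpu(allowed, online) == 3);
	assert(pick_cpu(allowed, online) == 5);
	assert(pick_cpu(allowed, online) == 2);	/* wrapped around */

	pick_firstcpu = true;
	assert(pick_cpu(allowed, online) == 2);
	assert(pick_cpu(allowed, online) == 2);	/* never advances */
}
```

With the flag set, every caller lands on the lowest CPU of the mask,
which is the property the telco configurations described above rely on.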

diff --git a/lib/cpumask.c b/lib/cpumask.c
index e77ee9d46f71..3dea87d5ec1f 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
}
EXPORT_SYMBOL(cpumask_local_spread);

+/*
+ * Task distribution within the cpumask feature disabled?
+ */
+static bool cpumask_pick_firstcpu __read_mostly;
+
+/*
+ * Disable task distribution within the cpumask
+ */
+static int __init cpumask_pick_firstcpu_setup(char *str)
+{
+ cpumask_pick_firstcpu = true;
+ pr_info("cpumask: Task distribution within cpumask is disabled.\n");
+ return 1;
+}
+
+__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
+
static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);

/**
@@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
{
unsigned int next, prev;

+ /*
+ * Don't distribute if task distribution
+ * within the cpumask is disabled
+ */
+ if (cpumask_pick_firstcpu)
+ return cpumask_any_and(src1p, src2p);
+
/* NOTE: our first selection will skip 0. */
prev = __this_cpu_read(distribute_cpu_mask_prev);

--
2.23.1