Re: [PATCH] lib/cpumask: Boot option to disable tasks distribution within cpumask

From: Yury Norov
Date: Tue Apr 30 2024 - 14:23:17 EST


On Tue, Apr 30, 2024 at 02:34:31PM +0530, Ankit Jain wrote:
> commit 46a87b3851f0 ("sched/core: Distribute tasks within affinity masks")
> and commit 14e292f8d453 ("sched,rt: Use cpumask_any*_distribute()")
> introduced the logic to distribute the tasks within cpumask upon initial
> wakeup.

So let's add the authors in CC list?

> For Telco RAN deployments, isolcpus are a necessity to cater to
> the requirement of low latency applications. These isolcpus are generally
> tickless so that high priority SCHED_FIFO tasks can execute without any
> OS jitter. Since load balancing is disabled on isocpus, any task
> which gets placed on these CPUs can not be migrated on its own.
> For RT applications to execute on isolcpus, a guaranteed kubernetes pod
> with all isolcpus becomes the requirement and these RT applications are
> affine to execute on a specific isolcpu within the kubernetes pod.
> However, there may be some non-RT tasks which could also schedule in the
> same kubernetes pod without being affine to any specific CPU(inherits the
> pod cpuset affinity).

OK... It looks like adding scheduler maintainers is also a necessity to
cater here...

> With multiple spawning and running containers inside
> the pod, container runtime spawns several non-RT initializing tasks
> ("runc init") inside the pod and due to above mentioned commits, these
> non-RT tasks may get placed on any isolcpus and may starve if it happens
> to wakeup on the same CPU as SCHED_FIFO task because RT throttling is also
> disabled in telco setup. Thus, RAN deployment fails and eventually leads
> to system hangs.

Not that I'm familiar to your setup, but this sounds like a userspace
configuration problems. Can you try to move your non-RT tasks into a
cgroup attached to non-RT CPUs, or something like that?

> With the introduction of kernel cmdline param 'sched_pick_firstcpu',
> there is an option provided for such usecases to disable the distribution
> of tasks within the cpumask logic and use the previous 'pick first cpu'
> approach for initial placement of tasks. Because many telco vendors
> configure the system in such a way that the first cpu within a cpuset
> of pod doesn't run any SCHED_FIFO or High priority tasks.
>
> Co-developed-by: Alexey Makhalov <alexey.makhalov@xxxxxxxxxxxx>
> Signed-off-by: Alexey Makhalov <alexey.makhalov@xxxxxxxxxxxx>
> Signed-off-by: Ankit Jain <ankit-aj.jain@xxxxxxxxxxxx>
> ---
> lib/cpumask.c | 24 ++++++++++++++++++++++++
> 1 file changed, 24 insertions(+)
>
> diff --git a/lib/cpumask.c b/lib/cpumask.c
> index e77ee9d46f71..3dea87d5ec1f 100644
> --- a/lib/cpumask.c
> +++ b/lib/cpumask.c
> @@ -154,6 +154,23 @@ unsigned int cpumask_local_spread(unsigned int i, int node)
> }
> EXPORT_SYMBOL(cpumask_local_spread);
>
> +/*
> + * Task distribution within the cpumask feature disabled?
> + */
> +static bool cpumask_pick_firstcpu __read_mostly;
> +
> +/*
> + * Disable Tasks distribution within the cpumask feature
> + */
> +static int __init cpumask_pick_firstcpu_setup(char *str)
> +{
> + cpumask_pick_firstcpu = 1;
> + pr_info("cpumask: Tasks distribution within cpumask is disabled.");
> + return 1;
> +}
> +
> +__setup("sched_pick_firstcpu", cpumask_pick_firstcpu_setup);
> +
> static DEFINE_PER_CPU(int, distribute_cpu_mask_prev);
>
> /**
> @@ -171,6 +188,13 @@ unsigned int cpumask_any_and_distribute(const struct cpumask *src1p,
> {
> unsigned int next, prev;
>
> + /*
> + * Don't distribute, if tasks distribution
> + * within cpumask feature is disabled
> + */
> + if (cpumask_pick_firstcpu)
> + return cpumask_any_and(src1p, src2p);

No, this is a wrong way.

To begin with, this parameter shouldn't control a single random
function. At least, the other cpumask_*_distribute() should be
consistent to the policy.

But in general... I don't think we should do things like that at all.
Cpumask API is a simple and plain wrapper around bitmaps. If you want
to modify a behavior of the scheduler, you could do that at scheduler
level, not in a random helper function.

Consider 2 cases:
- Someone unrelated to scheduler would use the same helper and will
be affected by this parameter inadvertently.
- Scheduler will switch to using another function to distribute CPUs,
and your setups will suddenly get broken again. This time deeply in
production.

Thanks,
Yury

> /* NOTE: our first selection will skip 0. */
> prev = __this_cpu_read(distribute_cpu_mask_prev);
>
> --
> 2.23.1