Re: sched_ext/for-6.11: cpu validity check in ops_cpu_valid

From: Tejun Heo
Date: Mon Jul 15 2024 - 01:17:42 EST


Hello, Vishal.

On Sun, Jul 14, 2024 at 12:44:24AM +0530, Vishal Chourasia wrote:
> Currently, the BPF scheduler can return a CPU that is marked as possible
> in the system configurations, but this doesn't guarantee that the CPU is
> actually present or online at the time. This behavior can lead to
> scenarios where the scheduler attempts to assign tasks to CPUs that are
> not available, causing the fallback mechanism to activate and
> potentially leading to an uneven load distribution across the system.

ops.select_cpu() is allowed to return any CPU and then the scheduler will
pick a fallback CPU. This is mostly because that's how
sched_class->select_task_rq() behaves. Here, SCX is just inheriting the
behavior.

Dispatching to foreign local DSQ using SCX_DSQ_LOCAL_ON also does
auto-fallback. This is because it's difficult for the BPF scheduler to
strongly synchronize its dispatch operation against CPU hotplug operations.

> By defalut, When a "not possible" CPU is returned, sched_ext gracefully
> exits the bpf scheduler.
>
> static bool ops_cpu_valid(s32 cpu, const char *where)
> {
> if (likely(cpu >= 0 && cpu < nr_cpu_ids && cpu_possible(cpu))) {
> return true;
> } else {
> scx_ops_error("invalid CPU %d%s%s", cpu,
> where ? " " : "", where ?: "");
> return false;
> }
> }
>
> On POWER, a system can have differences in cpu_present and cpu_possible
> mask. Not present, but possible CPUs can be added later but once added
> will also be marked set in the cpu present mask.
>
> Looks like cpu_present() is a better check.

We can consider tightening each path separately but I'm not sure making
ops_cpu_valid() more strict is a good idea. For example, there's no reason
to abort because a scheduler is calling scx_bpf_dsq_nr_queued() on an
offline CPU especially given that the kfunc is allowed from any context
without any synchronization. It can create aborting bugs which are really
difficult to reproduce.

Thanks.

--
tejun