Re: [PATCHSET v6] sched: Implement BPF extensible scheduler class

From: Tejun Heo
Date: Tue May 07 2024 - 15:33:13 EST


Hello, Rik.

On Mon, May 06, 2024 at 02:47:47PM -0400, Rik van Riel wrote:
> I believe the issues that Paul pointed out with my flattened cgroup code
> are fixable. I ended up not getting back to this code because it took me a
> few months to think of ways to fix the issues Paul found, and by then I
> had moved on to other projects.
>
> For reference, Paul found these two (very real) issues with my
> implementation.
>
> 1) Thundering herd problem. If many tasks in a low priority cgroup wake
> up at the same time, they can end up swamping a CPU.

The way that scx_flatcg (which seems broken right now, will fix it) works
around this problem is by always doing two-level scheduling. i.e. The top
level rbtree hosts the tasks in the root cgroup and all the active cgroups,
where each cgroup is scheduled according to its current flattened
hierarchical share. This seems to work pretty well, as what becomes really
expensive is the repeated nesting, which can easily go 6+ levels deep.

This doesn't solve the thundering herd problem completely but shifts it one
level. i.e. Thundering herds of threads can be handled easily. However,
thundering herds of cgroups can still cause unfairness. I can imagine cases
where this can lead to scheduling issues but they all seem pretty convoluted
and artificial. Sure, somebody who's intentionally adversarial can cause
temporary issues but perfect isolation of adversarial actors isn't what
cgroups can or should practically target.

Even in the case this is an actual issue, we can solve it by limiting
nesting. cgroups already have delegation boundaries. Maybe it needs to be
made more explicit but one solution could be adding a nesting layer only on
delegation boundaries so that misbehaviors are better contained within each
delegation domain.

> I believe this can be solved with the same idea I had for
> reimplementing CONFIG_CFS_BANDWIDTH. Specifically, the code that
> determines the time slice length for a task already has a way to
> determine whether a CPU is "overloaded", and time slices need to be
> shortened. Once we reach that situation, we can place woken up tasks on
> a secondary heap of per-cgroup runqueues, from which we do not directly
> run tasks, but pick the lowest vruntime task from the lowest vruntime
> cgroup and put that on the main runqueue, if the previously running

When overloaded, are the cgroups being put on a single rbtree? If so, they'd
be using flattened shares, right? I wonder whether what you're suggesting
for the overloaded case is pretty similar to what flatcg is doing, plus
avoiding one level of indirection while not overloaded.

Thanks.

--
tejun