Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

From: Valentin Schneider
Date: Mon Dec 18 2023 - 13:07:28 EST


On 07/12/23 15:47, Benjamin Segall wrote:
> Alright, this took longer than I intended, but here's the basic version
> with the duplicate "runqueue" just being a list rather than actual EEVDF
> or anything sensible.
>
> I also wasn't actually able to adapt it to work via task_work like I
> thought I could; flagging the entry to the relevant schedule() calls is
> too critical.
>

Thanks for sharing this! Have a first set of comments.

> --------
>
> Subject: [RFC PATCH] sched/fair: only throttle CFS tasks in userspace
>
> The basic idea of this implementation is to maintain duplicate runqueues
> in each cfs_rq that contain duplicate pointers to sched_entitys which
> should bypass throttling. Then we can skip throttling cfs_rqs that have
> any such children, and when we pick inside any not-actually-throttled
> cfs_rq, we only look at this duplicated list.
>
> "Which tasks should bypass throttling" here is "all schedule() calls
> that don't set a special flag", but could instead involve the lockdep
> markers (except for the problem of percpu-rwsem and similar) or explicit
> flags around syscalls and faults, or something else.
>
> This approach avoids any O(tasks) loops, but leaves partially-throttled
> cfs_rqs still contributing their full h_nr_running to their parents,
> which might result in worse balancing. Also it adds more (generally
> still small) overhead to the common enqueue/dequeue/pick paths.
>
> The very basic debug test added is to run a cpusoaker and "cat
> /sys/kernel/debug/sched_locked_spin" pinned to the same cpu in the same
> cgroup with a quota < 1 cpu.

So if I'm trying to compare notes:

The dual tree:
- Can actually throttle cfs_rq's once all tasks are in userspace
- Adds (light-ish) overhead to enqueue/dequeue
- Doesn't keep *nr_running updated when not fully throttled

Whereas dequeue via task_work:
- Can never fully throttle a cfs_rq
- Adds (not so light) overhead to unthrottle
- Keeps *nr_running updated

My thoughts right now are that there might be a way to mix both worlds: go
with the dual tree, but subtract from h_nr_running at dequeue_kernel()
(sort of doing the count update in throttle_cfs_rq() early) and keep track
of in-user tasks via handle_kernel_task_prev() to add this back when
unthrottling a not-fully-throttled cfs_rq. I need to actually think about
it though :-)

> @@ -154,11 +154,11 @@ static unsigned long exit_to_user_mode_loop(struct pt_regs *regs,
> while (ti_work & EXIT_TO_USER_MODE_WORK) {
>
> local_irq_enable_exit_to_user(ti_work);
>
> if (ti_work & _TIF_NEED_RESCHED)
> - schedule();
> + schedule_usermode(); /* TODO: also all of the arch/* loops that don't use this yet */
^^
Interestingly -Werror=comment trips on this

>
> if (ti_work & _TIF_UPROBE)
> uprobe_notify_resume(regs);
>
> if (ti_work & _TIF_PATCH_PENDING)
> @@ -5416,18 +5426,38 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> se->prev_sum_exec_runtime = se->sum_exec_runtime;
> }
>
> /*
> * Pick the next process, keeping these things in mind, in this order:
> - * 1) keep things fair between processes/task groups
> - * 2) pick the "next" process, since someone really wants that to run
> - * 3) pick the "last" process, for cache locality
> - * 4) do not run the "skip" process, if something else is available
> + * 1) If we're inside a throttled cfs_rq, only pick threads in the kernel
> + * 2) keep things fair between processes/task groups
> + * 3) pick the "next" process, since someone really wants that to run
> + * 4) pick the "last" process, for cache locality
> + * 5) do not run the "skip" process, if something else is available
> */
> static struct sched_entity *
> -pick_next_entity(struct cfs_rq *cfs_rq)
> +pick_next_entity(struct cfs_rq *cfs_rq, bool throttled)
> {
> +#ifdef CONFIG_CFS_BANDWIDTH
> + /*
> + * TODO: This might trigger, I'm not sure/don't remember. Regardless,
> + * while we do not explicitly handle the case where h_kernel_running
> + * goes to 0, we will call account/check_cfs_rq_runtime at worst in
> + * entity_tick and notice that we can now properly do the full
> + * throttle_cfs_rq.
> + */
> + WARN_ON_ONCE(list_empty(&cfs_rq->kernel_children));

FWIW this triggers pretty much immediately in my environment (buildroot
over QEMU).

> @@ -5712,16 +5748,26 @@ static int tg_throttle_down(struct task_group *tg, void *data)
> cfs_rq->throttle_count++;
>
> return 0;
> }
>
> +static void enqueue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +static void dequeue_kernel(struct cfs_rq *cfs_rq, struct sched_entity *se, int count);
> +
> static bool throttle_cfs_rq(struct cfs_rq *cfs_rq)
> {
> struct rq *rq = rq_of(cfs_rq);
> struct cfs_bandwidth *cfs_b = tg_cfs_bandwidth(cfs_rq->tg);
> struct sched_entity *se;
> - long task_delta, idle_task_delta, dequeue = 1;
> + long task_delta, idle_task_delta, kernel_delta, dequeue = 1;
> +
> + /*
> + * We don't actually throttle, though account() will have made sure to
> + * resched us so that we pick into a kernel task.
> + */
> + if (cfs_rq->h_kernel_running)
> + return false;
>

So as long as a cfs_rq has at least one task executing in the kernel, we
keep its *nr_running intact. We do throttle the whole thing once all the
tasks are in userspace (or about to be anyway), and unthrottle_on_enqueue()
undoes that.

> @@ -8316,15 +8471,44 @@ static void check_preempt_wakeup_fair(struct rq *rq, struct task_struct *p, int
>
> preempt:
> resched_curr(rq);
> }
>
> +static void handle_kernel_task_prev(struct task_struct *prev)
> +{
> +#ifdef CONFIG_CFS_BANDWIDTH
> + struct sched_entity *se = &prev->se;
> + bool new_kernel = is_kernel_task(prev);
> + bool prev_kernel = !list_empty(&se->kernel_node);
> + /*
> + * These extra loops are bad and against the whole point of the merged
> + * PNT, but it's a pain to merge, particularly since we want it to occur
> + * before check_cfs_runtime.
> + */
> + if (prev_kernel && !new_kernel) {
> + WARN_ON_ONCE(!se->on_rq); /* dequeue should have removed us */
> + for_each_sched_entity(se) {
> + dequeue_kernel(cfs_rq_of(se), se, 1);
> + if (cfs_rq_throttled(cfs_rq_of(se)))
> + break;
> + }
> + } else if (!prev_kernel && new_kernel && se->on_rq) {
> + for_each_sched_entity(se) {
> + enqueue_kernel(cfs_rq_of(se), se, 1);
> + if (cfs_rq_throttled(cfs_rq_of(se)))
> + break;
> + }
> + }
> +#endif
> +}


I'm trying to grok what the implications of a second tasks_timeline would
be on picking - we'd want a RBtree update equivalent to put_prev_entity()'s
__enqueue_entity(), but that second tree doesn't follow the same
enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
__enqueue_entity() to force a rebalance make sense? That'd also take care
of updating the min_vruntime.