Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

From: Benjamin Segall
Date: Fri Dec 01 2023 - 17:09:52 EST


Valentin Schneider <vschneid@xxxxxxxxxx> writes:

> On 30/11/23 13:26, Benjamin Segall wrote:
>> Valentin Schneider <vschneid@xxxxxxxxxx> writes:
>>
>> The alternative we've been experimenting with (and still running into
>> other issues that have made it hard to tell if they work) is to still
>> leave the tasks on their cfs_rqs, but instead have two task_timelines or
>> similar per cfs_rq, one of which only has unthrottleable tasks (or
>> partially-throttled child cgroups) on it. Then when picking into a
>> partially-unthrottled cfs_rq you only look at that alternate task_timeline.
>>
>
> IIUC then you don't dequeue the cfs_rq's se, but instead rely on the RB
> tree switch to only have unthrottable tasks running.

Correct.

>
>> This means that we get to skip this per-actually-throttled-task loop:
>>
>>> @@ -5548,7 +5548,61 @@ static int tg_unthrottle_up(struct task_group *tg, void *data)
>>> {
>>> struct rq *rq = data;
>>> struct cfs_rq *cfs_rq = tg->cfs_rq[cpu_of(rq)];
>>> + struct sched_entity *se = tg->se[cpu_of(rq)];
>>> + struct cfs_rq *pcfs_rq = cfs_rq_of(se);
>>> + long task_delta = 0, idle_task_delta = 0;
>>> + struct task_struct *p, *tmp;
>>>
>>> + /*
>>> + * Re-enqueue the tasks that have been throttled at this level.
>>> + *
>>> + * The task count is up-propagated via ->unthrottled_*h_nr_running,
>>> + * as we can't purely rely on h_nr_running post-enqueue: the unthrottle
>>> + * might happen when a cfs_rq still has some tasks enqueued, either still
>>> + * making their way to userspace, or freshly migrated to it.
>>> + */
>>> + list_for_each_entry_safe(p, tmp, &cfs_rq->throttled_limbo_list, throttle_node) {
>>> + struct sched_entity *pse = &p->se;
>>> +
>>> + list_del_init(&p->throttle_node);
>>> + enqueue_entity(cfs_rq, pse, ENQUEUE_WAKEUP);
>>> + task_delta++;
>>> + idle_task_delta += task_has_idle_policy(p);
>>> + }
>>
>> The downsides are that you instead have extra operations per
>> enqueue/dequeue/pick (but just an extra list/rbtree operation or check),
>> and that it doesn't do *any* accounting change for a partially dequeued
>> cfs_rq.
>>
>
> I would assume we want to keep the *nr_running as close to reality as
> possible, given their impact on pick_next_task_fair() & load_balance().
>

Yeah, and while it's maybe ok for the longer-period load_balance, it's
definitely sketchy for the shorter term things like select_idle_sibling.
In theory we could duplicate more and more of the accounting... (though
that also then becomes questionable amounts of increased overhead on
enqueue/dequeue).

>> I'm going to try putting together a cleaner variant of our version that
>> works via task_work instead of bracketing every relevant entry point.
>> (That design came from when we were trying instead to only do it for
>> tasks holding actual locks)
>
> Interesting, thank you for sharing! I assume then the motivation for this
> is to reduce latencies caused by throttling lock holders?

Yeah, and then we ran into things like percpu-rwsem where lock/unlock
accounting doesn't work as well, but kept the basic design.