Re: [RFC PATCH 1/2] sched/fair: Only throttle CFS tasks on return to userspace

From: Valentin Schneider
Date: Tue Dec 19 2023 - 06:51:09 EST


On 18/12/23 14:34, Benjamin Segall wrote:
> Valentin Schneider <vschneid@xxxxxxxxxx> writes:
>>
>> So if I'm trying to compare notes:
>>
>> The dual tree:
>> - Can actually throttle cfs_rq's once all tasks are in userspace
>> - Adds (light-ish) overhead to enqueue/dequeue
>> - Doesn't keep *nr_running updated when not fully throttled
>>
>> Whereas dequeue via task_work:
>> - Can never fully throttle a cfs_rq
>> - Adds (not so light) overhead to unthrottle
>> - Keeps *nr_running updated
>
> Yeah, this sounds right. (Though for the task_work version once all the
> tasks are throttled, the cfs_rq is dequeued like normal, so it doesn't
> seem like a big deal to me)
>

In my implementation the cfs_rq's can be dequeued once all of their tasks
have been throttled, but the tasks aren't migrated back onto the cfs_rq's
until the unthrottle (rather than at the "final" throttle).

That is done because if a random task gets migrated to a throttled & empty
cfs_rq, we still need to schedule it 'till it exits the kernel, and we
don't want the now-in-userspace task to suddenly become eligible for
picking.

That adds these O(ntasks) loops in the unthrottle to re-shape the
cfs_rq's. The dual tree approach saves us from doing all that, IIUC
unthrottle_on_enqueue() only puts the se's in the task's direct hierarchy
back onto the second tree for them to be picked.


>>
>> My thoughts right now are that there might be a way to mix both worlds: go
>> with the dual tree, but subtract from h_nr_running at dequeue_kernel()
>> (sort of doing the count update in throttle_cfs_rq() early) and keep track
>> of in-user tasks via handle_kernel_task_prev() to add this back when
>> unthrottling a not-fully-throttled cfs_rq. I need to actually think about
>> it though :-)
>
> Probably, but I had tons of trouble getting the accounting correct as it
> was :P
>

Yeah so did I, cgroups are "fun".

>>
>> I'm trying to grok what the implications of a second tasks_timeline would
>> be on picking - we'd want a RBtree update equivalent to put_prev_entity()'s
>> __enqueue_entity(), but that second tree doesn't follow the same
>> enqueue/dequeue cycle as the first one. Would a __dequeue_entity() +
>> __enqueue_entity() to force a rebalance make sense? That'd also take care
>> of updating the min_vruntime.
>
> Yeah, the most sensible thing to do would probably be to just have curr
> not be in the queue, the same as the normal rbtree, and save the "on
> kernel list" as an on_rq-equivalent bool instead of as !list_empty(kernel_node).

Makes sense, I'll have a go at this.