On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
Proposed approach
=================
This approach builds on Ben Segall's proposal in [4], which marked the
task in schedule() when exiting to usermode by setting an
"in_return_to_user" flag, except this prototype takes it a step further
and marks a "kernel critical section" spanning the syscall boundary
using a per-task "kernel_cs_count".
The rationale behind this approach is that a task can only hold
kernel resources when running in kernel mode in a preemptible context.
In this POC, the entire syscall boundary is marked as a kernel critical
section, but in the future the API can be used to mark fine-grained
boundaries, such as between a down_read(), up_read() or a down_write(),
up_write() pair.
Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" drops to 0. Currently this count is a plain
integer so that any case where it turns negative (as a result of
oversights on my part) can be caught, but it could be changed to a
preempt_count-like mechanism to request a resched.
      cfs_rq throttled                       picked again
              v                              v
----|*********| (preempted by tick / wakeup) |***********| (full throttle)
    ^                                                    ^
    critical section    cfs_rq is throttled partially    critical section
    entry               since the task is still          exit
                        runnable, as it was preempted
                        in a kernel critical section
The EEVDF infrastructure is extended to track the avg_vruntime and
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel-mode
preempted tasks and run them until they exit to user mode.
pick_eevdf() is made aware that it is operating on a throttled hierarchy
and only selects among the tasks that were preempted in kernel mode (and
the sched entities of the cfs_rqs that lead to them). When the throttled
task's "kernel_cs_count" hits 0, the entire hierarchy is frozen, but
until that point the hierarchy remains accessible for picking.
            root
           /    \
          A      B * (throttled)
         ...   / | \
              0  1* 2*

    (*) Preempted in kernel mode
o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
o avg_kcs_load = load(1) + load(2)
o throttled_vruntime_eligible:
entity preempted in kernel mode &&
entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
o The rbtree is augmented with a "min_kcs_vruntime" field in the sched
  entity that propagates the smallest vruntime of all the entities in
  the subtree that were preempted in kernel mode. If an entity was
  executing in usermode when preempted, its contribution is LLONG_MAX.
  This is used to construct a min-heap to select through the
  entities. Consider the rbtree of B to look like:
1*
/ \
2* 0
  se->min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
  se->min_kcs_vruntime = min3(se->min_kcs_vruntime,
			      __node_2_se(rb_left)->min_kcs_vruntime,
			      __node_2_se(rb_right)->min_kcs_vruntime);
pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
tree to first check the left subtree for eligibility, then the node
itself, and then the right subtree.
*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?