On Thu, Feb 20, 2025 at 09:32:35AM +0000, K Prateek Nayak wrote:
Proposed approach
=================
This approach builds on Ben Segall's proposal in [4], which marked the
task in schedule() when exiting to usermode by setting an
"in_return_to_user" flag, except this prototype takes it a step further
and marks a "kernel critical section" spanning the syscall boundary
using a per-task "kernel_cs_count".
The rationale behind this approach is that a task can only hold
kernel resources when running in kernel mode in a preemptible context.
In this POC, the entire syscall boundary is marked as a kernel critical
section, but in the future the API can be used to mark fine-grained
boundaries, such as between a down_read(), up_read() or a down_write(),
up_write() pair.
Within a kernel critical section, throttling events are deferred until
the task's "kernel_cs_count" drops to 0. Currently this count is a plain
integer so that any case where it turns negative (as a result of
oversights on my part) can be caught, but it could be changed to a
preempt_count-like mechanism to request a resched.
      cfs_rq throttled                       picked again
              v                              v
----|*********| (preempted by tick / wakeup) |***********| (full throttle)
    ^                                                    ^
    critical section    cfs_rq is throttled partially    critical section
    entry               since the task is still          exit
                        runnable, as it was preempted
                        in a kernel critical section
The EEVDF infrastructure is extended to track the avg_vruntime and
avg_load of only those entities preempted in kernel mode. When a cfs_rq
is throttled, it uses these metrics to select among the kernel-mode
preempted tasks and run them until they exit to user mode.
pick_eevdf() is made aware that it is operating on a throttled hierarchy
and only selects among the tasks that were preempted in kernel mode (and
the sched entities of the cfs_rqs that lead to them). When the throttled
task's "kernel_cs_count" hits 0, the entire hierarchy is frozen, but
until that point the hierarchy remains accessible for picking.
            root
           /    \
          A      B * (throttled)
         ...   / | \
              0  1* 2*

    (*) Preempted in kernel mode
o avg_kcs_vruntime = entity_key(1) * load(1) + entity_key(2) * load(2)
o avg_kcs_load = load(1) + load(2)
o throttled_vruntime_eligible:
entity preempted in kernel mode &&
entity_key(<>) * avg_kcs_load <= avg_kcs_vruntime
o The rbtree is augmented with a "min_kcs_vruntime" field in the sched
  entity that propagates the smallest vruntime of all the entities in
  the subtree that were preempted in kernel mode. If an entity was
  executing in usermode when preempted, its contribution is LLONG_MAX.
  This is used to construct a min-heap to select through the
  entities. Consider the rbtree of B to look like:
1*
/ \
2* 0
  se->min_kcs_vruntime = se_in_kernel(se) ? se->vruntime : LLONG_MAX;
  se->min_kcs_vruntime = min3(se->min_kcs_vruntime,
			      __node_2_se(rb_left)->min_kcs_vruntime,
			      __node_2_se(rb_right)->min_kcs_vruntime);
pick_eevdf() uses the min_kcs_vruntime on the virtual deadline sorted
tree to first check the left subtree for eligibility, then the node
itself, and then the right subtree.
*groan*... why not actually dequeue the tasks and only retain those with
non-zero cs_count? That avoids having to duplicate everything, no?