Re: [PATCH v4 0/6] sched_ext: Support high-performance monotonically non-decreasing clock

From: Andrea Righi
Date: Mon Dec 09 2024 - 04:53:59 EST


On Mon, Dec 09, 2024 at 03:15:25PM +0900, Changwoo Min wrote:
> Many BPF schedulers (such as scx_central, scx_lavd, scx_rusty, scx_bpfland,
> and scx_flash) frequently call bpf_ktime_get_ns() for tracking tasks' runtime
> properties. If supported, bpf_ktime_get_ns() eventually reads a hardware
> timestamp counter (TSC). However, reading a hardware TSC is not
> performant on some hardware platforms, degrading IPC.
>
> This patchset addresses the performance problem of reading hardware TSC
> by leveraging the rq clock in the scheduler core, introducing a
> scx_bpf_now_ns() function for BPF schedulers. Whenever the rq clock
> is fresh and valid, scx_bpf_now_ns() provides the rq clock, which is
> already updated by the scheduler core (update_rq_clock), so it can reduce
> the number of hardware TSC reads.
>
> When the rq lock is released (rq_unpin_lock), the rq clock is invalidated,
> so a subsequent scx_bpf_now_ns() call gets the fresh sched_clock for the caller.
>
> In addition, scx_bpf_now_ns() guarantees the clock is monotonically
> non-decreasing for the same CPU, so the clock cannot go backward
> in the same CPU.
>
> Using scx_bpf_now_ns() reduces the number of hardware TSC reads
> by 40-70% (65% for scx_lavd, 58% for scx_bpfland, and 43% for scx_rusty)
> for the following benchmark:
>
> perf bench -f simple sched messaging -t -g 20 -l 6000
>
> The patchset begins by managing the status of rq clock in the scheduler
> core, then implementing scx_bpf_now_ns(), and finally applying it to the
> BPF schedulers.
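
For readers following along, the caching behaviour described in the cover
letter can be sketched as a small userspace model. This is only an
illustration under stated assumptions, not the code from the patches: the
model_* names, the struct layout and the simulated hardware counter are all
hypothetical, and the real implementation works on struct rq under the rq
lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model of one CPU's runqueue clock state. */
struct model_rq {
	uint64_t clock;       /* clock value cached by the "scheduler core" */
	uint64_t prev_clock;  /* last value handed out on this CPU */
	bool clock_valid;     /* true while the cached clock may be reused */
	uint64_t hw_time;     /* simulated hardware counter */
	uint64_t hw_reads;    /* number of simulated hardware reads */
};

/* Stand-in for an expensive hardware timestamp read. */
static uint64_t model_read_hw_clock(struct model_rq *rq)
{
	rq->hw_reads++;
	rq->hw_time += 100;
	return rq->hw_time;
}

/* The core has just computed a fresh clock: cache it and mark it valid. */
static void model_rq_clock_update(struct model_rq *rq, uint64_t clock)
{
	rq->clock = clock;
	rq->clock_valid = true;
}

/* The rq lock is being released: the cached clock must not be reused. */
static void model_rq_clock_stale(struct model_rq *rq)
{
	rq->clock_valid = false;
}

/*
 * Return the cached clock while it is valid, otherwise read the
 * simulated hardware clock. The result is clamped so that it never
 * goes backward relative to the previous value returned for this rq.
 */
static uint64_t model_now_ns(struct model_rq *rq)
{
	uint64_t now;

	assert(rq != NULL);

	now = rq->clock_valid ? rq->clock : model_read_hw_clock(rq);
	if (now < rq->prev_clock)
		now = rq->prev_clock;
	rq->prev_clock = now;
	return now;
}
```

In this model, repeated model_now_ns() calls between an update and the next
stale mark perform no simulated hardware read, which is the effect the cover
letter measures. The clamp against prev_clock is one simple way to keep the
value non-decreasing on a CPU and may differ from what the patches do.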

I left a few comments, but overall it looks good to me. I also ran some
tests with this applied and scx_bpfland modified to use the new
scx_bpf_now_ns(); no issues to report, therefore:

Acked-by: Andrea Righi <arighi@xxxxxxxxxx>

>
> ChangeLog v3 -> v4:
> - Separate the code relocation related to scx_enabled() into a
> separate patch.
> - Remove scx_rq_clock_stale() after (or before) ops.running() and
> ops.update_idle() calls
> - Rename scx_bpf_clock_get_ns() into scx_bpf_now_ns() and revise it to
> address the comments
> - Move the per-CPU variable holding a prev clock into scx_rq
> (rq->scx.prev_clock)
> - Add a comment describing when the clock could go backward in
> scx_bpf_now_ns()
> - Rebase the code to the tip of Tejun's sched_ext repo (for-next
> branch)
>
> ChangeLog v2 -> v3:
> - To avoid unnecessarily modifying cache lines, scx_rq_clock_update()
> and scx_rq_clock_stale() update the clock and flags only when a
> sched_ext scheduler is enabled.
>
> ChangeLog v1 -> v2:
> - Rename SCX_RQ_CLK_UPDATED to SCX_RQ_CLK_VALID to denote the validity
> of an rq clock clearly.
> - Rearrange the clock and flags fields in struct scx_rq to make sure
> they are in the same cacheline to minimize the cache misses
> - Add an additional explanation to the commit message in the 2/5 patch
> describing when the rq clock will be reused with an example.
> - Fix typos
> - Rebase the code to the tip of Tejun's sched_ext repo
>
> Changwoo Min (6):
> sched_ext: Relocate scx_enabled() related code
> sched_ext: Implement scx_rq_clock_update/stale()
> sched_ext: Manage the validity of scx_rq_clock
> sched_ext: Implement scx_bpf_now_ns()
> sched_ext: Add scx_bpf_now_ns() for BPF scheduler
> sched_ext: Replace bpf_ktime_get_ns() to scx_bpf_now_ns()
>
> kernel/sched/core.c | 6 +-
> kernel/sched/ext.c | 73 ++++++++++++++++++++++++
> kernel/sched/sched.h | 52 ++++++++++++-----
> tools/sched_ext/include/scx/common.bpf.h | 1 +
> tools/sched_ext/include/scx/compat.bpf.h | 5 ++
> tools/sched_ext/scx_central.bpf.c | 4 +-
> tools/sched_ext/scx_flatcg.bpf.c | 2 +-
> 7 files changed, 124 insertions(+), 19 deletions(-)
>
> --
> 2.47.1
>

-Andrea