Re: [PATCH] RFC: sched: Rework task_sched_runtime to avoid calling update_rq_clock

From: Peter Zijlstra
Date: Thu Jun 13 2024 - 06:05:00 EST


On Wed, Jun 12, 2024 at 06:58:26PM -0700, John Stultz wrote:
> I recently got a bug report that
> clock_gettime(CLOCK_THREAD_CPUTIME_ID,...) had regressed between
> 5.10 and 6.1. It's not a huge regression in absolute time
> (~30-40ns), but it is a >10% change.
>
> I narrowed the cause down to the addition of
> psi_account_irqtime() in update_rq_clock_task(), in commit
> 52b1364ba0b1 ("sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ
> pressure")
>
> So that explains the behavior change,

It doesn't really... that patch just feeds the irq_time we *already*
subtracted prior to it, to PSI, such that PSI can also enjoy the 'view'.

The only explanation I have is that interrupts that end up in the
scheduler (wakeups, tick, etc.) now get to do that PSI cgroup iteration,
and that cost adds to the IRQ time itself, and as such the task time
slows down accordingly.

Are you using silly deep cgroup hierarchies?
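
To illustrate why hierarchy depth matters here, below is a minimal
userspace sketch of a per-group walk that charges an IRQ delta to every
ancestor. The names group_model and account_irq_model are hypothetical
and are not the kernel's PSI structures; the only point is that the
work per call grows linearly with the depth of the hierarchy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a PSI group; not the kernel's struct psi_group. */
struct group_model {
	struct group_model *parent;	/* NULL at the root */
	uint64_t irq_time;		/* IRQ time charged to this group */
};

/*
 * Charge 'delta' to the given group and to each of its ancestors.
 * Returns the number of groups visited, i.e. the depth of the walk.
 */
static unsigned int account_irq_model(struct group_model *group, uint64_t delta)
{
	unsigned int visited = 0;

	assert(group != NULL);	/* caller passes the task's own (leaf) group */

	do {
		group->irq_time += delta;
		visited++;
	} while ((group = group->parent));

	return visited;
}
```

With a three-level chain (root, middle, leaf), a call on the leaf
touches three groups; a deeper hierarchy touches proportionally more.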

> but it also seems odd that
> we're doing psi irq accounting from a syscall that is just
> trying to read the thread's cputime.

In order to avoid doing all the accounting in the IRQ entry/exit paths,
those paths only do the bare minimum of tracking how much IRQ time there
is.

update_rq_clock_task() then looks at the increase of IRQ time and
subtracts this from the task time -- after all, all time spent in the
IRQ wasn't spent on the task itself.

It did that prior to the PSI patch, and it does so after. The only
change is it now feeds this delta into the PSI thing.
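
As a rough userspace model of that subtraction (not the kernel code
itself): the struct rq_model type, the update_clock_task_model() helper
and the clamp of the IRQ delta to the elapsed time are assumptions made
for this sketch; only the prev_irq_time bookkeeping and the
"delta -= irq_delta" step are taken from the hunk below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model; field names echo struct rq but this is not kernel code. */
struct rq_model {
	uint64_t clock_task;	/* time charged to tasks */
	uint64_t prev_irq_time;	/* IRQ time already accounted for */
};

/*
 * delta:     elapsed clock time since the previous update
 * irq_total: cumulative IRQ time, as tracked by the IRQ entry/exit paths
 *
 * Returns the IRQ delta that was taken out of 'delta'.
 */
static int64_t update_clock_task_model(struct rq_model *rq, int64_t delta,
				       uint64_t irq_total)
{
	int64_t irq_delta = (int64_t)(irq_total - rq->prev_irq_time);

	assert(delta >= 0);

	/* Assumption of this sketch: never subtract more than has elapsed. */
	if (irq_delta > delta)
		irq_delta = delta;

	rq->prev_irq_time += irq_delta;
	delta -= irq_delta;

	/* Only the non-IRQ remainder advances the task clock. */
	rq->clock_task += delta;

	return irq_delta;
}
```

For example, 1000ns of elapsed time with 300ns of new IRQ time advances
clock_task by 700ns. In the tree before the patch below, that 300ns
irq_delta is the value passed on to psi_account_irqtime().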

> NOTE: I'm not 100% sure this is correct yet. There may be some
> edge cases I've overlooked, so I'd greatly appreciate any
> review or feedback.

Urgh, you're sprinkling the details of what is clock_task over multiple
places.

Does something like so work?

---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0935f9d4bb7b..d4b87539d72a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -724,7 +724,6 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)

rq->prev_irq_time += irq_delta;
delta -= irq_delta;
- psi_account_irqtime(rq->curr, irq_delta);
delayacct_irq(rq->curr, irq_delta);
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
@@ -5459,6 +5458,8 @@ void sched_tick(void)

sched_clock_tick();

+ psi_account_irqtime(curr, &rq->psi_irq_time);
+
rq_lock(rq, &rf);

update_rq_clock(rq);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 146baa91d104..57fdb0b9efbd 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -991,12 +991,13 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
}

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
-void psi_account_irqtime(struct task_struct *task, u32 delta)
+void psi_account_irqtime(struct task_struct *task, u64 *prev)
{
int cpu = task_cpu(task);
struct psi_group *group;
struct psi_group_cpu *groupc;
- u64 now;
+ u64 now, irq;
+ s64 delta;

if (static_branch_likely(&psi_disabled))
return;
@@ -1005,6 +1006,11 @@ void psi_account_irqtime(struct task_struct *task, u32 delta)
return;

now = cpu_clock(cpu);
+ irq = irq_time_read(cpu);
+ delta = (s64)(irq - *prev);
+ if (delta < 0)
+ return;
+ *prev = irq;

group = task_psi_group(task);
do {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 62fd8bc6fd08..a63eb546bc4a 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1133,6 +1133,7 @@ struct rq {

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
u64 prev_irq_time;
+ u64 psi_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
u64 prev_steal_time;
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index d1445410840a..1111f060264f 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -110,7 +110,7 @@ __schedstats_from_se(struct sched_entity *se)
void psi_task_change(struct task_struct *task, int clear, int set);
void psi_task_switch(struct task_struct *prev, struct task_struct *next,
bool sleep);
-void psi_account_irqtime(struct task_struct *task, u32 delta);
+void psi_account_irqtime(struct task_struct *task, u64 *prev);

/*
* PSI tracks state that persists across sleeps, such as iowaits and
@@ -192,7 +192,7 @@ static inline void psi_ttwu_dequeue(struct task_struct *p) {}
static inline void psi_sched_switch(struct task_struct *prev,
struct task_struct *next,
bool sleep) {}
-static inline void psi_account_irqtime(struct task_struct *task, u32 delta) {}
+static inline void psi_account_irqtime(struct task_struct *task, u64 *prev) {}
#endif /* CONFIG_PSI */

#ifdef CONFIG_SCHED_INFO
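
For reference, the snapshot-and-delta step added to
psi_account_irqtime() above can be modelled in userspace roughly as
follows. The helper name irq_delta_since() and its bool/out-parameter
shape are hypothetical; the kernel version reads the counter with
irq_time_read(cpu) and simply returns early on a negative delta.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare the current cumulative IRQ time against the caller's previous
 * snapshot. If the counter appears to have gone backwards, leave the
 * snapshot untouched and report no delta; otherwise advance the snapshot
 * and hand back the difference.
 */
static bool irq_delta_since(uint64_t irq_now, uint64_t *prev, int64_t *delta_out)
{
	int64_t delta;

	assert(prev != NULL && delta_out != NULL);

	delta = (int64_t)(irq_now - *prev);
	if (delta < 0)
		return false;

	*prev = irq_now;
	*delta_out = delta;
	return true;
}
```

Keeping a separate snapshot (psi_irq_time in the patch) means the PSI
side can compute its own delta from sched_tick() without depending on
when update_rq_clock_task() last updated prev_irq_time.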