[tip: sched/core] sched/psi: Fix avgs_work re-arm in psi_avgs_work()

From: tip-bot2 for Chengming Zhou
Date: Fri Oct 28 2022 - 02:43:09 EST


The following commit has been merged into the sched/core branch of tip:

Commit-ID: 7d89d7bb921c5ae5a428df282e64ee5692e26fe0
Gitweb: https://git.kernel.org/tip/7d89d7bb921c5ae5a428df282e64ee5692e26fe0
Author: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
AuthorDate: Mon, 10 Oct 2022 18:42:06 +08:00
Committer: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
CommitterDate: Thu, 27 Oct 2022 11:01:23 +02:00

sched/psi: Fix avgs_work re-arm in psi_avgs_work()

Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all. Because PSI_NONIDLE condition would be observed in
psi_avgs_work()->collect_percpu_times()->get_recent_times() even if
only the kworker running avgs_work on the CPU.

Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
still will always re-arm the avgs_work, so shutoff is not working.

This patch changes to consider current CPU groupc as IDLE if the
kworker running avgs_work is the only task running and no IOWAIT
or MEMSTALL sleep tasks, in which case we will shut off the avgs_work
if other CPUs' groupc are also IDLE.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will
be stranded in the per-cpu buckets until avgs_work run next time.
The buckets can hold 4s worth of time, and future activity will wake
the avgs_work with a 2s delay, giving us 2s worth of data we can leave
behind when shut off the avgs_work. If the kworker run other works after
avgs_work shut off and doesn't have any scheduler activities for 2s,
this maybe a problem.

Reported-by: Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx>
Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
Link: https://lore.kernel.org/r/20221010104206.12184-1-zhouchengming@xxxxxxxxxxxxx
---
kernel/sched/psi.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index ee2ecc0..f4cdf6f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -242,6 +242,8 @@ static void get_recent_times(struct psi_group *group, int cpu,
u32 *pchanged_states)
{
struct psi_group_cpu *groupc = per_cpu_ptr(group->pcpu, cpu);
+ int current_cpu = raw_smp_processor_id();
+ bool only_avgs_work = false;
u64 now, state_start;
enum psi_states s;
unsigned int seq;
@@ -256,6 +258,15 @@ static void get_recent_times(struct psi_group *group, int cpu,
memcpy(times, groupc->times, sizeof(groupc->times));
state_mask = groupc->state_mask;
state_start = groupc->state_start;
+ /*
+ * This CPU has only avgs_work kworker running, snapshot the
+ * newest times then don't need to re-arm for this groupc.
+ * Normally this kworker will sleep soon and won't wake
+ * avgs_work back up in psi_group_change().
+ */
+ if (current_cpu == cpu && groupc->tasks[NR_RUNNING] == 1 &&
+ !groupc->tasks[NR_IOWAIT] && !groupc->tasks[NR_MEMSTALL])
+ only_avgs_work = true;
} while (read_seqcount_retry(&groupc->seq, seq));

/* Calculate state time deltas against the previous snapshot */
@@ -280,6 +291,10 @@ static void get_recent_times(struct psi_group *group, int cpu,
if (delta)
*pchanged_states |= (1 << s);
}
+
+ /* Clear PSI_NONIDLE so avgs_work won't be re-armed for this groupc */
+ if (only_avgs_work)
+ *pchanged_states &= ~(1 << PSI_NONIDLE);
}

static void calc_avgs(unsigned long avg[3], int missed_periods,