Re: [PATCH] psi: Use ONCPU state tracking machinery to detect reclaim

From: Johannes Weiner
Date: Wed Feb 10 2021 - 15:38:03 EST


On Wed, Feb 10, 2021 at 12:06:05PM +0800, Chengming Zhou wrote:
> Move the reclaim detection from the timer tick to the task state
> tracking machinery using the recently added ONCPU state. And we
> also check for memstall state changes in the psi_task_switch()
> optimization so that the parents are updated properly.
>
> Thanks to Johannes Weiner for pointing out the psi_task_switch()
> optimization details and for the clearer changelog.
>
> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>

Thanks for the update, Chengming.

It would be good to include a rationale in the changelog that explains
why we're doing this. Performance and cost alone are a questionable
justification, IMO, because it doesn't look cheaper in aggregate...

> ---
> include/linux/psi.h | 1 -
> kernel/sched/core.c | 1 -
> kernel/sched/psi.c | 52 ++++++++++++++++------------------------------------
> kernel/sched/stats.h | 9 ---------
> 4 files changed, 16 insertions(+), 47 deletions(-)

...but the code is simpler and shorter this way: fewer lines, and
we're removing one of the hooks into the scheduler. So it's a
maintainability win, I would say.

I did some testing with perf bench. The new code appears to have
slightly more overhead (which is expected), although the error bars
overlap to the point where I don't think it would matter on real loads.

I also tested a version of the code that adds unlikely() annotations
to move the pressure state branches out of line, since they are, after
all, exceptional situations. It seems to help; the pipe bench in
particular actually looks better than on vanilla.

Attached the incremental patch below.
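
To show what I mean by the annotation, here is a standalone sketch
rather than the attached patch itself: unlikely() is mocked with the
__builtin_expect() that the kernel macro uses, and the struct is a
stand-in rather than the real task_struct. The change just tells the
compiler that the memstall tests are the cold path:

/*
 * Illustrative sketch only, not the incremental patch: a userspace
 * mock of unlikely() applied to a rarely-true pressure-state test.
 */
#include <stdbool.h>
#include <stdio.h>

#define unlikely(x)	__builtin_expect(!!(x), 0)

struct task {
	bool in_memstall;	/* stand-in for task_struct::in_memstall */
};

static unsigned long memstall_switches;

static void account_switch(const struct task *prev, const struct task *next)
{
	/* Memstall is the exceptional case on most workloads. */
	if (unlikely(prev->in_memstall != next->in_memstall))
		memstall_switches++;
}

int main(void)
{
	struct task normal = { .in_memstall = false };
	struct task reclaiming = { .in_memstall = true };

	account_switch(&normal, &reclaiming);
	printf("memstall state changes seen: %lu\n", memstall_switches);
	return 0;
}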

---

perf stat -r 3 -- perf sched bench messaging -l 10000

vanilla
125,833.65 msec task-clock:u # 22.102 CPUs utilized ( +- 1.94% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
69,526 page-faults:u # 0.553 K/sec ( +- 0.79% )
8,189,667,649 cycles:u # 0.065 GHz ( +- 1.49% ) (83.31%)
2,184,284,296 stalled-cycles-frontend:u # 26.67% frontend cycles idle ( +- 4.37% ) (83.32%)
1,152,703,719 stalled-cycles-backend:u # 14.08% backend cycles idle ( +- 0.56% ) (83.37%)
2,483,312,679 instructions:u # 0.30 insn per cycle
# 0.88 stalled cycles per insn ( +- 0.15% ) (83.33%)
781,332,174 branches:u # 6.209 M/sec ( +- 0.13% ) (83.35%)
159,531,476 branch-misses:u # 20.42% of all branches ( +- 0.17% ) (83.32%)

5.6933 +- 0.0911 seconds time elapsed ( +- 1.60% )

patched
129,756.92 msec task-clock:u # 22.243 CPUs utilized ( +- 1.92% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
69,904 page-faults:u # 0.539 K/sec ( +- 0.67% )
8,518,161,670 cycles:u # 0.066 GHz ( +- 2.19% ) (83.30%)
2,337,165,666 stalled-cycles-frontend:u # 27.44% frontend cycles idle ( +- 5.47% ) (83.32%)
1,148,789,343 stalled-cycles-backend:u # 13.49% backend cycles idle ( +- 0.05% ) (83.35%)
2,483,527,911 instructions:u # 0.29 insn per cycle
# 0.94 stalled cycles per insn ( +- 0.18% ) (83.38%)
782,138,388 branches:u # 6.028 M/sec ( +- 0.09% ) (83.33%)
160,131,311 branch-misses:u # 20.47% of all branches ( +- 0.16% ) (83.31%)

5.834 +- 0.106 seconds time elapsed ( +- 1.81% )

patched-unlikely
127,437.78 msec task-clock:u # 22.184 CPUs utilized ( +- 0.74% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
70,063 page-faults:u # 0.550 K/sec ( +- 0.53% )
8,453,581,973 cycles:u # 0.066 GHz ( +- 1.49% ) (83.34%)
2,327,192,242 stalled-cycles-frontend:u # 27.53% frontend cycles idle ( +- 2.43% ) (83.32%)
1,146,196,558 stalled-cycles-backend:u # 13.56% backend cycles idle ( +- 0.35% ) (83.34%)
2,486,920,732 instructions:u # 0.29 insn per cycle
# 0.94 stalled cycles per insn ( +- 0.10% ) (83.34%)
781,067,666 branches:u # 6.129 M/sec ( +- 0.15% ) (83.34%)
160,104,212 branch-misses:u # 20.50% of all branches ( +- 0.10% ) (83.33%)

5.7446 +- 0.0418 seconds time elapsed ( +- 0.73% )

---

perf stat -r 3 -- perf bench sched pipe

vanilla
14,086.14 msec task-clock:u # 1.009 CPUs utilized ( +- 6.52% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,467 page-faults:u # 0.104 K/sec ( +- 0.06% )
306,181,835 cycles:u # 0.022 GHz ( +- 2.13% ) (83.41%)
43,975,811 stalled-cycles-frontend:u # 14.36% frontend cycles idle ( +- 14.45% ) (83.05%)
52,429,386 stalled-cycles-backend:u # 17.12% backend cycles idle ( +- 0.28% ) (83.58%)
93,097,176 instructions:u # 0.30 insn per cycle
# 0.56 stalled cycles per insn ( +- 0.36% ) (83.23%)
35,351,661 branches:u # 2.510 M/sec ( +- 0.21% ) (83.37%)
6,124,932 branch-misses:u # 17.33% of all branches ( +- 0.51% ) (83.36%)

13.955 +- 0.164 seconds time elapsed ( +- 1.17% )

patched
14,574.69 msec task-clock:u # 1.040 CPUs utilized ( +- 0.87% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,469 page-faults:u # 0.101 K/sec ( +- 0.13% )
302,769,739 cycles:u # 0.021 GHz ( +- 1.19% ) (83.17%)
37,638,522 stalled-cycles-frontend:u # 12.43% frontend cycles idle ( +- 0.31% ) (83.47%)
46,206,055 stalled-cycles-backend:u # 15.26% backend cycles idle ( +- 6.56% ) (83.34%)
92,566,358 instructions:u # 0.31 insn per cycle
# 0.50 stalled cycles per insn ( +- 0.51% ) (83.45%)
35,667,707 branches:u # 2.447 M/sec ( +- 0.67% ) (83.23%)
6,224,587 branch-misses:u # 17.45% of all branches ( +- 2.24% ) (83.35%)

14.010 +- 0.245 seconds time elapsed ( +- 1.75% )

patched-unlikely
13,470.99 msec task-clock:u # 1.024 CPUs utilized ( +- 3.10% )
0 context-switches:u # 0.000 K/sec
0 cpu-migrations:u # 0.000 K/sec
1,477 page-faults:u # 0.110 K/sec ( +- 0.09% )
310,752,740 cycles:u # 0.023 GHz ( +- 1.32% ) (83.35%)
44,894,078 stalled-cycles-frontend:u # 14.45% frontend cycles idle ( +- 13.24% ) (83.47%)
52,540,903 stalled-cycles-backend:u # 16.91% backend cycles idle ( +- 0.36% ) (82.96%)
92,296,178 instructions:u # 0.30 insn per cycle
# 0.57 stalled cycles per insn ( +- 0.48% ) (83.44%)
35,316,802 branches:u # 2.622 M/sec ( +- 0.06% ) (83.32%)
6,173,049 branch-misses:u # 17.48% of all branches ( +- 0.66% ) (83.47%)

13.161 +- 0.293 seconds time elapsed ( +- 2.22% )

> @@ -833,7 +827,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>  		 */
>  		iter = NULL;
>  		while ((group = iterate_groups(next, &iter))) {
> -			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> +			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
> +			    next->in_memstall == prev->in_memstall) {
>  				common = group;
>  				break;

It'd be better to compare psi_flags instead of just in_memstall: it's
clearer and also more robust against future changes (even though it's
somewhat unlikely we'll grow more states). It's also invariant
throughout the loop, so we should hoist the check out of it.

The comment above the loop is now stale too.

Can you fold the following into your patch?

diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 8735d5f291dc..6d4a246ef131 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -809,18 +809,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
 	void *iter;
 
 	if (next->pid) {
+		bool identical_state;
+
 		psi_flags_change(next, 0, TSK_ONCPU);
 		/*
-		 * When moving state between tasks, the group that
-		 * contains them both does not change: we can stop
-		 * updating the tree once we reach the first common
-		 * ancestor. Iterate @next's ancestors until we
-		 * encounter @prev's state.
+		 * When switching between tasks that have an identical
+		 * runtime state, the cgroup that contains both tasks
+		 * does not change: we can stop updating the tree once
+		 * we reach the first common ancestor. Iterate @next's
+		 * ancestors only until we encounter @prev's ONCPU.
 		 */
+		identical_state = prev->psi_flags == next->psi_flags;
 		iter = NULL;
 		while ((group = iterate_groups(next, &iter))) {
-			if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
-			    next->in_memstall == prev->in_memstall) {
+			if (identical_state &&
+			    per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
 				common = group;
 				break;
 			}

Otherwise, this looks good to me. Peter, what do you think?

---