Re: [External] Re: [PATCH] psi: Use ONCPU state tracking machinery to detect reclaim

From: Chengming Zhou
Date: Thu Feb 11 2021 - 07:50:52 EST


Hello Johannes,

On 2021/2/11 4:37 AM, Johannes Weiner wrote:
> On Wed, Feb 10, 2021 at 12:06:05PM +0800, Chengming Zhou wrote:
>> Move the reclaim detection from the timer tick to the task state
>> tracking machinery using the recently added ONCPU state. We also
>> add a check for memstall state changes in the psi_task_switch()
>> optimization so that the parent groups are updated properly.
>>
>> Thanks to Johannes Weiner for his suggestions on the psi_task_switch()
>> optimization and for the clearer changelog.
>>
>> Signed-off-by: Muchun Song <songmuchun@xxxxxxxxxxxxx>
>> Signed-off-by: Chengming Zhou <zhouchengming@xxxxxxxxxxxxx>
> Thanks for the update, Chengming.
>
> It would be good to include a rationale in the changelog that explains
> why we're doing this. Performance and cost are a bit questionable,
> IMO, because it doesn't look cheaper in aggregate...

Yes, there may be no obvious win in terms of performance and cost. But I think
the statistics may be a little more accurate, since reclaim is now detected at
context-switch granularity instead of being sampled from the timer tick, and
the code becomes simpler. :)

>> ---
>> include/linux/psi.h | 1 -
>> kernel/sched/core.c | 1 -
>> kernel/sched/psi.c | 52 ++++++++++++++++------------------------------------
>> kernel/sched/stats.h | 9 ---------
>> 4 files changed, 16 insertions(+), 47 deletions(-)
> ...but the code is simpler and shorter this way: fewer lines, and
> we're removing one of the hooks into the scheduler. So it's a
> maintainability win, I would say.
>
> I did some testing with perf bench. The new code appears to have
> slightly more overhead (which is expected), although the error bars
> overlap to a point where I don't think it would matter on real loads.
>
> I tested an additional version of the code that adds unlikely()
> annotations to move the pressure state branches out of line - since
> they are after all exceptional situations. It seems to help -
> especially the pipe bench actually looks better than on vanilla.

Thank you so much for these tests. I didn't go that far myself, and I've
learned that unlikely() can be used like this.
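
For reference, in case it helps other readers of this thread: as far as I
understand, the annotation is just a branch prediction hint for the compiler.
Paraphrasing the definitions in include/linux/compiler.h (not part of this
patch):

	/* hint that the condition is almost always true / false */
	# define likely(x)	__builtin_expect(!!(x), 1)
	# define unlikely(x)	__builtin_expect(!!(x), 0)

So wrapping the pressure-state tests in unlikely() lets the compiler lay out
the no-pressure path as the straight-line fall-through and move the state
update code out of line, which matches the numbers you measured below.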

> Attached below is the incremental patch.
>
> ---
>
> perf stat -r 3 -- perf sched bench messaging -l 10000
>
> vanilla
> 125,833.65 msec task-clock:u # 22.102 CPUs utilized ( +- 1.94% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 69,526 page-faults:u # 0.553 K/sec ( +- 0.79% )
> 8,189,667,649 cycles:u # 0.065 GHz ( +- 1.49% ) (83.31%)
> 2,184,284,296 stalled-cycles-frontend:u # 26.67% frontend cycles idle ( +- 4.37% ) (83.32%)
> 1,152,703,719 stalled-cycles-backend:u # 14.08% backend cycles idle ( +- 0.56% ) (83.37%)
> 2,483,312,679 instructions:u # 0.30 insn per cycle
> # 0.88 stalled cycles per insn ( +- 0.15% ) (83.33%)
> 781,332,174 branches:u # 6.209 M/sec ( +- 0.13% ) (83.35%)
> 159,531,476 branch-misses:u # 20.42% of all branches ( +- 0.17% ) (83.32%)
>
> 5.6933 +- 0.0911 seconds time elapsed ( +- 1.60% )
>
> patched
> 129,756.92 msec task-clock:u # 22.243 CPUs utilized ( +- 1.92% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 69,904 page-faults:u # 0.539 K/sec ( +- 0.67% )
> 8,518,161,670 cycles:u # 0.066 GHz ( +- 2.19% ) (83.30%)
> 2,337,165,666 stalled-cycles-frontend:u # 27.44% frontend cycles idle ( +- 5.47% ) (83.32%)
> 1,148,789,343 stalled-cycles-backend:u # 13.49% backend cycles idle ( +- 0.05% ) (83.35%)
> 2,483,527,911 instructions:u # 0.29 insn per cycle
> # 0.94 stalled cycles per insn ( +- 0.18% ) (83.38%)
> 782,138,388 branches:u # 6.028 M/sec ( +- 0.09% ) (83.33%)
> 160,131,311 branch-misses:u # 20.47% of all branches ( +- 0.16% ) (83.31%)
>
> 5.834 +- 0.106 seconds time elapsed ( +- 1.81% )
>
> patched-unlikely
> 127,437.78 msec task-clock:u # 22.184 CPUs utilized ( +- 0.74% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 70,063 page-faults:u # 0.550 K/sec ( +- 0.53% )
> 8,453,581,973 cycles:u # 0.066 GHz ( +- 1.49% ) (83.34%)
> 2,327,192,242 stalled-cycles-frontend:u # 27.53% frontend cycles idle ( +- 2.43% ) (83.32%)
> 1,146,196,558 stalled-cycles-backend:u # 13.56% backend cycles idle ( +- 0.35% ) (83.34%)
> 2,486,920,732 instructions:u # 0.29 insn per cycle
> # 0.94 stalled cycles per insn ( +- 0.10% ) (83.34%)
> 781,067,666 branches:u # 6.129 M/sec ( +- 0.15% ) (83.34%)
> 160,104,212 branch-misses:u # 20.50% of all branches ( +- 0.10% ) (83.33%)
>
> 5.7446 +- 0.0418 seconds time elapsed ( +- 0.73% )
>
> ---
>
> perf stat -r 3 -- perf bench sched pipe
>
> vanilla
> 14,086.14 msec task-clock:u # 1.009 CPUs utilized ( +- 6.52% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 1,467 page-faults:u # 0.104 K/sec ( +- 0.06% )
> 306,181,835 cycles:u # 0.022 GHz ( +- 2.13% ) (83.41%)
> 43,975,811 stalled-cycles-frontend:u # 14.36% frontend cycles idle ( +- 14.45% ) (83.05%)
> 52,429,386 stalled-cycles-backend:u # 17.12% backend cycles idle ( +- 0.28% ) (83.58%)
> 93,097,176 instructions:u # 0.30 insn per cycle
> # 0.56 stalled cycles per insn ( +- 0.36% ) (83.23%)
> 35,351,661 branches:u # 2.510 M/sec ( +- 0.21% ) (83.37%)
> 6,124,932 branch-misses:u # 17.33% of all branches ( +- 0.51% ) (83.36%)
>
> 13.955 +- 0.164 seconds time elapsed ( +- 1.17% )
>
> patched
> 14,574.69 msec task-clock:u # 1.040 CPUs utilized ( +- 0.87% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 1,469 page-faults:u # 0.101 K/sec ( +- 0.13% )
> 302,769,739 cycles:u # 0.021 GHz ( +- 1.19% ) (83.17%)
> 37,638,522 stalled-cycles-frontend:u # 12.43% frontend cycles idle ( +- 0.31% ) (83.47%)
> 46,206,055 stalled-cycles-backend:u # 15.26% backend cycles idle ( +- 6.56% ) (83.34%)
> 92,566,358 instructions:u # 0.31 insn per cycle
> # 0.50 stalled cycles per insn ( +- 0.51% ) (83.45%)
> 35,667,707 branches:u # 2.447 M/sec ( +- 0.67% ) (83.23%)
> 6,224,587 branch-misses:u # 17.45% of all branches ( +- 2.24% ) (83.35%)
>
> 14.010 +- 0.245 seconds time elapsed ( +- 1.75% )
>
> patched-unlikely
> 13,470.99 msec task-clock:u # 1.024 CPUs utilized ( +- 3.10% )
> 0 context-switches:u # 0.000 K/sec
> 0 cpu-migrations:u # 0.000 K/sec
> 1,477 page-faults:u # 0.110 K/sec ( +- 0.09% )
> 310,752,740 cycles:u # 0.023 GHz ( +- 1.32% ) (83.35%)
> 44,894,078 stalled-cycles-frontend:u # 14.45% frontend cycles idle ( +- 13.24% ) (83.47%)
> 52,540,903 stalled-cycles-backend:u # 16.91% backend cycles idle ( +- 0.36% ) (82.96%)
> 92,296,178 instructions:u # 0.30 insn per cycle
> # 0.57 stalled cycles per insn ( +- 0.48% ) (83.44%)
> 35,316,802 branches:u # 2.622 M/sec ( +- 0.06% ) (83.32%)
> 6,173,049 branch-misses:u # 17.48% of all branches ( +- 0.66% ) (83.47%)
>
> 13.161 +- 0.293 seconds time elapsed ( +- 2.22% )
>
>> @@ -833,7 +827,8 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
>> */
>> iter = NULL;
>> while ((group = iterate_groups(next, &iter))) {
>> - if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
>> + if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
>> + next->in_memstall == prev->in_memstall) {
>> common = group;
>> break;
> It'd be better to compare psi_flags instead of just in_memstall: it's
> clearer and also more robust against future changes (even though it's
> somewhat unlikely we grow more states). It's also an invariant
> throughout the loop, so we should move it out.
Ok, that makes a lot of sense.
>
> The comment above the loop is now stale too.
>
> Can you fold the following into your patch?

Of course, I will fold the following into the patch and send a v2 for more review.

Thank you.

>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8735d5f291dc..6d4a246ef131 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -809,18 +809,21 @@ void psi_task_switch(struct task_struct *prev, struct task_struct *next,
> void *iter;
>
> if (next->pid) {
> + bool identical_state;
> +
> psi_flags_change(next, 0, TSK_ONCPU);
> /*
> - * When moving state between tasks, the group that
> - * contains them both does not change: we can stop
> - * updating the tree once we reach the first common
> - * ancestor. Iterate @next's ancestors until we
> - * encounter @prev's state.
> + * When switching between tasks that have an identical
> + * runtime state, the cgroup that contains both tasks
> + * does not change: we can stop updating the tree once
> + * we reach the first common ancestor. Iterate @next's
> + * ancestors only until we encounter @prev's ONCPU.
> */
> + identical_state = prev->psi_flags == next->psi_flags;
> iter = NULL;
> while ((group = iterate_groups(next, &iter))) {
> - if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU] &&
> - next->in_memstall == prev->in_memstall) {
> + if (identical_state &&
> + per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
> common = group;
> break;
> }
>
> Otherwise, this looks good to me. Peter, what do you think?
>
> ---
>
> From cf36f1dde714a2c543f5947e47138de32468f33a Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Wed, 10 Feb 2021 13:38:34 -0500
> Subject: [PATCH] psi: pressure states are unlikely
>
> Move the unlikely branches out of line. This eliminates undesirable
> jumps during wakeup and sleeps for workloads that aren't under any
> sort of resource pressure.
>
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> ---
> kernel/sched/psi.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
> index 8735d5f291dc..7fbacd6347a6 100644
> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -216,15 +216,15 @@ static bool test_state(unsigned int *tasks, enum psi_states state)
> {
> switch (state) {
> case PSI_IO_SOME:
> - return tasks[NR_IOWAIT];
> + return unlikely(tasks[NR_IOWAIT]);
> case PSI_IO_FULL:
> - return tasks[NR_IOWAIT] && !tasks[NR_RUNNING];
> + return unlikely(tasks[NR_IOWAIT] && !tasks[NR_RUNNING]);
> case PSI_MEM_SOME:
> - return tasks[NR_MEMSTALL];
> + return unlikely(tasks[NR_MEMSTALL]);
> case PSI_MEM_FULL:
> - return tasks[NR_MEMSTALL] && !tasks[NR_RUNNING];
> + return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
> case PSI_CPU_SOME:
> - return tasks[NR_RUNNING] > tasks[NR_ONCPU];
> + return unlikely(tasks[NR_RUNNING] > tasks[NR_ONCPU]);
> case PSI_NONIDLE:
> return tasks[NR_IOWAIT] || tasks[NR_MEMSTALL] ||
> tasks[NR_RUNNING];
> @@ -721,7 +721,7 @@ static void psi_group_change(struct psi_group *group, int cpu,
> * task in a cgroup is in_memstall, the corresponding groupc
> * on that cpu is in PSI_MEM_FULL state.
> */
> - if (groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall)
> + if (unlikely(groupc->tasks[NR_ONCPU] && cpu_curr(cpu)->in_memstall))
> state_mask |= (1 << PSI_MEM_FULL);
>
> groupc->state_mask = state_mask;
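
One small observation on this patch, if I read it correctly: PSI_NONIDLE is
left without an unlikely() annotation, which makes sense to me, since whenever
these hooks run there is almost always at least one non-idle task on the CPU,
so that branch is the common case rather than the exception.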