Re: Lower than expected CPU pressure in PSI

From: Johannes Weiner
Date: Wed Jan 15 2020 - 11:55:48 EST


On Fri, Jan 10, 2020 at 11:28:32AM -0800, Ivan Babrou wrote:
> I applied the patch on top of 5.5.0-rc3 and it's definitely better
> now, both competing cgroups report 500ms/s delay. Feel free to add
> Tested-by from me.

Thanks, Ivan!

> I'm still seeing /unified/system.slice at 385ms/s and /unified.slice
> at 372ms/s, do you have an explanation for that part? Maybe it's
> totally reasonable, but warrants a patch for documentation.

Yes, this is a combination of CPU pinning and how pressure is
calculated in SMP systems.

The stall times are defined as lost compute potential - which scales
with the number of concurrent threads - normalized to wallclock
time. See the "Multiple CPUs" section in kernel/sched/psi.c.
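
Roughly, the aggregation described in that section comes down to
something like this (paraphrasing the comment, the exact names are in
psi.c):

	threads = min(nr_nonidle_tasks, nr_cpus)
	   SOME = min(nr_delayed_tasks / threads, 1)

so the same number of delayed tasks translates into a bigger SOME
ratio when the group has fewer CPUs - and thus fewer possible
execution threads - available to it.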

Because system.slice is restricted to a subset of the CPUs, there is
less compute potential available in that group than in the parent,
which means the relative loss of potential can be higher.

It's a bit unintuitive because most cgroup metrics are absolute
numbers that only get bigger as you go up the tree. If we exported
both the numerator (waste) and denominator (compute potential) here,
the numbers would act more conventionally, with parent numbers always
bigger than the child's. But because pressure is normalized to
wallclock time, you only see the ratio at each level, and that can
shrink as you go up the tree if lower levels are CPU-constrained.
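
As a purely illustrative example (not your exact setup): say a group
is pinned to 2 out of 8 CPUs, and over a one-second window it has one
task running and one task stuck waiting for CPU the whole time. The
group wastes 1 of its 2 possible execution threads:

	pinned group:  1 delayed / 2 threads = 50%   -> ~500ms/s

At the parent level, that same delayed task is weighed against all 8
CPUs, most of which may be kept productive by other groups:

	parent:        1 delayed / 8 threads = 12.5% -> ~125ms/s

The numerator (wasted potential) at the parent is at least as big as
the child's, but the denominator (possible execution threads) is much
bigger, so the reported ratio drops.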

We could have exported both numbers, but for most use cases that would
be more confusing than helpful. And in practice it's the ratio that
really matters: the pressure in the leaf cgroups is high due to the
CPU restriction; but when you go higher up the tree and consider not
just the pinned tasks but also the tasks in other groups that have
more CPUs available to them, the aggregate productivity at that level
*is* actually higher.

I hope that helps!