Re: Lower than expected CPU pressure in PSI

From: Johannes Weiner
Date: Mon Feb 10 2020 - 13:04:37 EST


On Sat, Feb 08, 2020 at 11:19:57AM +0100, Peter Zijlstra wrote:
> On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> > On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > > We added reporting for PSI in cgroups and results are somewhat surprising.
> > > >
> > > > My test setup consists of 3 services:
> > > >
> > > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > > > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> > > >
> > > > First service runs unconstrained, the other two compete for CPU.
> > > >
> > > > As expected, I can see 500ms/s sched delay for the latter two and
> > > > aggregated 1000ms/s delay for /system.slice, no surprises here.
> > > >
> > > > However, CPU pressure reported by PSI says that none of my services
> > > > have any pressure on them. I can see around 434ms/s pressure on
> > > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > > is surprising for three reasons:
> > > >
> > > > * Pressure is absent for my services (I expect it to match sched delay)
> > > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > > * Pressure on root cgroup is lower than on system.slice
> > >
> > > CPU pressure is currently implemented based only on the number of
> > > *runnable* tasks, not on who gets to actively use the CPU. This works
> > > for contention within cgroups or at the global scope, but it doesn't
> > > correctly reflect competition between cgroups. It also doesn't show
> > > the effects of e.g. cpu cycle limiting through cpu.max where there
> > > might *be* only one runnable task, but it's not getting the CPU.
> > >
> > > I've been working on fixing this, but hadn't gotten around to sending
> > > the patch upstream. Attaching it below. Would you mind testing it?
> > >
> > > Peter, what would you think of the below?
> >
> > I'm not loving it; but I see what it does and I can't quickly see an
> > alternative.
> >
> > My main gripe is doing even more of those cgroup traversals.
> >
> > One thing pick_next_task_fair() does is try and limit the cgroup
> > traversal to the sub-tree that contains both prev and next. Not sure
> > that is immediately applicable here, but it might be worth looking into.
>
> One option I suppose, would be to replace this:

Thanks for looking closer at this; it's a cool idea.

> +static inline void psi_sched_switch(struct task_struct *prev,
> +				    struct task_struct *next,
> +				    bool sleep)
> +{
> +	if (static_branch_likely(&psi_disabled))
> +		return;
> +
> +	/*
> +	 * Clear the TSK_ONCPU state if the task was preempted. If
> +	 * it's a voluntary sleep, dequeue will have taken care of it.
> +	 */
> +	if (!sleep)
> +		psi_task_change(prev, TSK_ONCPU, 0);
> +
> +	psi_task_change(next, 0, TSK_ONCPU);
> +}
>
> With something like:
>
> static inline void psi_sched_switch(struct task_struct *prev,
> 				    struct task_struct *next,
> 				    bool sleep)
> {
> 	struct psi_group *g, *p = NULL;
>
> 	set = TSK_ONCPU;
> 	clear = 0;
>
> 	while ((g = iterate_group(next, &g))) {
> 		u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];

[ I'm assuming you meant NR_ONCPU instead of NR_RUNNING since the
incoming task will already be runnable and all its groups will
always have NR_RUNNING elevated.

Would switching this to NR_RUNNABLE / TSK_RUNNABLE be better? ]

Anyway, I implemented this and it seems to be working quite well. When
cgroup siblings contend over a CPU, i.e. context switching doesn't
change the group state, no group updates are performed at all:

# cat /proc/self/cgroup
0::/user.slice/user-0.slice/session-c2.scope
# stress -c 64
stress: info: [216] dispatching hogs: 64 cpu, 0 io, 0 vm, 0 hdd

stress-238 [001] d..2 50.077379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-238 [001] d..2 50.077380: psi_task_switch: [stress]->stress 0/4 cgroups updated
stress-265 [003] d..2 50.078379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-245 [000] d..2 50.078379: psi_task_switch: stress->[stress] 0/4 cgroups updated
stress-265 [003] d..2 50.078380: psi_task_switch: [stress]->stress 0/4 cgroups updated
stress-245 [000] d..2 50.078380: psi_task_switch: [stress]->stress 0/4 cgroups updated

But even with otherwise no overlap in the user-created hierarchy, at
least the root group updates are avoided:

stress-261 [003] d..2 50.075265: psi_task_switch: stress->[kworker/u8:1] 0/1 cgroups updated
stress-261 [003] d..2 50.075266: psi_task_switch: [stress]->kworker/u8:1 3/4 cgroups updated
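
To make the common-ancestor cutoff concrete, here is a rough user-space
toy model in C. It is only a sketch under stated assumptions, not the
kernel patch: struct toy_group, struct toy_task, toy_task_start(),
toy_task_switch() and toy_report() are made-up names, it models a single
CPU and only the preemption case (no voluntary sleep), and it assumes
the bracketed task name in the trace marks the task whose groups are
being walked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Toy stand-in for one CPU's view of a psi group (hypothetical, not struct psi_group). */
struct toy_group {
	const char *name;
	struct toy_group *parent;	/* NULL for the root group */
	unsigned int nr_oncpu;		/* 1 if a task in this subtree holds the CPU */
};

struct toy_task {
	const char *comm;
	struct toy_group *group;	/* innermost group containing the task */
};

struct toy_result {
	int next_updated, next_total;
	int prev_updated, prev_total;
};

/* Number of groups from @g up to and including the root. */
static int toy_depth(const struct toy_group *g)
{
	int n = 0;

	for (; g; g = g->parent)
		n++;
	return n;
}

/* Put the very first task on the CPU: every ancestor gets marked. */
static void toy_task_start(struct toy_task *t)
{
	struct toy_group *g;

	for (g = t->group; g; g = g->parent)
		g->nr_oncpu++;
}

/* Preemption switch from @prev (assumed to hold the CPU) to @next. */
static struct toy_result toy_task_switch(struct toy_task *prev,
					 struct toy_task *next)
{
	struct toy_result r = { 0, toy_depth(next->group),
				0, toy_depth(prev->group) };
	struct toy_group *g, *common = NULL;

	/* Mark next's groups until one already carries prev's on-CPU state. */
	for (g = next->group; g; g = g->parent) {
		if (g->nr_oncpu) {
			common = g;
			break;
		}
		g->nr_oncpu++;
		r.next_updated++;
	}

	/* Clear prev's groups, but only those below the common ancestor. */
	for (g = prev->group; g && g != common; g = g->parent) {
		assert(g->nr_oncpu > 0);
		g->nr_oncpu--;
		r.prev_updated++;
	}
	return r;
}

/* Print the result in a shape loosely modelled on the trace lines above. */
void toy_report(const struct toy_task *prev, const struct toy_task *next,
		struct toy_result r)
{
	printf("%s->[%s] %d/%d cgroups updated\n", prev->comm, next->comm,
	       r.next_updated, r.next_total);
	printf("[%s]->%s %d/%d cgroups updated\n", prev->comm, next->comm,
	       r.prev_updated, r.prev_total);
}
```

Driving this with a four-level chain (root, user.slice, user-0.slice,
session-c2.scope) holding two stress tasks, plus a kworker placed
directly in the root group (an assumption inferred from the "0/1" in
the trace), gives 0/4 and 0/4 for a switch between the two siblings,
and 0/1 and 3/4 for stress to kworker. In both cases the walk stops at
the first group both tasks share, so the root group is never touched.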

---