Re: Lower than expected CPU pressure in PSI

From: Peter Zijlstra
Date: Sat Feb 08 2020 - 05:20:55 EST


On Fri, Feb 07, 2020 at 02:08:29PM +0100, Peter Zijlstra wrote:
> On Thu, Jan 09, 2020 at 11:16:32AM -0500, Johannes Weiner wrote:
> > On Wed, Jan 08, 2020 at 11:47:10AM -0800, Ivan Babrou wrote:
> > > We added reporting for PSI in cgroups and results are somewhat surprising.
> > >
> > > My test setup consists of 3 services:
> > >
> > > * stress-cpu1-no-contention.service : taskset -c 1 stress --cpu 1
> > > * stress-cpu2-first-half.service : taskset -c 2 stress --cpu 1
> > > * stress-cpu2-second-half.service : taskset -c 2 stress --cpu 1
> > >
> > > First service runs unconstrained, the other two compete for CPU.
> > >
> > > As expected, I can see 500ms/s sched delay for the latter two and
> > > aggregated 1000ms/s delay for /system.slice, no surprises here.
> > >
> > > However, CPU pressure reported by PSI says that none of my services
> > > have any pressure on them. I can see around 434ms/s pressure on
> > > /unified/system.slice and 425ms/s pressure on /unified cgroup, which
> > > is surprising for three reasons:
> > >
> > > * Pressure is absent for my services (I expect it to match sched delay)
> > > * Pressure on /unified/system.slice is lower than both 500ms/s and 1000ms/s
> > > * Pressure on root cgroup is lower than on system.slice
> >
> > CPU pressure is currently implemented based only on the number of
> > *runnable* tasks, not on who gets to actively use the CPU. This works
> > for contention within cgroups or at the global scope, but it doesn't
> > correctly reflect competition between cgroups. It also doesn't show
> > the effects of e.g. cpu cycle limiting through cpu.max where there
> > might *be* only one runnable task, but it's not getting the CPU.
> >
> > I've been working on fixing this, but hadn't gotten around to sending
> > the patch upstream. Attaching it below. Would you mind testing it?
> >
> > Peter, what would you think of the below?
>
> I'm not loving it; but I see what it does and I can't quickly see an
> alternative.
>
> My main gripe is doing even more of those cgroup traversals.
>
> One thing pick_next_task_fair() does is try and limit the cgroup
> traversal to the sub-tree that contains both prev and next. Not sure
> that is immediately applicable here, but it might be worth looking into.

One option, I suppose, would be to replace this:

+static inline void psi_sched_switch(struct task_struct *prev,
+				    struct task_struct *next,
+				    bool sleep)
+{
+	if (static_branch_likely(&psi_disabled))
+		return;
+
+	/*
+	 * Clear the TSK_ONCPU state if the task was preempted. If
+	 * it's a voluntary sleep, dequeue will have taken care of it.
+	 */
+	if (!sleep)
+		psi_task_change(prev, TSK_ONCPU, 0);
+
+	psi_task_change(next, 0, TSK_ONCPU);
+}

With something like:

static inline void psi_sched_switch(struct task_struct *prev,
				    struct task_struct *next,
				    bool sleep)
{
	struct psi_group *g, *p = NULL;

	set = TSK_ONCPU;
	clear = 0;

	while ((g = iterate_group(next, &g))) {
		u32 nr_running = per_cpu_ptr(g->pcpu, cpu)->tasks[NR_RUNNING];
		if (nr_running) {
			/* if set, we hit the subtree @prev lives in, terminate */
			p = g;
			break;
		}

		/* the rest of psi_task_change */
	}

	if (sleep)
		return;

	set = 0;
	clear = TSK_ONCPU;

	while ((g = iterate_group(prev, &g))) {
		if (g == p)
			break;

		/* the rest of psi_task_change */
	}
}

That way we avoid clearing and setting the common parents.
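
To make the effect concrete, here is a minimal user-space sketch of the same
walk. It is a toy model, not kernel code: struct group, struct task, the
nr_oncpu and updates fields, and switch_tasks() are all hypothetical names made
up for illustration. It also assumes each group keeps a count of on-CPU tasks
and uses that count to detect the shared ancestor, whereas the sketch above
tests tasks[NR_RUNNING].

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a psi_group: parent link plus an on-CPU count. */
struct group {
	struct group *parent;
	unsigned int nr_oncpu;	/* tasks of this subtree currently on the CPU */
	unsigned int updates;	/* how many times this group was modified */
};

/* Hypothetical stand-in for a task: only the group it belongs to. */
struct task {
	struct group *group;
};

/*
 * Move the on-CPU state from prev to next. Returns the first ancestor of
 * next that already had an on-CPU task (the subtree shared with prev), or
 * NULL if the walk reached the root without finding one.
 */
static struct group *switch_tasks(struct task *prev, struct task *next,
				  int sleep)
{
	struct group *g, *common = NULL;

	/* Set on-CPU state for next, bottom-up, until a shared group is hit. */
	for (g = next->group; g; g = g->parent) {
		if (g->nr_oncpu) {
			common = g;
			break;
		}
		g->nr_oncpu++;
		g->updates++;
	}

	/* Assumed: a voluntary sleep already cleared prev's state elsewhere. */
	if (sleep)
		return common;

	/* Clear on-CPU state for prev, stopping below the shared group. */
	for (g = prev->group; g && g != common; g = g->parent) {
		assert(g->nr_oncpu > 0);
		g->nr_oncpu--;
		g->updates++;
	}

	return common;
}
```

With a hierarchy root -> system.slice -> {a, b}, prev running in a and next
in b, only a and b are modified; system.slice and root keep their count of one
and are never touched.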