Re: Hackbench pipes regression bisected to PSI

From: Johannes Weiner
Date: Mon Nov 26 2018 - 11:07:30 EST


Hi Mel,

On Mon, Nov 26, 2018 at 01:34:20PM +0000, Mel Gorman wrote:
> Hi Johannes,
>
> PSI is a great idea but it does have overhead and if enabled by Kconfig
> then it incurs a hit whether the user is aware of the feature or not. I
> think enabling by default is unnecessary as it should only be enabled if
> the information is being consumed. While the Kconfig exists, it's all or
> nothing if distributions want to have the feature available.

Yes, let's make this easier to pick and choose. Obviously I'd rather
you shipped it default-disabled than not at all.

> I've included a bisection report below showing a 6-10% regression on a
> single socket skylake machine. Would you mind doing one or all of the
> following to fix it please?
>
> a) disable it by default
> b) put psi_disable behind a static branch to move the overhead to zero
> if it's disabled
> c) optionally enable/disable at runtime (least important as at a glance,
> this may be problematic)

For a) I'd suggest we do what we do in other places that face this
vendor kernel trade-off (NUMA balancing comes to mind): one option to
build the feature, one option to set whether the default is on or off.

And b) is pretty straight-forward, let's do that too.

c) is not possible, as we need the complete task counts to calculate
pressure, and maintaining those counts are where the sched cost is.

> Last good/First bad commit
> ==========================
> Last good commit: eb414681d5a07d28d2ff90dc05f69ec6b232ebd2
> First bad commit: 2ce7135adc9ad081aa3c49744144376ac74fea60
> From 2ce7135adc9ad081aa3c49744144376ac74fea60 Mon Sep 17 00:00:00 2001
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
> Date: Fri, 26 Oct 2018 15:06:31 -0700
> Subject: [PATCH] psi: cgroup support
> On a system that executes multiple cgrouped jobs and independent
> workloads, we don't just care about the health of the overall system, but
> also that of individual jobs, so that we can ensure individual job health,
> fairness between jobs, or prioritize some jobs over others.
> This patch implements pressure stall tracking for cgroups. In kernels
> with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure, memory.pressure,
> and io.pressure files that track aggregate pressure stall times for only
> the tasks inside the cgroup.

It's curious that the cgroup support patch is the offender, not the
psi patch itself (that adds some cost as per the hackbench results,
but not as much). What kind of cgroup setup does this code run in?

Anyway, how about the following?