Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO

From: Johannes Weiner
Date: Wed Jul 18 2018 - 17:54:02 EST

In reply to: Peter Zijlstra: "Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO"
Next in thread: Peter Zijlstra: "Re: [PATCH 08/10] psi: pressure stall information for CPU, memory, and IO"

On Tue, Jul 17, 2018 at 12:03:47PM +0200, Peter Zijlstra wrote:
> This is still a scary amount of accounting; not to mention you'll be
> adding O(cgroup-depth) to this in a later patch.
>
> Where are the performance numbers for all this?

I benchmarked it using our two most scheduling-sensitive workloads:
memcache and webserver. They handle a ton of small requests - lots of
wakeups and sleeps with little actual work in between - so they tend
to be canaries for scheduler regressions.

In the tests, the boxes were handling live traffic over the course of
several hours. Half the machines, the control, ran with CONFIG_PSI=n.

For memcache I used eight machines total. They're 2-socket, 14 core,
56 thread boxes. The test runs for half the test period, flips the
test and control kernels on the hardware to rule out HW factors, DC
location etc., then runs the other half of the test.

For the webservers, I used 32 machines total. They're single socket,
16 core, 32 thread machines.

During the memcache test, CPU load was nopsi=78.05% psi=78.98% in the
first half and nopsi=77.52% psi=78.25% in the second half, so psi
added between 0.7 and 0.9 percentage points to the CPU load, a
relative increase of about 1%.
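
The arithmetic behind those two figures can be checked with a short
sketch. The numbers are the ones quoted above; the dictionary layout,
the "second half" label and the helper name `overhead` are only
illustrative assumptions, not part of the benchmark tooling.

```python
# CPU load (%) reported for the memcache test, per test half.
runs = {
    "first half": {"nopsi": 78.05, "psi": 78.98},
    "second half": {"nopsi": 77.52, "psi": 78.25},
}


def overhead(nopsi, psi):
    """Return (delta in percentage points, delta relative to nopsi in %)."""
    delta_pp = psi - nopsi
    return delta_pp, 100.0 * delta_pp / nopsi


# Map each test half to its (absolute, relative) overhead.
results = {name: overhead(**loads) for name, loads in runs.items()}

for name, (pp, rel) in results.items():
    print(f"{name}: +{pp:.2f} percentage points, +{rel:.2f}% relative")
```

The absolute deltas are the 0.7 to 0.9 percentage points mentioned
above; dividing each by its nopsi baseline gives the relative figure
of roughly 1%.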

As far as end-to-end request latency from the client perspective goes,
we don't sample those finely enough to capture the requests going to
those particular machines during the test, but we know the p50
turnaround time in this workload is 54us, and perf bench sched pipe on
those machines shows nopsi=5.232666 us/op and psi=5.587347 us/op, so
this doesn't add much here either.
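
To put the pipe numbers next to the p50 figure, here is a rough sketch
of the comparison. It uses only the values quoted above; the constant
names are made up for illustration, and setting the per-op delta
against the 54us turnaround assumes a request pays roughly one such
op, which is a simplification rather than something that was measured.

```python
# perf bench sched pipe results, in microseconds per operation.
NOPSI_US_PER_OP = 5.232666
PSI_US_PER_OP = 5.587347

# p50 request turnaround time of the memcache workload, in microseconds.
P50_TURNAROUND_US = 54.0

# Extra time per pipe operation with psi enabled.
added_us = PSI_US_PER_OP - NOPSI_US_PER_OP

# The same delta relative to the nopsi baseline.
relative_pct = 100.0 * added_us / NOPSI_US_PER_OP

# The delta as a share of the p50 turnaround (assumes ~one op per request).
share_of_p50_pct = 100.0 * added_us / P50_TURNAROUND_US

print(f"added per op:      {added_us:.3f} us")
print(f"relative to nopsi: {relative_pct:.2f}%")
print(f"share of p50:      {share_of_p50_pct:.2f}%")
```

The microbenchmark is close to a worst case, since it does nothing but
wakeups and sleeps; the delta is well under 1% of the p50 turnaround
under the one-op assumption.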

The profile for the pipe benchmark shows:

0.87% sched-pipe [kernel.vmlinux] [k] psi_group_change
0.83% perf.real [kernel.vmlinux] [k] psi_group_change
0.82% perf.real [kernel.vmlinux] [k] psi_task_change
0.58% sched-pipe [kernel.vmlinux] [k] psi_task_change
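
For a per-function view, the four rows can be added up as in the
sketch below. This assumes the percentages are self time (not
inclusive of children), so that adding rows does not double count;
the tuple layout is just an illustrative way to hold the rows.

```python
from collections import defaultdict

# (percent, command, symbol) rows copied from the profile above.
samples = [
    (0.87, "sched-pipe", "psi_group_change"),
    (0.83, "perf.real", "psi_group_change"),
    (0.82, "perf.real", "psi_task_change"),
    (0.58, "sched-pipe", "psi_task_change"),
]

# Sum the percentages per kernel symbol across both commands.
by_symbol = defaultdict(float)
for pct, _comm, symbol in samples:
    by_symbol[symbol] += pct

total = sum(by_symbol.values())

for symbol, pct in by_symbol.items():
    print(f"{pct:.2f}%  {symbol}")
print(f"{total:.2f}%  total psi")
```

If the rows were inclusive percentages instead, the total would
overstate the cost and should not be read as a sum.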

The webserver load is running inside 4 nested cgroup levels. The CPU
load with both nopsi and psi kernels was indistinguishable at 81%.

For comparison, we had to disable the cgroup cpu controller on the
webservers because it added 4 percentage points to the CPU% during
this same exact test.

Versions of this accounting code now run on 80% of our fleet. None of
our workloads have reported regressions during the rollout.

[ Also note that the webservers that tested the nopsi kernel were
during that time susceptible to swap storms, memory livelocks, and
eventual hard resets because without psi they couldn't run our full
resource isolation stack that would prevent that ;) ]

Let me know if there are other tests I could run.
