Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

From: Mel Gorman
Date: Thu Jun 14 2018 - 04:36:49 EST

Next message: Feng Tang: "Re: [RFC 1/2] printk: Enable platform to provide a early boot clock"
Previous message: Linus Walleij: "Re: [PATCH] pinctrl: actions: Fix uninitialized error in owl_pin_config_set()"
In reply to: Mel Gorman: "Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks"
Next in thread: Mel Gorman: "Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks"

On Mon, Jun 11, 2018 at 06:07:58PM +0200, Jirka Hladky wrote:
> >
> > Fixing any part of it for STREAM will end up regressing something else.
>
>
> I fully understand that. We run a set of benchmarks and we always look at
> the results as the ensemble. Looking only at one benchmark would be
> completely wrong.
>

Indeed

> And in fact, we do see regression on NAS benchmark going from 4.16 to 4.17
> kernel as well. On 4 NUMA node server with Xeon Gold CPUs we see the
> regression around 26% for ft_C, 35% for mg_C_x and 25% for sp_C_x. The
> biggest regression is with 32 threads (the box has 96 CPUs in total). I
> have not yet tried if it's linked to
> 2c83362734dad8e48ccc0710b5cd2436a0323893. I will do that testing
> tomorrow.
>

It would be worthwhile. However, it's also worth noting that 32 threads
out of 96 implies that 4 nodes would not be evenly used and it may
account for some of the discrepancy. ft and mg for C class are typically
short-lived on modern hardware and sp is not particularly long-lived
either. Hence, they are most likely to see problems with a patch that
avoids spreading tasks across the machine early. Admittedly, I have not
seen similar slowdowns but NAS has a lot of configuration options.
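
To make the arithmetic concrete, here is a rough sketch of the two extreme
placements of 32 threads on this box. It assumes the 96 CPUs are split evenly
across the 4 nodes (24 each), which the report does not state, and the helper
names are hypothetical. The real scheduler sits somewhere between these
extremes, so treat it as an illustration only.

```python
def pack_threads(threads, nodes, cpus_per_node):
    """Fill one node completely before using the next one."""
    placement = []
    remaining = threads
    for _ in range(nodes):
        used = min(remaining, cpus_per_node)
        placement.append(used)
        remaining -= used
    return placement


def spread_threads(threads, nodes):
    """Distribute threads as evenly as possible across all nodes."""
    base, extra = divmod(threads, nodes)
    return [base + (1 if i < extra else 0) for i in range(nodes)]


# Figures from the report; the even split per node is an assumption.
CPUS, NODES, THREADS = 96, 4, 32
cpus_per_node = CPUS // NODES

packed = pack_threads(THREADS, NODES, cpus_per_node)
spread = spread_threads(THREADS, NODES)
print("packed:", packed)
print("spread:", spread)
```

In the packed case two nodes run no threads at all, so any memory that lands
there is remote for every thread until it is migrated.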

In terms of the speed of migration, it may be worth checking how often the
mm_numa_migrate_ratelimit tracepoint is triggered, with bonus points for
using nr_pages to calculate how many pages get throttled from migrating. If
it triggers at a high frequency then you could test increasing
ratelimit_pages (which is set at compile time despite not being a macro).
It still may not work for tasks that are too short-lived to have enough
time to identify a misplacement and migrate.
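
A minimal sketch of the counting, in case it helps. It assumes the trace has
already been captured as text and that each event line carries dst_nid= and
nr_pages= fields after the tracepoint name; please check the event's format
file on the kernel under test, as the exact layout may differ. The function
name and the sample lines in the usage note are made up for illustration.

```python
import re
from collections import Counter

# Assumed layout of one event line in the text trace, e.g.
#   ... mm_numa_migrate_ratelimit: comm=ft.C.x pid=1234 dst_nid=2 nr_pages=512
EVENT_RE = re.compile(
    r"mm_numa_migrate_ratelimit:.*?\bdst_nid=(?P<nid>-?\d+)\s+nr_pages=(?P<pages>\d+)"
)


def summarize_ratelimit(lines):
    """Return (event count, total throttled pages, pages per destination node)."""
    events = 0
    pages_per_node = Counter()
    for line in lines:
        match = EVENT_RE.search(line)
        if match is None:
            continue  # some other event, or a header line
        events += 1
        pages_per_node[int(match.group("nid"))] += int(match.group("pages"))
    return events, sum(pages_per_node.values()), dict(pages_per_node)
```

Feeding it an open trace file, for example summarize_ratelimit(open(path)),
gives the number of times the rate limit was hit and the number of pages
involved. Dividing the event count by the benchmark runtime gives the
frequency.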

--
Mel Gorman
SUSE Labs
