Re: autoNUMA web workload regression
From: Rik van Riel
Date: Wed May 06 2015 - 10:40:48 EST
CC'ing Peter & Mel. Leaving Artem's email intact so
they can read it :)
On 05/06/2015 06:35 AM, Artem Bityutskiy wrote:
> Hi Rik,
>
> we observe a tremendous regression between kernel versions 3.16 and 3.17
> (and up), and I've bisected it to this commit:
>
> a43455a sched/numa: Ensure task_numa_migrate() checks the preferred node
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a43455a1d572daf7b730fe12eb747d1e17411365
>
> We run a Web server (nginx) on a 2-socket Haswell server and we emulate
> an e-Commerce Web site. Clients send requests to the server and measure
> the response time. Clients load the server quite heavily - CPU
> utilization is more than 90% as measured with turbostat. We use Fedora 20.
>
> If I take 3.17 and revert this patch, I observe a 600% or greater average
> response time improvement compared to vanilla 3.17.
>
> If I take 4.1-rc1 and revert this patch, I observe a 300% or greater average
> response time improvement compared to vanilla 3.17.
>
> I asked Fengguang Wu to run LKP workloads on multiple 4- and 8-socket
> machines for v4.1-rc1 with and without this patch, and there seems to be
> no difference - all the micro-benchmarks performed similarly and the
> differences were within the error range.
>
> IOW, it looks like this patch has a bad effect on Web server QoS (slower
> response time). What do you think?
The changeset you found fixes the issue where both
nodes A and B are fully loaded (or overloaded), and
tasks are located on the wrong node.

Without that changeset, workloads in that situation
will never converge, because we do not consider the
best node for a task.
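To make that convergence point concrete, here is a toy illustration
(plain user-space C, not kernel code; the struct and function names are
invented for this sketch): with both nodes full, and each task sitting on
the other task's preferred node, a balancer that skips fully loaded nodes
never moves anything, while one that still evaluates a swap with the
preferred node fixes the misplacement.

/*
 * Toy model, not kernel code: two fully loaded nodes, and one task on
 * each node that prefers the other node.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	int on_node;
	int preferred_node;
};

/* Would swapping task a with task b put both on their preferred node? */
static bool swap_improves_locality(const struct task *a, const struct task *b)
{
	return a->preferred_node == b->on_node &&
	       b->preferred_node == a->on_node;
}

int main(void)
{
	/* Both nodes are full; each task sits on the wrong node. */
	struct task t0 = { .on_node = 0, .preferred_node = 1 };
	struct task t1 = { .on_node = 1, .preferred_node = 0 };

	/*
	 * A balancer that refuses to look at a node with no idle capacity
	 * never moves either task.  One that still evaluates a task swap
	 * with the preferred node converges in a single step.
	 */
	if (swap_improves_locality(&t0, &t1))
		printf("swap t0 <-> t1: both tasks land on their preferred node\n");
	else
		printf("no move considered: the tasks never converge\n");

	return 0;
}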
I have seen that changeset cause another regression
in the past, but on a much less heavily loaded
system, with around 20-50% CPU utilization and a
single-process, multi-threaded workload: there it
causes the workload to not be spread out properly
across the system.
I wonder if we should try a changeset where the
NUMA balancing code never considers moving a task
from a less busy node to a busier one, regardless
of whether the destination node is the preferred
node or some other node.
I can cook up a quick patch to test that out.
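As a rough illustration of that rule, a toy user-space sketch (not an
actual kernel patch; the struct, the function names, and the numbers are
all invented for this example): reject a NUMA migration whenever the
destination node is already busier than the source node, even when the
destination is the task's preferred node.

/*
 * Hypothetical sketch of the proposed rule, in plain user-space C.
 */
#include <stdbool.h>
#include <stdio.h>

struct node_stats {
	unsigned long load;		/* runnable load summed over the node */
	unsigned long capacity;		/* total compute capacity of the node */
};

/* Compare normalized load so differently sized nodes stay comparable. */
static bool dst_is_busier(const struct node_stats *src,
			  const struct node_stats *dst)
{
	/* dst->load / dst->capacity > src->load / src->capacity, cross-multiplied */
	return dst->load * src->capacity > src->load * dst->capacity;
}

static bool allow_numa_migration(const struct node_stats *src,
				 const struct node_stats *dst,
				 bool dst_is_preferred)
{
	/*
	 * Proposed rule: never move a task from a less busy node to a
	 * busier one, no matter whether the destination is the task's
	 * preferred node or some other node.
	 */
	if (dst_is_busier(src, dst))
		return false;

	(void)dst_is_preferred;	/* preference alone no longer overrides load */
	return true;
}

int main(void)
{
	struct node_stats src = { .load = 400, .capacity = 1000 };
	struct node_stats dst = { .load = 900, .capacity = 1000 };

	printf("migrate to a preferred but busier node? %s\n",
	       allow_numa_migration(&src, &dst, true) ? "yes" : "no");
	return 0;
}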
Any opinions, Peter or Mel?