Re: [PATCH v2 3/4] sched: Fix task_numa_migrate to always update preferred node

From: Rik van Riel
Date: Tue Jun 16 2015 - 10:54:25 EST


On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7b23efa..d1aa374 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
>  		/* Only consider nodes where both task and groups benefit */
>  		taskimp = task_weight(p, nid, dist) - taskweight;
>  		groupimp = group_weight(p, nid, dist) - groupweight;
> -		if (taskimp < 0 && groupimp < 0)
> +		if (taskimp < 0 || groupimp < 0)
>  			continue;

NAK

Here is the simplest possible example of a workload where
the above change breaks things.

1) A two-node system
2) Two processes, with two threads each
3) Two-thirds of memory accesses are private faults,
the other third shared faults (shared memory
with the other thread)

Within workload A, we have threads T1 and T2, each
running on a different NUMA node (N1 and N2). The same
is true inside workload B.

Each thread accesses most of its memory on its own node,
but also does a lot of shared memory accesses with the
other thread.

Things would run faster if each process had both of
its threads on the same NUMA node.

This involves swapping around the threads of the two
workloads, moving a thread of each process to the node
where the whole process has the highest group_weight.

However, because the majority of each thread's memory
accesses are local, taskimp is negative for the other
node, and with your change the kernel will never evaluate
the best node for the process, even though groupimp is
positive there.
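
To make that concrete, here is a minimal userspace sketch of the
node filter; the improvement values are made up, but the two
conditions are the ones from the hunk above.

/* Minimal sketch of the node filter, with made-up improvement values. */
#include <stdio.h>

int main(void)
{
	/*
	 * Two-thirds of this thread's faults are local, so moving to the
	 * other node looks worse for the task alone (taskimp < 0), but
	 * better for the group as a whole (groupimp > 0).
	 */
	long taskimp = -20;
	long groupimp = 40;

	/* Current code: skip a node only if both task and group lose. */
	if (taskimp < 0 && groupimp < 0)
		printf("&&: node skipped\n");
	else
		printf("&&: node evaluated, group gain can justify the move\n");

	/* Proposed change: skip a node if either loses. */
	if (taskimp < 0 || groupimp < 0)
		printf("||: node skipped, best node for the group never considered\n");
	else
		printf("||: node evaluated\n");

	return 0;
}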

Your patch works fine with the autonuma benchmark
tests, because they do essentially all shared
faults, not the more realistic mix of shared and private
faults you would see in e.g. SPECjbb2005, or in a
database or virtual machine workload.

> @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
>  	 * and is migrating into one of the workload's active nodes, remember
>  	 * this node as the task's preferred numa node, so the workload can
>  	 * settle down.
> -	 * A task that migrated to a second choice node will be better off
> -	 * trying for a better one later. Do not set the preferred node here.
>  	 */
>  	if (p->numa_group) {
> -		if (env.best_cpu == -1)
> -			nid = env.src_nid;
> -		else
> -			nid = env.dst_nid;
> -
> -		if (node_isset(nid, p->numa_group->active_nodes))
> +		if (env.dst_nid != p->numa_preferred_nid)
>  			sched_setnuma(p, env.dst_nid);
>  	}

NAK

This can also break group convergence by setting the
task's preferred nid to the node of a CPU the task may
not even be migrating to.

Setting the preferred nid to a node the task should
not be migrating to (but may migrate to anyway, because
the better nodes are currently overloaded) prevents the
task from moving to its preferred nid at a later time,
when the good nodes are no longer overloaded.
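
For reference, a simplified userspace sketch of the guard your
patch removes; node_isset() and the group's active_nodes nodemask
are modelled with a plain bitmask here, and the node and CPU
numbers are made up.

/* Simplified sketch of the guard being removed; values are made up. */
#include <stdbool.h>
#include <stdio.h>

static bool node_isset_sketch(int nid, unsigned long active_nodes)
{
	return active_nodes & (1UL << nid);
}

int main(void)
{
	unsigned long active_nodes = 1UL << 1;	/* group active on node 1 only */
	int src_nid = 0, dst_nid = 2;		/* second-choice destination */
	int best_cpu = 8;			/* a CPU was found on dst_nid */
	int nid;

	/* Old logic: only settle on a node the group is actually using. */
	nid = (best_cpu == -1) ? src_nid : dst_nid;

	if (node_isset_sketch(nid, active_nodes))
		printf("set preferred node to %d\n", dst_nid);
	else
		printf("node %d is a second choice, try for a better one later\n",
		       dst_nid);

	return 0;
}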

Have you tested this patch with any workload that does
not consist of tasks running at 100% CPU time for the
duration of the test?