Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node

From: Srikar Dronamraju
Date: Tue Jun 16 2015 - 13:21:13 EST


> > @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
> > /* Only consider nodes where both task and groups benefit */
> > taskimp = task_weight(p, nid, dist) - taskweight;
> > groupimp = group_weight(p, nid, dist) - groupweight;
> > - if (taskimp < 0 && groupimp < 0)
> > + if (taskimp < 0 || groupimp < 0)
> > continue;
>
> NAK
>
> Here is the simplest possible example of a workload where
> the above change breaks things.

In which case, shouldnt we be updating the comment above.
The comment says. "/* Only consider nodes where both task and groups
benefit */". From the comment, I felt taskimp and groupimp should be
positive.

The current code is a bit confusing. There are 3 possible cases where
the current code allows us to do further updates.
1. task and group faults improving which is a good case.
2. task faults improve, group faults decrease.
3. task faults decrease, group faults increase. (mostly a good case)

case 2 and case 3 are somewhat contradictory. But our actions seem to be
the same for both. Shouldnt we have given preference to groupimp here ..
i.e shouldnt we be not updating in case 2.

Would a below check be okay?
if (groupimp < 0 || (taskimp < 0 && !groupimp))

<snipped>
>
> However, because a majority of memory accesses for
> each thread are local, your above change will result
> in the kernel not evaluating the best node for the
> process (due to a negative taskimp).
>
> Your patch works fine with the autonuma benchmark
> tests, because they do essentially all shared
> faults, not a more realistic mix of shared and private
> faults like you would see in eg. SPECjbb2005, or in a
> database or virtual machine workload.

IIUC numa02 has only private accesses while numa01 has mostly shared
faults. I plan to run SPECjbb2005 and see how they vary with the
changes.

>
> > @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
> > * and is migrating into one of the workload's active nodes, remember
> > * this node as the task's preferred numa node, so the workload can
> > * settle down.
> > - * A task that migrated to a second choice node will be better off
> > - * trying for a better one later. Do not set the preferred node here.
> > */
> > if (p->numa_group) {
> > - if (env.best_cpu == -1)
> > - nid = env.src_nid;
> > - else
> > - nid = env.dst_nid;
> > -
> > - if (node_isset(nid, p->numa_group->active_nodes))
> > + if (env.dst_nid != p->numa_preferred_nid)
> > sched_setnuma(p, env.dst_nid);
> > }
>
> NAK
>
> This can also break group convergence, by setting the
> task's preferred nid to the node of a CPU it may not
> even be migrating to.

I havent changed the parameters to sched_setnuma. Previously also
sched_setnuma was getting set to env.dst_nid. Whats changed is
evaluating nid based on env.best_cpu and using the nid to see if we
should be setting env.dst_nid as preferred node.

With this change, we are setting the preferred node more often, but we
are not changing the logic of which node gets set as preferred node.

I did wonder if the intent was to set nid as the preferred node. i.e

+ sched_setnuma(p, env.dst_nid);
- sched_setnuma(p, nid);


>
> Setting the preferred nid to a node the task should
> not be migrating to (but may, because the better ones
> are currently overloaded) prevents the task from moving
> to its preferred nid at a later time (when the good
> nodes are no longer overloaded).
>
> Have you tested this patch with any workload that does
> not consist of tasks that are running at 100% cpu time
> for the duration of the test?
>

Okay, I will try to see if I can run some workloads and get back to you.

--
Thanks and Regards
Srikar Dronamraju

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/