Re: [PATCH v2 3/4] sched:Fix task_numa_migrate to always update preferred node

From: Rik van Riel
Date: Tue Jun 16 2015 - 14:25:51 EST


On 06/16/2015 01:19 PM, Srikar Dronamraju wrote:
>>> @@ -1503,7 +1503,7 @@ static int task_numa_migrate(struct task_struct *p)
>>> /* Only consider nodes where both task and groups benefit */
>>> taskimp = task_weight(p, nid, dist) - taskweight;
>>> groupimp = group_weight(p, nid, dist) - groupweight;
>>> - if (taskimp < 0 && groupimp < 0)
>>> + if (taskimp < 0 || groupimp < 0)
>>> continue;
>>
>> NAK
>>
>> Here is the simplest possible example of a workload where
>> the above change breaks things.
>
> In which case, shouldnt we be updating the comment above.
> The comment says. "/* Only consider nodes where both task and groups
> benefit */". From the comment, I felt taskimp and groupimp should be
> positive.

You are correct. That comment does need fixing.

> The current code is a bit confusing. There are 3 possible cases where
> the current code allows us to do further updates.
> 1. task and group faults improving which is a good case.
> 2. task faults improve, group faults decrease.
> 3. task faults decrease, group faults increase. (mostly a good case)
>
> case 2 and case 3 are somewhat contradictory. But our actions seem to be
> the same for both. Shouldnt we have given preference to groupimp here ..
> i.e shouldnt we be not updating in case 2.

If you look at task_numa_compare(), you will find that
case (2) and (3) are handled for different situations.

If the task faults improve, but group faults decrease,
then a task swap is only done if we are considering
the task score for the task, which is done when comparing
it against another task in the same group.

In other words, if task A and task B are part of the
same numa_group NG, then we will look at the task score
for the tasks.

Case (3), where the numa group placement improves, is
important to help tasks in a numa_group move towards
each other in the system.

If we have a 2-node system with 2 large active numa groups,
and multiple tasks in each group, we want each group to
end up on its own numa node.

This can only happen if we sometimes move tasks to a place
where they have a lower task score, but a higher group score.


> Would a below check be okay?
> if (groupimp < 0 || (taskimp < 0 && !groupimp))

No, for reasons described above.

We want to retain the ability to place tasks correctly within
a numa_group.

> <snipped>
>>
>> However, because a majority of memory accesses for
>> each thread are local, your above change will result
>> in the kernel not evaluating the best node for the
>> process (due to a negative taskimp).
>>
>> Your patch works fine with the autonuma benchmark
>> tests, because they do essentially all shared
>> faults, not a more realistic mix of shared and private
>> faults like you would see in eg. SPECjbb2005, or in a
>> database or virtual machine workload.
>
> IIUC numa02 has only private accesses while numa01 has mostly shared
> faults. I plan to run SPECjbb2005 and see how they vary with the
> changes.

In this context, "private" and "shared" refers to the NUMA faults
experienced by the kernel.

"private" means the same thread is accessing the same pages over
and over again, without (many) other accesses to those pages

"shared" means the same pages are accessed by various threads (in
the numa group)

>>> @@ -1519,16 +1519,9 @@ static int task_numa_migrate(struct task_struct *p)
>>> * and is migrating into one of the workload's active nodes, remember
>>> * this node as the task's preferred numa node, so the workload can
>>> * settle down.
>>> - * A task that migrated to a second choice node will be better off
>>> - * trying for a better one later. Do not set the preferred node here.
>>> */
>>> if (p->numa_group) {
>>> - if (env.best_cpu == -1)
>>> - nid = env.src_nid;
>>> - else
>>> - nid = env.dst_nid;
>>> -
>>> - if (node_isset(nid, p->numa_group->active_nodes))
>>> + if (env.dst_nid != p->numa_preferred_nid)
>>> sched_setnuma(p, env.dst_nid);
>>> }
>>
>> NAK
>>
>> This can also break group convergence, by setting the
>> task's preferred nid to the node of a CPU it may not
>> even be migrating to.
>
> I havent changed the parameters to sched_setnuma. Previously also
> sched_setnuma was getting set to env.dst_nid. Whats changed is
> evaluating nid based on env.best_cpu and using the nid to see if we
> should be setting env.dst_nid as preferred node.
>
> With this change, we are setting the preferred node more often,

That is exactly the problem.

Setting p->numa_preferred_nid means the task will not try to move
to a better numa node frequently, with this code from task_numa_fault():

/*
* Retry task to preferred node migration periodically, in case it
* case it previously failed, or the scheduler moved us.
*/
if (time_after(jiffies, p->numa_migrate_retry))
numa_migrate_preferred(p);

Your change will cause numa_migrate_preferred() to do nothing, despite
the fact that task_numa_migrate placed the task in a location that is
known to be sub-optimal.

The only situation in which we do want to settle for a sub-optimal
location, is if that sub-optimal node is part of the numa_group's
active_nodes set.

The normal place where p->numa_preferred_nid gets set should be
task_numa_placement(), not task_numa_migrate(), which only sets
it in the above exceptional circumstance.

> I did wonder if the intent was to set nid as the preferred node. i.e
>
> + sched_setnuma(p, env.dst_nid);
> - sched_setnuma(p, nid);

Yes, that change would be correct. You did identify a real
bug in task_numa_migrate().

Thank you for going over the code with a fine comb, trying to
find issues.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/