Re: [RFC 1/2] sched: reduce migration cost between faster caches for idle_balance

From: Steven Sistare
Date: Fri Feb 09 2018 - 11:14:21 EST


On 2/8/2018 10:42 PM, Mike Galbraith wrote:
> On Thu, 2018-02-08 at 14:19 -0800, Rohit Jain wrote:
>> This patch makes idle_balance more dynamic as the sched_migration_cost
>> is now accounted on a sched_domain level. This in turn is done in
>> sd_init when we know what the topology relationships are.
>>
>> For introduction sakes cost of migration within the same core is set as
>> 0, across cores is 50 usec and across sockets is 500 usec. sysctl for
>> these variables are introduced in patch 2.
>>
>> Signed-off-by: Rohit Jain <rohit.k.jain@xxxxxxxxxx>
>> ---
>> include/linux/sched/topology.h | 1 +
>> kernel/sched/fair.c | 6 +++---
>> kernel/sched/topology.c | 5 +++++
>> 3 files changed, 9 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> index cf257c2..bcb4db2 100644
>> --- a/include/linux/sched/topology.h
>> +++ b/include/linux/sched/topology.h
>> @@ -104,6 +104,7 @@ struct sched_domain {
>> u64 max_newidle_lb_cost;
>> unsigned long next_decay_max_lb_cost;
>>
>> + u64 sched_migration_cost;
>> u64 avg_scan_cost; /* select_idle_sibling */
>>
>> #ifdef CONFIG_SCHEDSTATS
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2fe3aa8..61d3508 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -8782,8 +8782,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>> */
>> rq_unpin_lock(this_rq, rf);
>>
>> - if (this_rq->avg_idle < sysctl_sched_migration_cost ||
>> - !this_rq->rd->overload) {
>> + if (!this_rq->rd->overload) {
>> rcu_read_lock();
>> sd = rcu_dereference_check_sched_domain(this_rq->sd);
>> if (sd)
>
> Unexplained/unrelated change.

This could be explained better in the cover letter, but it is related; this and the
change below are the meat of the patch. The deleted test "this_rq->avg_idle <
sysctl_sched_migration_cost" formerly bailed based on a single global notion of
migration cost, independent of sd. Now the cost is per-sd, evaluated in the sd loop
below. The other condition to bail early, "!this_rq->rd->overload" is still relevant
and remains.

>> @@ -8804,7 +8803,8 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>> if (!(sd->flags & SD_LOAD_BALANCE))
>> continue;
>>
>> - if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost) {
>> + if (this_rq->avg_idle < curr_cost + sd->max_newidle_lb_cost +
>> + sd->sched_migration_cost) {
>> update_next_balance(sd, &next_balance);
>> break;
>> }
>
> Ditto.

The old code did not migrate if the expected costs exceeded the expected idle
time. The new code just adds the sd-specific penalty (essentially loss of cache
footprint) to the costs. The for_each_domain loop visit smallest to largest
sd's, hence visiting smallest to largest migration costs (though the tunables do
not enforce an ordering), and bails at the first sd where the total cost is a lose.

>> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
>> index 034cbed..bcd8c64 100644
>> --- a/kernel/sched/topology.c
>> +++ b/kernel/sched/topology.c
>> @@ -1148,12 +1148,14 @@ sd_init(struct sched_domain_topology_level *tl,
>> sd->flags |= SD_PREFER_SIBLING;
>> sd->imbalance_pct = 110;
>> sd->smt_gain = 1178; /* ~15% */
>> + sd->sched_migration_cost = 0;
>>
>> } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
>> sd->flags |= SD_PREFER_SIBLING;
>> sd->imbalance_pct = 117;
>> sd->cache_nice_tries = 1;
>> sd->busy_idx = 2;
>> + sd->sched_migration_cost = 500000UL;
>>
>> #ifdef CONFIG_NUMA
>> } else if (sd->flags & SD_NUMA) {
>> @@ -1162,6 +1164,7 @@ sd_init(struct sched_domain_topology_level *tl,
>> sd->idle_idx = 2;
>>
>> sd->flags |= SD_SERIALIZE;
>> + sd->sched_migration_cost = 5000000UL;
>
> That's not 500us.

Good catch, thanks. It's 5000us but should be 500. The latest version of Rohit's patch
lost a little performance vs the previous version, and this might explain why.
Re-testing may bring good news.

- Steve