Re: [PATCH] power, sched: stop updating inside arch_update_cpu_topology() when nothing to be update
From: Michael wang
Date: Mon Apr 07 2014 - 22:40:41 EST
Hi, Srivatsa
It's nice to have you confirm the fix, and thanks for the well-written
comments; I will apply them and send out the new patch later :)
Regards,
Michael Wang
On 04/07/2014 06:15 PM, Srivatsa S. Bhat wrote:
> Hi Michael,
>
> On 04/02/2014 08:59 AM, Michael wang wrote:
>> During the testing, we encounter below WARN followed by Oops:
>>
>> WARNING: at kernel/sched/core.c:6218
>> ...
>> NIP [c000000000101660] .build_sched_domains+0x11d0/0x1200
>> LR [c000000000101358] .build_sched_domains+0xec8/0x1200
>> PACATMSCRATCH [800000000000f032]
>> Call Trace:
>> [c00000001b103850] [c000000000101358] .build_sched_domains+0xec8/0x1200
>> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
>> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
>> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
>> ...
>> Oops: Kernel access of bad area, sig: 11 [#1]
>> ...
>> NIP [c00000000045c000] .__bitmap_weight+0x60/0xf0
>> LR [c00000000010132c] .build_sched_domains+0xe9c/0x1200
>> PACATMSCRATCH [8000000000029032]
>> Call Trace:
>> [c00000001b1037a0] [c000000000288ff4] .kmem_cache_alloc_node_trace+0x184/0x3a0
>> [c00000001b103850] [c00000000010132c] .build_sched_domains+0xe9c/0x1200
>> [c00000001b1039a0] [c00000000010aad4] .partition_sched_domains+0x484/0x510
>> [c00000001b103aa0] [c00000000016d0a8] .rebuild_sched_domains+0x68/0xa0
>> [c00000001b103b30] [c00000000005cbf0] .topology_work_fn+0x10/0x30
>> ...
>>
>> This was caused by 'sd->groups == NULL' after building groups, which
>> in turn was caused by an empty 'sd->span'.
>>
>> The cpu's domain contain nothing because the cpu was assigned to wrong
>> node inside arch_update_cpu_topology() by calling update_lookup_table()
>> with the uninitialized param, in the case when there is nothing to be
>> update.
>>
>
> Can you reword the above paragraph to something like this:
>
> The cpu's domain contained nothing because the cpu was assigned to a wrong
> node, due to the following unfortunate sequence of events:
>
> 1. The hypervisor sent a topology update to the guest OS, to notify changes
> to the cpu-node mapping. However, the update was actually redundant - i.e.,
> the "new" mapping was exactly the same as the old one.
>
> 2. Due to this, the 'updated_cpus' mask turned out to be empty after exiting
> the 'for-loop' in arch_update_cpu_topology().
>
> 3. So we ended up calling stop_machine() with an empty cpumask list, which made
> stop_machine() internally elect cpumask_first(cpu_online_mask), i.e., CPU0 as
> the cpu to run the payload (the update_cpu_topology() function).
>
> 4. This causes update_cpu_topology() to be run by CPU0. And since 'updates'
> is kzalloc()'ed inside arch_update_cpu_topology(), update_cpu_topology()
> finds update->cpu as well as update->new_nid to be 0. In other words, we
> end up assigning CPU0 (and eventually its siblings) to node 0, incorrectly.
>
> This causes the sched-domain rebuild code to break and crash the system.
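
To make the four steps above concrete, here is a small userspace model of
the sequence. It is only a sketch under stated assumptions: the struct, the
bitmask standing in for a cpumask, and every function name below are
hypothetical stand-ins rather than the kernel's code, calloc() plays the
role of kzalloc(), and the payload is assumed to act only on the entry that
names the CPU it runs on.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 8

/* Hypothetical stand-in for one entry of the 'updates' array. */
struct topology_update {
	int cpu;
	int new_nid;
};

/* Toy state: every CPU starts on node 1; bit N of a mask means CPU N. */
static int cpu_to_node[NR_CPUS] = { 1, 1, 1, 1, 1, 1, 1, 1 };
static const unsigned int online_mask = 0xffu;

/* Lowest set bit, or NR_CPUS if the mask is empty. */
static int mask_first(unsigned int mask)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++)
		if (mask & (1u << cpu))
			return cpu;
	return NR_CPUS;
}

/* Payload model (assumption): the running CPU applies the entry naming it. */
static void payload(int this_cpu, const struct topology_update *u)
{
	if (u->cpu == this_cpu)
		cpu_to_node[this_cpu] = u->new_nid;
}

/* Step 3: an empty mask falls back to the first online CPU. */
static int run_payload_on(unsigned int mask, const struct topology_update *u)
{
	int runner = mask_first(mask ? mask : online_mask);

	assert(runner < NR_CPUS);
	payload(runner, u);
	return runner;
}

/* Returns the node CPU0 ends up on after a redundant update. */
int simulate_spurious_update(void)
{
	/* calloc() zeroes the entry, like kzalloc(): cpu == 0, new_nid == 0 */
	struct topology_update *updates = calloc(1, sizeof(*updates));
	unsigned int updated_cpus = 0;	/* step 2: nothing really changed */

	if (!updates)
		return -1;

	run_payload_on(updated_cpus, updates);	/* steps 3 and 4 */
	free(updates);
	return cpu_to_node[0];
}
```

In this model CPU0 starts on node 1 and simulate_spurious_update() leaves it
on node 0, although no update ever asked for that. The model runs the payload
on a single CPU only, which is a simplification.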
>
>
>> Thus we should stop the updating in such cases, this patch will achieve
>> this and fix the issue.
>>
>
> We can reword this part as:
>
> Fix this by skipping the topology update in cases where we find that
> the topology has not actually changed (i.e., spurious updates).
>
>> CC: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>
>> CC: Paul Mackerras <paulus@xxxxxxxxx>
>> CC: Nathan Fontenot <nfont@xxxxxxxxxxxxxxxxxx>
>> CC: Stephen Rothwell <sfr@xxxxxxxxxxxxxxxx>
>> CC: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
>> CC: Robert Jennings <rcj@xxxxxxxxxxxxxxxxxx>
>> CC: Jesse Larrew <jlarrew@xxxxxxxxxxxxxxxxxx>
>> CC: "Srivatsa S. Bhat" <srivatsa.bhat@xxxxxxxxxxxxxxxxxx>
>> CC: Alistair Popple <alistair@xxxxxxxxxxxx>
>> Signed-off-by: Michael Wang <wangyun@xxxxxxxxxxxxxxxxxx>
>> ---
>> arch/powerpc/mm/numa.c | 9 +++++++++
>> 1 file changed, 9 insertions(+)
>>
>> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> index 30a42e2..6757690 100644
>> --- a/arch/powerpc/mm/numa.c
>> +++ b/arch/powerpc/mm/numa.c
>> @@ -1591,6 +1591,14 @@ int arch_update_cpu_topology(void)
>> cpu = cpu_last_thread_sibling(cpu);
>> }
>>
>> + /*
>> + * The 'cpu_associativity_changes_mask' could be cleared if
>> + * all the cpus it indicates won't change their node, in
>> + * which case the 'updated_cpus' will be empty.
>> + */
>
> How about rewording the comment like this:
>
> In cases where we have nothing to update (because the updates list
> is too short or because the new topology is the same as the old one),
> skip invoking update_cpu_topology() via stop_machine(). This is
> necessary (and not just a fast-path optimization) because stop_machine()
> can end up electing a random CPU to run update_cpu_topology(), and
> thus trick us into setting up incorrect cpu-node mappings (since
> 'updates' is kzalloc()'ed).
>
> Regards,
> Srivatsa S. Bhat
>
>> + if (!cpumask_weight(&updated_cpus))
>> + goto out;
>> +
>> stop_machine(update_cpu_topology, &updates[0], &updated_cpus);
>>
>> /*
>> @@ -1612,6 +1620,7 @@ int arch_update_cpu_topology(void)
>> changed = 1;
>> }
>>
>> +out:
>> kfree(updates);
>> return changed;
>> }
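
The control flow of the hunk above (bail out to a common 'out:' label when
the collected mask is empty, so the zeroed 'updates' buffer is freed without
ever being applied) can be modeled in userspace roughly like this. Again this
is a sketch, not the kernel code: the struct, the bitmask, the reported_nid
argument, and the function names are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define NR_CPUS 4

/* Hypothetical stand-in for one entry of the 'updates' array. */
struct topology_update {
	int cpu;
	int new_nid;
};

/* Toy cpu -> node table; bit N of a mask means CPU N. */
static int cpu_to_node[NR_CPUS] = { 1, 1, 2, 2 };

/* Number of set bits, playing the role of cpumask_weight(). */
static int mask_weight(unsigned int mask)
{
	int weight = 0;

	for (; mask; mask >>= 1)
		weight += mask & 1u;
	return weight;
}

/* reported_nid[cpu] is the node the (modeled) hypervisor reports for cpu. */
int model_update_topology(const int *reported_nid)
{
	struct topology_update *updates = calloc(NR_CPUS, sizeof(*updates));
	unsigned int updated_cpus = 0;
	int changed = 0, n = 0;

	if (!updates)
		return 0;

	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (reported_nid[cpu] == cpu_to_node[cpu])
			continue;	/* redundant: same node as before */
		updates[n].cpu = cpu;
		updates[n].new_nid = reported_nid[cpu];
		n++;
		updated_cpus |= 1u << cpu;
	}

	/* The guard: with nothing collected, never touch the zeroed buffer. */
	if (!mask_weight(updated_cpus))
		goto out;

	for (int i = 0; i < n; i++)
		cpu_to_node[updates[i].cpu] = updates[i].new_nid;
	changed = 1;

out:
	free(updates);
	return changed;
}
```

With a report identical to the current table the function returns 0 and the
table is untouched; with a report that moves CPU3 to node 1 it returns 1 and
only CPU3 changes. In the kernel itself, cpumask_empty() expresses the same
test as !cpumask_weight().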
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>