RE: [PATCH] Fix the race between smp_call_function and CPU booting

From: Liu, Chuansheng
Date: Mon Mar 19 2012 - 20:22:30 EST




> -----Original Message-----
> From: Peter Zijlstra [mailto:peterz@xxxxxxxxxxxxx]
> Sent: Monday, March 19, 2012 6:03 PM
> To: Liu, Chuansheng
> Cc: linux-kernel@xxxxxxxxxxxxxxx; Yanmin Zhang; tglx@xxxxxxxxxxxxx
> Subject: RE: [PATCH] Fix the race between smp_call_function and CPU booting
>
> On Mon, 2012-03-19 at 00:58 +0000, Liu, Chuansheng wrote:
> > Your patch advance the setting active bit before online setting, that
> > will cause an warning error,
>
> WHY!?
I have run stress tests based on your patches. The warning below is very easy to
reproduce, so I am pasting my results again. Thanks for taking the time to have a look.

The stress test runs two different scripts concurrently:
1/ An onoff_line script like the one below:
while true
do
	echo 0 > /sys/devices/system/cpu/cpu1/online
	echo 1 > /sys/devices/system/cpu/cpu1/online
done
2/ A simple sys interface, added to trigger a call to smp_call_function():
test_set()
{
	smp_call_function(...);
}

The second script writes to this interface in a loop, every 500ms, to trigger the call.
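A minimal sketch of what the second script might look like is given here. The
interface path is not named in this message, so the path is left as a parameter
and any value passed in is hypothetical. The function name trigger_loop and its
arguments are also assumptions made for illustration. Fractional sleep values
such as 0.5 rely on a sleep implementation that accepts them (GNU coreutils
does).

```shell
# Sketch only: writes to a trigger file in a loop.
# $1 = path of the test interface (hypothetical, not given in the message)
# $2 = number of writes, 0 means loop forever
# $3 = delay between writes in seconds (0.5 for the 500ms mentioned above)
trigger_loop() {
    path=$1
    count=$2
    interval=$3
    written=0
    while [ "$count" -eq 0 ] || [ "$written" -lt "$count" ]; do
        # Each write is expected to make the kernel side call
        # smp_call_function(); stop if the write fails.
        echo 1 > "$path" || return 1
        written=$((written + 1))
        sleep "$interval"
    done
}
```

To mirror the test described above, this would be started next to the
onoff_line loop, for example as: trigger_loop <interface path> 0 0.5 &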


The results are:
1/ Without any patch, the deadlock is very easy to reproduce.

2/ With your patch http://lkml.org/lkml/2011/12/15/255, the issue below is always hit, and the system hangs there.
I think it is because the booting CPU1 is set active too early, before its online bit is set, so
try_to_wake_up() ends up sending a reschedule IPI to a CPU that is not yet online.
[ 721.759736] cpu_down
[ 721.822193] LCS test smp_call_function
[ 721.864892] CPU 1 is now offline
[ 721.868270] SMP alternatives: switching to UP code
[ 721.886925] _cpu_up
[ 721.892222] SMP alternatives: switching to SMP code
[ 721.906420] Booting Node 0 Processor 1 APIC 0x1
[ 721.921177] Initializing CPU#1
[ 721.981898] ------------[ cut here ]------------
[ 721.989553] WARNING: at /root/r3_ics/hardware/intel/linux-2.6/arch/x86/kernel/smp.c:118 native_smp_send_reschedule+0x50/0x60()
[ 722.000923] Hardware name: Medfield
[ 722.004401] Modules linked in: atomisp lm3554 mt9m114 mt9e013 videobuf_vmalloc videobuf_core mac80211 cfg80211 compat btwilink st_drv
[ 722.016408] Pid: 18865, comm: workqueue_trust Not tainted 3.0.8-137166-g2639a16-dirty #1
[ 722.024486] Call Trace:
[ 722.026939] [<c1252287>] warn_slowpath_common+0x77/0x130
[ 722.032321] [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[ 722.038314] [<c121df70>] ? native_smp_send_reschedule+0x50/0x60
[ 722.044316] [<c1252362>] warn_slowpath_null+0x22/0x30
[ 722.049445] [<c121df70>] native_smp_send_reschedule+0x50/0x60
[ 722.055268] [<c124bacf>] try_to_wake_up+0x17f/0x390
[ 722.060225] [<c124bd34>] wake_up_process+0x14/0x20
[ 722.065091] [<c1277107>] kthread_stop+0x37/0x100
[ 722.069789] [<c126f5e0>] destroy_worker+0x50/0x90
[ 722.074573] [<c18c1b4d>] trustee_thread+0x3e3/0x4bf
[ 722.079524] [<c1277410>] ? wake_up_bit+0x90/0x90
[ 722.084224] [<c18c176a>] ? wait_trustee_state+0x91/0x91
[ 722.089520] [<c1276fc4>] kthread+0x74/0x80
[ 722.093694] [<c1276f50>] ? __init_kthread_worker+0x30/0x30
[ 722.099264] [<c18c7cfa>] kernel_thread_helper+0x6/0x10
[ 722.104474] ---[ end trace fa5bcc15ece677c6 ]---

3/ With my patch, the system has been running the stress test for 1 hour without hitting any issue so far.
I will keep the stress test running for much longer.
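For a long run like this, one way to check for the warning from case 2/ is to
count its occurrences in a saved kernel log. The sketch below assumes the log
was captured to a file beforehand (for example with dmesg). The function name
count_resched_warnings is made up for illustration and is not part of the
original test setup.

```shell
# Sketch only: print how many times the reschedule-IPI warning from
# native_smp_send_reschedule() appears in the kernel log file given as $1.
count_resched_warnings() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so that status is masked to keep callers simple.
    grep -c 'WARNING: at .*native_smp_send_reschedule' "$1" || true
}
```

A count of 0 after the stress test means the warning did not show up in the
captured log.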
