Re: RCU CPU stall console spews leads to soft lockup disabled is reasonable ?
From: Zhang Zhen
Date: Wed Jan 21 2015 - 01:54:56 EST
On 2015/1/21 11:13, Zhang Zhen wrote:
> On 2015/1/21 10:26, Zhang Zhen wrote:
>> On 2015/1/20 23:25, Don Zickus wrote:
>>> On Tue, Jan 20, 2015 at 11:09:19AM +0800, Zhang Zhen wrote:
>>>>
>>>>> Of course back then, touch_nmi_watchdog touched all cpus. So a problem
>>>>> like this was masked. I believe this upstream commit 62572e29bc53, solved
>>>>> the problem.
>>>>
>>>> Thanks for your suggestion.
>>>>
>>>> Commit 62572e29bc53 changed the semantics of touch_nmi_watchdog and make it
>>>> only touch local cpu not every one.
>>>> But watchdog_nmi_touch = true only guarantee no hard lockup check on this cpu.
>>>>
>>>> Commit 62572e29bc53 didn't changed the semantics of touch_softlockup_watchdog.
>>>
>>> Ah, yes. I reviewed the commit to quickly yesterday. I thought
>>> touch_softlockup_watchdog was called on every cpu and that commit changed
>>> it to the local cpu. But that was incorrect.
>>>
>>>>>
>>>>> You can apply that commit and see if you if you get both RCU stall
>>>>> messages _and_ softlockup messages. I believe that is what you were
>>>>> expecting, correct?
>>>>>
>>>> Correct, i expect i can get both RCU stall messages _and_ softlockup messages.
>>>> I applied that commit, and i only got RCU stall messages.
>>>
>>> Hmm, I believe the act of printing to the console calls touch_nmi_watchdog
>>> which calls touch_softlockup_watchdog. I think that is the problem there.
>>>
>> Yeah, you are right.
>>
>>> This may not cause other problems but what happens if you comment out the
>>> 'touch_softlockup_watchdog' from the touch_nmi_watchdog function like
>>> below (based on latest upstream cb59670870)?
>>>
>>> The idea is that console printing for that cpu won't reset the softlockup
>>> detector. Again other bad things might happen and this patch may not be a
>>> good final solution, but it can help give me a clue about what is going
>>> on.
>>
>> I commented out the 'touch_softlockup_watchdog' from the touch_nmi_watchdog function
>> (based on latest upstream ec6f34e5b552).
>> This triggered RCU stall and softlockup, but softlockup just printk only once.
>> As you mentioned "other bad things" lead to softlockup just printk only once.
>> What's the bad thing ?
>>
>> / #
>> / # echo 60 > /proc/sys/kernel/watchdog_thresh
>> / # busybox insmod softlockup_test.ko &
>> / # [ 39.044058] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=21002 jiffies, g=78, c=77, q=4)
>> [ 39.044058] INFO: Stall ended before state dump start
>> [ 102.049045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=84007 jiffies, g=78, c=77, q=4)
>> [ 102.049045] INFO: Stall ended before state dump start
>> [ 160.482123] BUG: soft lockup - CPU#0 stuck for 134s! [busybox:54]
>> [ 165.054075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=147012 jiffies, g=78, c=77, q=4)
>> [ 165.054075] INFO: Stall ended before state dump start
>> [ 228.059045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=210017 jiffies, g=78, c=77, q=4)
>> [ 228.059045] INFO: Stall ended before state dump start
>> [ 291.064099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=273022 jiffies, g=78, c=77, q=4)
>> [ 291.064099] INFO: Stall ended before state dump start
>> [ 354.069074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=336027 jiffies, g=78, c=77, q=4)
>> [ 354.069099] INFO: Stall ended before state dump start
>> [ 417.074045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=399032 jiffies, g=78, c=77, q=4)
>> [ 417.074045] INFO: Stall ended before state dump start
>> [ 480.079099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=462037 jiffies, g=78, c=77, q=4)
>> [ 480.079099] INFO: Stall ended before state dump start
>> [ 543.084099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=525042 jiffies, g=78, c=77, q=4)
>> [ 543.084099] INFO: Stall ended before state dump start
>> [ 606.089101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=588047 jiffies, g=78, c=77, q=4)
>> [ 606.089101] INFO: Stall ended before state dump start
>> [ 669.094099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=651052 jiffies, g=78, c=77, q=4)
>> [ 669.094099] INFO: Stall ended before state dump start
>> [ 732.099045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=714057 jiffies, g=78, c=77, q=4)
>> [ 732.099045] INFO: Stall ended before state dump start
>> [ 795.104074] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=777062 jiffies, g=78, c=77, q=4)
>> [ 795.104098] INFO: Stall ended before state dump start
>> [ 858.109046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=840067 jiffies, g=78, c=77, q=4)
>> [ 858.109046] INFO: Stall ended before state dump start
>> [ 921.114100] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=903072 jiffies, g=78, c=77, q=4)
>> [ 921.114100] INFO: Stall ended before state dump start
>> [ 984.119099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=966077 jiffies, g=78, c=77, q=4)
>> [ 984.119099] INFO: Stall ended before state dump start
>> [ 1047.124075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1029082 jiffies, g=78, c=77, q=4)
>> [ 1047.124099] INFO: Stall ended before state dump start
>> [ 1110.129046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1092087 jiffies, g=78, c=77, q=4)
>> [ 1110.129046] INFO: Stall ended before state dump start
>> [ 1173.134045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1155092 jiffies, g=78, c=77, q=4)
>> [ 1173.134045] INFO: Stall ended before state dump start
>> [ 1236.139101] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1218097 jiffies, g=78, c=77, q=4)
>> [ 1236.139101] INFO: Stall ended before state dump start
>> [ 1299.144059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1281102 jiffies, g=78, c=77, q=4)
>> [ 1299.144059] INFO: Stall ended before state dump start
>> [ 1362.149099] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1344107 jiffies, g=78, c=77, q=4)
>> [ 1362.149099] INFO: Stall ended before state dump start
>> [ 1425.154059] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1407112 jiffies, g=78, c=77, q=4)
>> [ 1425.154059] INFO: Stall ended before state dump start
>> [ 1488.159046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1470117 jiffies, g=78, c=77, q=4)
>> [ 1488.159046] INFO: Stall ended before state dump start
>> [ 1551.164045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1533122 jiffies, g=78, c=77, q=4)
>> [ 1551.164045] INFO: Stall ended before state dump start
>> [ 1614.169057] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1596127 jiffies, g=78, c=77, q=4)
>> [ 1614.169057] INFO: Stall ended before state dump start
>> [ 1677.174060] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1659132 jiffies, g=78, c=77, q=4)
>> [ 1677.174060] INFO: Stall ended before state dump start
>> [ 1740.179045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1722137 jiffies, g=78, c=77, q=4)
>> [ 1740.179045] INFO: Stall ended before state dump start
>> [ 1803.184075] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1785142 jiffies, g=78, c=77, q=4)
>> [ 1803.184101] INFO: Stall ended before state dump start
>> [ 1866.189046] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1848147 jiffies, g=78, c=77, q=4)
>> [ 1866.189046] INFO: Stall ended before state dump start
>> [ 1929.194045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1911152 jiffies, g=78, c=77, q=4)
>> [ 1929.194045] INFO: Stall ended before state dump start
>> [ 1992.199083] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=1974157 jiffies, g=78, c=77, q=4)
>> [ 1992.199083] INFO: Stall ended before state dump start
>> [ 2055.204098] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=2037162 jiffies, g=78, c=77, q=4)
>> [ 2055.204098] INFO: Stall ended before state dump start
>> [ 2118.209045] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=2100167 jiffies, g=78, c=77, q=4)
>> [ 2118.209045] INFO: Stall ended before state dump start
>> [ 2181.214098] INFO: rcu_preempt detected stalls on CPUs/tasks: {} (detected by 0, t=2163172 jiffies, g=78, c=77, q=4)
>> [ 2181.214098] INFO: Stall ended before state dump start
>
> Sorry, i made a mistake, the log above is based on v3.10.63.
> I have tested the latest upstream kernel (based on ec6f34e5b552),
> but my test case can't triggered RCU stall warning.
>
I used git bisect to find the bad commit, but found a irrelevant
commit(db5d711e2db776 zram: avoid null access when fail to alloc meta).
Before this commit, my test case can easily trigger RCU stall warning,
but after this commit, my test case can't trigger RCU stall warning.
I'm totally confused.
My test case:
//
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/spinlock.h>
#include <linux/slab.h>
#include <linux/kthread.h>
struct foo {
int a;
char b;
long c;
};
struct foo gbl_foo = {1, 'a', 2};
struct foo *pgbl_foo = &gbl_foo;
static int my_kthread(void *data)
{
DEFINE_SPINLOCK(foo_mutex);
struct foo *new_fp;
struct foo *old_fp;
new_fp = kmalloc(sizeof(*new_fp), GFP_KERNEL);
spin_lock(&foo_mutex);
old_fp = pgbl_foo;
*new_fp = *old_fp;
new_fp->a = 2;
rcu_assign_pointer(pgbl_foo, new_fp);
spin_unlock(&foo_mutex);
synchronize_rcu();
kfree(old_fp);
return 0;
}
static int hello_start(void)
{
struct task_struct *my_task = NULL;
DEFINE_SPINLOCK(hello_lock);
my_task = kthread_run(my_kthread, NULL, "my_thread%d", 1);
spin_lock_init(&hello_lock);
spin_lock(&hello_lock);
spin_lock(&hello_lock);
return 0;
}
static int __init test_init(void)
{
hello_start();
printk(KERN_INFO "Module init\n");
return 0;
}
static void __exit test_exit(void)
{
printk(KERN_INFO "Module exit!\n");
}
module_init(test_init)
module_exit(test_exit)
MODULE_LICENSE("GPL");
//
>>>
>>> Cheers,
>>> Don
>>>
>>> diff --git a/kernel/watchdog.c b/kernel/watchdog.c
>>> index 70bf118..833c015 100644
>>> --- a/kernel/watchdog.c
>>> +++ b/kernel/watchdog.c
>>> @@ -209,7 +209,7 @@ void touch_nmi_watchdog(void)
>>> * going off.
>>> */
>>> raw_cpu_write(watchdog_nmi_touch, true);
>>> - touch_softlockup_watchdog();
>>> + //touch_softlockup_watchdog();
>>> }
>>> EXPORT_SYMBOL(touch_nmi_watchdog);
>>>
>>>
>>> .
>>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/