Re: linux-next: Tree for March 25 (Call trace: RCU|workqueues|block|VFS|ext4 related?)

From: Sedat Dilek
Date: Fri Mar 25 2011 - 12:26:08 EST


On Fri, Mar 25, 2011 at 4:55 PM, Josh Triplett <josh@xxxxxxxxxxxxxxxx> wrote:
> On Fri, Mar 25, 2011 at 02:05:33PM +0100, Sedat Dilek wrote:
>> On Fri, Mar 25, 2011 at 11:16 AM, Sedat Dilek
>> <sedat.dilek@xxxxxxxxxxxxxx> wrote:
>> > The call traces appeared right after I had finished building a new
>> > linux-next kernel, booted into the desktop and archived my build tree
>> > (ext4) as a tarball to an external USB hard disk (the partition there
>> > is ext3).
>> > (Yesterday I saw similar call traces in my logs, but they were hard to
>> > reproduce [1].)
>> > I am unsure where the problem arises; if you have a hint, let me know.
>> >
>> > Regards,
>> > - Sedat -
>> >
>> > [1] http://lkml.org/lkml/2011/3/24/268
>> >
>> > P.S.: Attached are the dmesg outputs and my kernel-config
>> >
>>
>> I turned off the notebook for about two hours to rule out thermal
>> problems and spurious reports.
>> I booted back into the desktop and started the archive job as the first
>> thing while doing my usual daily work.
>> Yeah, it is reproducible.
> [...]
>> [  212.453822] EXT3-fs (sdb5): mounted filesystem with ordered data mode
>> [  273.224044] INFO: rcu_sched_state detected stall on CPU 0 (t=15000 jiffies)
>
> 15000 jiffies matches this 60-second gap, assuming you use HZ=250.
>

Yeah, CONFIG_HZ_250=y.
Should I change the HZ value?

Might it be worth changing the (new) default CONFIG_RCU_CPU_STALL_TIMEOUT=60?
Should I increase or decrease it?
Can I change the value from user-space? If yes, how?
Any other hints?
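
Just to double-check the numbers, here is a small user-space sketch of the
arithmetic. The values are taken from my config (see below); the assumption
that the stall detector fires after CONFIG_RCU_CPU_STALL_TIMEOUT * HZ jiffies
is mine, read out of the stall message above.

/* Hypothetical sanity check of the stall-report numbers -- not kernel code.
 * Assumes the detector arms its check at CONFIG_RCU_CPU_STALL_TIMEOUT * HZ
 * jiffies after the grace period starts.
 */
#include <stdio.h>

int main(void)
{
        const int hz = 250;              /* CONFIG_HZ=250 */
        const int stall_timeout_s = 60;  /* CONFIG_RCU_CPU_STALL_TIMEOUT=60 */
        const int stall_jiffies = stall_timeout_s * hz;

        /* 60 s * 250 Hz = 15000 jiffies, i.e. the "t=15000 jiffies" above */
        printf("stall threshold: %d jiffies = %d s at HZ=%d\n",
               stall_jiffies, stall_jiffies / hz, hz);
        return 0;
}

So the t=15000 in the report matches the 60-second default at HZ=250; the
timing itself looks consistent, the question is rather why the CPU appears
stalled at all.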

$ egrep -i '_HZ|_RCU' /boot/config-2.6.38-next20110325-2-686-iniza
CONFIG_TREE_RCU=y
# CONFIG_PREEMPT_RCU is not set
# CONFIG_RCU_TRACE is not set
CONFIG_RCU_FANOUT=32
# CONFIG_RCU_FANOUT_EXACT is not set
CONFIG_RCU_FAST_NO_HZ=y
# CONFIG_TREE_RCU_TRACE is not set
CONFIG_NO_HZ=y
# CONFIG_HZ_100 is not set
CONFIG_HZ_250=y
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=250
# CONFIG_SPARSE_RCU_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60


>> [  273.224059] sending NMI to all CPUs:
>> [  273.224074] NMI backtrace for cpu 0
>> [  273.224081] Modules linked in: ext3 jbd bnep rfcomm bluetooth aes_i586 aes_generic binfmt_misc ppdev acpi_cpufreq mperf cpufreq_powersave cpufreq_userspace lp cpufreq_stats cpufreq_conservative fuse snd_intel8x0 snd_intel8x0m snd_ac97_codec ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm radeon thinkpad_acpi snd_seq_midi pcmcia ttm snd_rawmidi snd_seq_midi_event drm_kms_helper yenta_socket snd_seq pcmcia_rsrc drm pcmcia_core joydev snd_timer snd_seq_device snd i2c_algo_bit tpm_tis shpchp i2c_i801 tpm nsc_ircc irda snd_page_alloc soundcore pci_hotplug rng_core i2c_core tpm_bios psmouse crc_ccitt nvram parport_pc pcspkr parport evdev battery video ac processor power_supply serio_raw button arc4 ecb ath5k ath mac80211 cfg80211 rfkill autofs4 ext4 mbcache jbd2 crc16 dm_mod usbhid hid usb_storage uas sg sd_mod sr_mod crc_t10dif cdrom ata_generic ata_piix libata uhci_hcd ehci_hcd usbcore scsi_mod thermal e1000 thermal_sys floppy [last unloaded: scsi_wait_scan]
>> [  273.224367]
>> [  273.224377] Pid: 0, comm: swapper Not tainted 2.6.38-next20110325-2-686-iniza #1 IBM 2374SG6/2374SG6
>> [  273.224397] EIP: 0060:[<c11514f0>] EFLAGS: 00000807 CPU: 0
>> [  273.224414] EIP is at delay_tsc+0x16/0x5e
>> [  273.224424] EAX: 00090d42 EBX: 00002710 ECX: c133faf5 EDX: 00090d41
>> [  273.224435] ESI: 00000000 EDI: 00090d42 EBP: f5819e9c ESP: f5819e8c
>> [  273.224445]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
>> [  273.224458] Process swapper (pid: 0, ti=f5818000 task=c13e3fa0 task.ti=c13b6000)
>> [  273.224466] Stack:
>> [  273.224472]  00090d41 00002710 c13ee580 c13ee600 f5819ea4 c115149f f5819eac c11514bb
>> [  273.224497]  f5819eb8 c1016532 c13ee580 f5819ed4 c1078dc1 c134e61e c134e6c2 00000000
>> [  273.224520]  00003a98 f5c03488 f5819ee8 c1078e36 00000000 00000000 c13e3fa0 f5819ef4
>> [  273.224544] Call Trace:
>> [  273.224559]  [<c115149f>] __delay+0x9/0xb
>> [  273.224571]  [<c11514bb>] __const_udelay+0x1a/0x1c
>> [  273.224590]  [<c1016532>] arch_trigger_all_cpu_backtrace+0x50/0x62
>> [  273.224608]  [<c1078dc1>] check_cpu_stall+0x58/0xb8
>> [  273.224622]  [<c1078e36>] __rcu_pending+0x15/0xc4
>> [  273.224637]  [<c10791df>] rcu_check_callbacks+0x6d/0x93
>> [  273.224652]  [<c1039c6c>] update_process_times+0x2d/0x58
>> [  273.224666]  [<c10509e9>] tick_sched_timer+0x6b/0x9a
>> [  273.224682]  [<c1047196>] __run_hrtimer+0x9c/0x111
>> [  273.224694]  [<c105097e>] ? tick_sched_timer+0x0/0x9a
>> [  273.224708]  [<c1047b38>] hrtimer_interrupt+0xd6/0x1bb
>> [  273.224727]  [<c104fca1>] tick_do_broadcast.constprop.4+0x38/0x6a
>> [  273.224741]  [<c104fd80>] tick_handle_oneshot_broadcast+0xad/0xe1
>> [  273.224757]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.224772]  [<c1004215>] timer_interrupt+0x15/0x1c
>> [  273.224785]  [<c107536d>] handle_irq_event_percpu+0x4e/0x164
>> [  273.224799]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.224811]  [<c10754b9>] handle_irq_event+0x36/0x51
>> [  273.224824]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.224837]  [<c1076d0f>] handle_level_irq+0x4d/0x63
>> [  273.224845]  <IRQ>
>> [  273.224857]  [<c1003b8d>] ? do_IRQ+0x35/0x80
>> [  273.224871]  [<c12ac0f0>] ? common_interrupt+0x30/0x38
>> [  273.224886]  [<c10400d8>] ? destroy_worker+0x52/0x6c
>> [  273.224922]  [<f87b730f>] ? arch_local_irq_enable+0x5/0xb [processor]
>> [  273.224947]  [<f87b7ef5>] ? acpi_idle_enter_simple+0x100/0x138 [processor]
>> [  273.224964]  [<c11ebd92>] ? cpuidle_idle_call+0xc2/0x137
>> [  273.224978]  [<c1001da3>] ? cpu_idle+0x89/0xa3
>> [  273.224995]  [<c128c26c>] ? rest_init+0x58/0x5a
>> [  273.225008]  [<c1418722>] ? start_kernel+0x315/0x31a
>> [  273.225022]  [<c14180a2>] ? i386_start_kernel+0xa2/0xaa
>> [  273.225029] Code: e5 e8 d6 ff ff ff 5d c3 55 89 e5 8d 04 80 e8 c9 ff ff ff 5d c3 55 89 e5 57 89 c7 56 53 52 64 8b 35 04 20 47 c1 8d 76 00 0f ae e8 <e8> 6b ff ff ff 89 c3 8d 76 00 0f ae e8 e8 5e ff ff ff 89 c2 29
>> [  273.225154] Call Trace:
>> [  273.225166]  [<c115149f>] __delay+0x9/0xb
>> [  273.225178]  [<c11514bb>] __const_udelay+0x1a/0x1c
>> [  273.225192]  [<c1016532>] arch_trigger_all_cpu_backtrace+0x50/0x62
>> [  273.225207]  [<c1078dc1>] check_cpu_stall+0x58/0xb8
>> [  273.225220]  [<c1078e36>] __rcu_pending+0x15/0xc4
>> [  273.225234]  [<c10791df>] rcu_check_callbacks+0x6d/0x93
>> [  273.225247]  [<c1039c6c>] update_process_times+0x2d/0x58
>> [  273.225260]  [<c10509e9>] tick_sched_timer+0x6b/0x9a
>> [  273.225274]  [<c1047196>] __run_hrtimer+0x9c/0x111
>> [  273.225286]  [<c105097e>] ? tick_sched_timer+0x0/0x9a
>> [  273.225300]  [<c1047b38>] hrtimer_interrupt+0xd6/0x1bb
>> [  273.225316]  [<c104fca1>] tick_do_broadcast.constprop.4+0x38/0x6a
>> [  273.225330]  [<c104fd80>] tick_handle_oneshot_broadcast+0xad/0xe1
>> [  273.225345]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.225358]  [<c1004215>] timer_interrupt+0x15/0x1c
>> [  273.225370]  [<c107536d>] handle_irq_event_percpu+0x4e/0x164
>> [  273.225384]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.225396]  [<c10754b9>] handle_irq_event+0x36/0x51
>> [  273.225409]  [<c1076cc2>] ? handle_level_irq+0x0/0x63
>> [  273.225421]  [<c1076d0f>] handle_level_irq+0x4d/0x63
>> [  273.225429]  <IRQ>  [<c1003b8d>] ? do_IRQ+0x35/0x80
>> [  273.225450]  [<c12ac0f0>] ? common_interrupt+0x30/0x38
>> [  273.225464]  [<c10400d8>] ? destroy_worker+0x52/0x6c
>> [  273.225493]  [<f87b730f>] ? arch_local_irq_enable+0x5/0xb [processor]
>> [  273.225517]  [<f87b7ef5>] ? acpi_idle_enter_simple+0x100/0x138 [processor]
>> [  273.225532]  [<c11ebd92>] ? cpuidle_idle_call+0xc2/0x137
>> [  273.225545]  [<c1001da3>] ? cpu_idle+0x89/0xa3
>> [  273.225559]  [<c128c26c>] ? rest_init+0x58/0x5a
>> [  273.225571]  [<c1418722>] ? start_kernel+0x315/0x31a
>> [  273.225584]  [<c14180a2>] ? i386_start_kernel+0xa2/0xaa
>
> Interesting.  Looks like RCU detected a stall while the CPU sits in
> cpu_idle.  That *shouldn't* happen...
>

The 2nd dmesg suggests (since there are no more VFS/EXT4 call traces) that
the problem could come from the RCU side.

> - Josh Triplett
>

Regards,
- Sedat -