As you may remember, I have been tracking down the very quick, repeatable
lockups I have been experiencing with the PDC20246 and SMP. I was able to
obtain an OOPS and decode it, so I am forwarding it at this point.
As a refresher, my system is: RH 6.0 base, stock 2.2.12 kernel, Abit BP-6
motherboard, dual Celeron 366 CPUs, and a single PDC20246 with two drives, one
connected to each channel.
I've now experienced this lockup on three separate BP-6 based systems (my
only SMP systems available, unfortunately). One is all IDE, and has 3
PDC20246 controllers and a total of 11 drives. The two others are normally SCSI
based systems (ncr53c8xx), but with the addition of a single PDC20246 card
they lock up as well.
In tracking it down, it appears the basic cause is using both channels on a
PDC20246 while in SMP mode. Two drives on one channel (with the other free)
will not hang the system, nor will two drives on two controllers (using only
one channel on each controller). However, whenever both channels are used on
one PDC20246, the systems lock up within seconds.
I've previously included detailed boot logs, and much more detailed
information about my hardware, so I won't use up the bandwidth again.
However, I'd be happy to send any additional information required.
To cause this behavior, boot the 2.2.12 kernel with SMP, and cause
concurrent access on both drives. I do:
'hdparm -t /dev/hde & hdparm -t /dev/hdg'
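For anyone who wants to repeat the test, the command above can be wrapped in a small shell function. This is only a sketch: the function name and the DRY_RUN switch are made up for illustration, and /dev/hde and /dev/hdg are simply the device names from this system. It starts one timed read per drive in the background so that both channels are busy at once.

```shell
#!/bin/sh
# Sketch only: start "hdparm -t" on every listed drive at the same time.
# stress_both_channels and DRY_RUN are hypothetical names, not part of hdparm.
# With DRY_RUN=1 the commands are printed instead of being run.

stress_both_channels() {
    for dev in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "hdparm -t $dev"
        else
            # Background each read so the drives are accessed concurrently.
            hdparm -t "$dev" &
        fi
    done
    # Block until every background read has finished. On an affected
    # SMP system the machine may hang before this returns.
    wait
}

# Example (run as root; device names are the ones from this report):
#   stress_both_channels /dev/hde /dev/hdg
```

Running it with DRY_RUN=1 first is a cheap way to check the device list before risking a hang.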
Note that I can also cause this lockup when using 2.2.10, 2.2.13pre12, and
2.2.12 + the ide patch. All cause the same behavior, so I chose to report
this OOPS using 2.2.12 as the most 'vanilla'. This lockup does not occur at
all with 2.2.6, the last stable revision I used. Any kernel version is
stable in UP mode or with maxcpus=1.
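When comparing SMP runs against UP or maxcpus=1 runs, it can help to confirm how many CPUs the booted kernel actually brought up. A minimal sketch, assuming /proc/cpuinfo contains one line starting with "processor" per active CPU (as it does on i386); the helper name count_cpus is hypothetical:

```shell
#!/bin/sh
# Sketch only: count the "processor" entries in a cpuinfo-style file.
# count_cpus is a hypothetical helper; it reads /proc/cpuinfo by default.

count_cpus() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

# Example:
#   count_cpus    # prints the number of CPUs the running kernel is using
```

On a dual-CPU board, a result of 1 after booting with maxcpus=1 indicates the second CPU was left offline for that test.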
The system normally hangs without an OOPS, so to generate one I patched
in ikd and enabled the NMI oopser with IRQ 0 as its source.
Note that this is the first time I have reported an OOPS, so if I have done
something stupid, I'm sorry... please let me know. I am very motivated to
help get this fixed, and am happy to rerun these tests or report more
detailed information.
Without further ado, the OOPS (there are three in a row):
CPU: 1
EIP: 0010:[<c01d0e1d>]
EFLAGS: 00000002
eax: c0189f04 ebx: c7fec480 ecx: c0244aa4 edx: 0000000a
esi: 24000001 edi: 00000082 ebp: c7ffbf18 esp: c7ffbf0c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 1, stackpage=c7ffb000)
Stack: 00000004 c0196a3c c0244a88 c7ffbf38 c010ae17 0000000a c7fec480 c7ffbf74
       00000140 c7ff82a0 0000000a c7ffbf58 c0113517 0000000a c7ffbf74 c7ff82a0
       c7ffbf74 c7ffa000 c0237720 c7ffbf6c c010af9b 0000000a c7ffbf74 c7ffa000
Call Trace: [<c0196a3c>] [<c010ae17>] [<c0113517>] [<c010af9b>] [<c0109a18>]
[<c0120018>] [<c0106fa5>]
[<c01116fc>] [<c01116e4>] [<c01072a6>]
Code: f6 03 01 75 fb e9 ed 90 fb ff f6 05 24 15 21 c0 01 75 f7 e9
>>EIP: c01d0e1d <stext_lock+2525/4a08>
Trace: c0196a3c <ide_dma_intr+0/90>
Trace: c010ae17 <handle_IRQ_event+53/88>
Trace: c0113517 <do_level_ioapic_IRQ+63/a4>
Trace: c010af9b <do_IRQ+3b/5c>
Trace: c0109a18 <common_interrupt+18/20>
Code: c01d0e1d <stext_lock+2525/4a08> 00000000 <_EIP>: <===
Code: c01d0e1d <stext_lock+2525/4a08> 0: f6 03 01
testb $0x1,(%ebx) <===
Code: c01d0e20 <stext_lock+2528/4a08> 3: 75 fb
jne 0 <_EIP>
Code: c01d0e22 <stext_lock+252a/4a08> 5: e9 ed 90 fb ff
jmp c0189f14 <ide_intr+10/e4>
Code: c01d0e27 <stext_lock+252f/4a08> a: f6 05 24 15 21 c0 01
testb $0x1,0xc0211524
Code: c01d0e2e <stext_lock+2536/4a08> 11: 75 f7
jne c01d0e27 <stext_lock+252f/4a08>
Code: c01d0e30 <stext_lock+2538/4a08> 13: e9 00 00 00 00
jmp c01d0e35 <stext_lock+253d/4a08>
Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
CPU: 0
EIP: 0010:[<c010aee0>]
EFLAGS: 00000002
eax: 00000140 ebx: 0000000a ecx: ffffd000 edx: c0220000
esi: c0244be8 edi: c0244bcc ebp: c69f9c7c esp: c69f9c7c
ds: 0018 es: 0018 ss: 0018
Process hdparm (pid: 566, process nr: 43, stackpage=c69f9000)
Stack: c69f9cc0 c01898ec 0000000a c7fec480 c69f8000 c44d0a80 c44d0a80 c0244be8
       00000000 c0244be8 00000000 00000283 00000022 c69f9cf0 c69f9cf0 00000046
       c0244bcc c69f9cdc c0189bc0 c7fec480 c69f9cd8 00000000 00000246 00000086
Call Trace: [<c01898ec>] [<c0189bc0>] [<c0189c6f>] [<c018742c>] [<c012d02e>]
[<c01314da>] [<c012c160>]
[<c0108828>] [<c010002b>]
Code: 75 fa 8b 5d fc 89 ec 5d c3 8d 76 00 55 89 e5 57 56 53 8b 55
>>EIP: c010aee0 <disable_irq+34/40>
Trace: c01898ec <ide_do_request+2f0/548>
Trace: c0189bc0 <do_hwgroup_request+48/5c>
Trace: c0189c6f <do_ide3_request+17/2c>
Trace: c018742c <unplug_device+48/5c>
Trace: c012d02e <__wait_on_buffer+c6/140>
Code: c010aee0 <disable_irq+34/40> 00000000 <_EIP>: <===
Code: c010aee0 <disable_irq+34/40> 0: 75 fa
jne c010aedc <disable_irq+30/40> <===
Code: c010aee2 <disable_irq+36/40> 2: 8b 5d fc
movl 0xfffffffc(%ebp),%ebx
Code: c010aee5 <disable_irq+39/40> 5: 89 ec
movl %ebp,%esp
Code: c010aee7 <disable_irq+3b/40> 7: 5d
popl %ebp
Code: c010aee8 <disable_irq+3c/40> 8: c3
ret
Code: c010aee9 <disable_irq+3d/40> 9: 8d 76 00
leal 0x0(%esi),%esi
Code: c010aeec <enable_irq+0/74> c: 55
pushl %ebp
Code: c010aeed <enable_irq+1/74> d: 89 e5
movl %esp,%ebp
Code: c010aeef <enable_irq+3/74> f: 57
pushl %edi
Code: c010aef0 <enable_irq+4/74> 10: 56
pushl %esi
Code: c010aef1 <enable_irq+5/74> 11: 53
pushl %ebx
Code: c010aef2 <enable_irq+6/74> 12: 8b 55 00
movl 0x0(%ebp),%edx
CPU: 0
EIP: 0010:[<c01121b9>]
EFLAGS: 00000002
eax: 00000000 ebx: 00000000 ecx: 00000032 edx: 00000000
esi: 00000000 edi: c69f8000 ebp: c69f9b94 esp: c69f9b9c
ds: 0018 es: 0018 ss: 0018
Process hdparm (pid: 566, process nr: 43, stackpage=c69f9000)
Stack: c69f9bf4 c010a87a 00000000 00000032 ffffffac 00000000 c69f8000 c69f9bf4
       00000030 c69f0018 c69f0018 ffffffff c0108998 00000010 00000246 c011d971
       00000010 00000246 c70588a0 00000100 c69f8000 c69f8000 c69f9c14 c011dc0a
Call Trace: [<c010a87a>] [<c0108998>] [<c011d971>] [<c011dc0a>] [<c0109383>]
[<c01d2240>] [<c01089b6>]
[<c010aee0>] [<c01898ec>] [<c0189bc0>] [<c0189c6f>] [<c018742c>]
[<c012d0
[<c0108828>] [<c010002b>]
Code: eb fd 90 eb fe 89 f6 55 89 e5 e8 cc ff ff ff 89 ec 5d c3 55
>>EIP: c01121b9 <stop_this_cpu+25/2c>
Trace: c010a87a <stop_cpu_interrupt+1a/20>
Trace: c0108998 <nmi+0/30>
Trace: c011d971 <exit_notify+24d/264>
Trace: c011dc0a <do_exit+282/2c8>
Trace: c0109383 <do_nmi+73/8c>
Code: c01121b9 <stop_this_cpu+25/2c> 00000000 <_EIP>: <===
Code: c01121b9 <stop_this_cpu+25/2c> 0: eb fd
jmp c01121b8 <stop_this_cpu+24/2c> <===
Code: c01121bb <stop_this_cpu+27/2c> 2: 90
nop
Code: c01121bc <stop_this_cpu+28/2c> 3: eb fe
jmp c01121bc <stop_this_cpu+28/2c>
Code: c01121be <stop_this_cpu+2a/2c> 5: 89 f6
movl %esi,%esi
Code: c01121c0 <smp_stop_cpu_interrupt+0/c> 7: 55
pushl %ebp
Code: c01121c1 <smp_stop_cpu_interrupt+1/c> 8: 89 e5
movl %esp,%ebp
Code: c01121c3 <smp_stop_cpu_interrupt+3/c> a: e8 cc ff ff ff
call c0112194 <stop_this_cpu+0/2c>
Code: c01121c8 <smp_stop_cpu_interrupt+8/c> f: 89 ec
movl %ebp,%esp
Code: c01121ca <smp_stop_cpu_interrupt+a/c> 11: 5d
popl %ebp
Code: c01121cb <smp_stop_cpu_interrupt+b/c> 12: c3
ret
Code: c01121cc <smp_call_function_interrupt+0/3c> 13: 55
pushl %ebp
Regards,
Tom