RE: IDE + SMP Lockup (no OOPS) in 2.2.12, 2.2.10

Tom Livingston (tsl@volition.org)
Wed, 29 Sep 1999 08:13:42 -0700


------=_NextPart_000_008A_01BF0A52.8AD564A0
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: 7bit

Andre Hedrick wrote:
> I just now have a way to produce the crash as of tonight.
> What I need to know is what parallel races are we referring/hunting and
> against which kernels?

I have seen one race. At this point it appears to me to trigger any time you
have concurrent access to two ide channels sharing the same interrupt while
running SMP. As you will see below, I have duplicated it with the hpt-366
controller as well.

I have tried:
2.2.6 (with or without ide patch): works in SMP mode
2.2.7 (same): works in SMP (as I remember, haven't tested recently)
2.2.[89] : haven't tested
2.2.10 and later: crashes in SMP mode
2.2.12 +
2.3.18: crashes in SMP mode

It would seem that this is actually a generic ide/smp bug, and not one that is
specific to the promise controller. I was able to cause the same behavior
tonight with the onboard hpt-366 controller.

kernel: 2.2.12 + ide.2.2.12.19990925.patch.gz + 2.2.12-ikd1.gz

Tested my normal crashing setup, one drive on each channel of the pdc20246. Got
the normal ikd NMI oopser oops. It looked like the other one I reported. The
oops is attached as a text file.

Then I moved the drives, one per channel, to the onboard hpt-366. I had
previously commented out the two lines in ide-pci.c at line 630 that say:

    if (dev2 && hpt363_shared_irq)
        return;

so that I could enable the second channel for the test.

I did my standard crashme, which I have simplified to
'dd if=/dev/hdi of=/dev/null & dd if=/dev/hdk of=/dev/null' (you only need to
read one block from each drive to trigger the crash), and with the hpt-366 got
what looks to me like the same lockup as with the pdc20246. This oops is also
attached.
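
The same two-reader test can be sketched in C. This is only an illustration
under assumptions, not the dd command quoted above: the function names
read_one_block and concurrent_read are hypothetical, a block is taken to be
512 bytes, and the device paths are supplied by the caller.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read a single 512-byte block from path. Returns 0 on success, 1 on error. */
static int read_one_block(const char *path)
{
    char buf[512];
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 1;
    ssize_t n = read(fd, buf, sizeof buf);
    close(fd);
    return n < 0 ? 1 : 0;
}

/*
 * Fork one reader per device so both reads are in flight together,
 * then wait for both. Returns the number of readers that failed.
 */
static int concurrent_read(const char *dev_a, const char *dev_b)
{
    const char *devs[2] = { dev_a, dev_b };
    pid_t pids[2];
    int failures = 0;

    for (int i = 0; i < 2; i++) {
        pids[i] = fork();
        if (pids[i] == 0)
            _exit(read_one_block(devs[i]));   /* child: read and exit */
    }
    for (int i = 0; i < 2; i++) {
        int status = 0;
        if (pids[i] < 0 || waitpid(pids[i], &status, 0) < 0 ||
            !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failures++;
    }
    return failures;
}
```

Calling concurrent_read("/dev/hdi", "/dev/hdk") would mirror the two dd
processes. Both readers are forked before either is waited on, which is what
makes the accesses concurrent.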

I retested this configuration in UP mode and found it completely stable, even
with the 2nd channel disabling removed. My abit bp-6 bios is the original
LP revision; I have never flashed it. The board is a "newer" board, bought
about 8/1. It has plastic cpu handles, as opposed to the metal ones on my first
board, bought in early June. I cannot see any silk screening indicating the
revision on the board.

I was thinking that this might have caused your impression that there is a
buggy chipset/revision out there that needs this 2nd channel disabled on the
hpt-366. Or is it two bugs? One is the motherboard, and if that's OK you
still have the multichannel + smp bug?

Again, the oopses from both configurations are attached as MIME parts.
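
One possible reading of the attached traces, written as a tiny C model. It is
an interpretation, not a confirmed diagnosis. The traces show one CPU spinning
in disable_irq (called from ide_do_request) and the other spinning in
stext_lock on its way back to ide_intr. The model assumes, without proof from
the traces, that the request path still holds the lock the handler is waiting
for. The struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical snapshot of the two code paths seen in the traces. */
struct snapshot {
    bool smp;                /* a handler can run on another CPU */
    bool request_holds_lock; /* ASSUMPTION: request path holds the lock
                                that the interrupt handler wants */
    bool irq_in_progress;    /* handler for the shared IRQ is running */
};

/* Request path (disable_irq): on SMP it spins until the handler is done. */
static bool request_path_can_proceed(const struct snapshot *s)
{
    return !(s->smp && s->irq_in_progress);
}

/* Handler path (stext_lock, ide_intr): spins until the lock is free. */
static bool handler_can_proceed(const struct snapshot *s)
{
    return !s->request_holds_lock;
}

/* Deadlock: each side waits for something only the other can release. */
static bool is_deadlocked(const struct snapshot *s)
{
    return !request_path_can_proceed(s) && !handler_can_proceed(s);
}
```

Under these assumptions the SMP case wedges while the UP case does not, which
would at least be consistent with the UP result reported above.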

Tom

------=_NextPart_000_008A_01BF0A52.8AD564A0
Content-Type: text/plain;
name="hpt-decoded-oops-2.2.12.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
filename="hpt-decoded-oops-2.2.12.txt"

CPU: 1
EIP: 0010:[<c01d3455>]
EFLAGS: 00000002
eax: c018a03c ebx: c7fec480 ecx: c0246e3c edx: 0000000b
esi: 24000001 edi: 00000082 ebp: c7ffbf18 esp: c7ffbf0c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 1, stackpage=c7ffb000)
Stack: 00000004 c0197828 c0246e00 c7ffbf38 c010ae17 0000000b c7fec480 c7ffbf74
00000160 c7ff82a0 0000000b c7ffbf58 c0113517 0000000b c7ffbf74 c7ff82a0
c7ffbf74 c7ffa000 c0239740 c7ffbf6c c010af9b 0000000b c7ffbf74 c7ffa000
Call Trace: [<c0197828>] [<c010ae17>] [<c0113517>] [<c010af9b>] [<c0109a18>] [<c0120018>] [<c0106fa5>]
[<c01116fc>] [<c01116e4>] [<c01072a6>]
Code: f6 03 01 75 fb e9 ed 6b fb ff f6 05 64 46 21 c0 01 75 f7 e9

>>EIP: c01d3455 <stext_lock+255d/4a88>
Trace: c0197828 <ide_dma_intr+0/90>
Trace: c010ae17 <handle_IRQ_event+53/88>
Trace: c0113517 <do_level_ioapic_IRQ+63/a4>
Trace: c010af9b <do_IRQ+3b/5c>
Trace: c0109a18 <common_interrupt+18/20>
Code: c01d3455 <stext_lock+255d/4a88> 00000000 <_EIP>: <===
Code: c01d3455 <stext_lock+255d/4a88> 0: f6 03 01 testb $0x1,(%ebx) <===
Code: c01d3458 <stext_lock+2560/4a88> 3: 75 fb jne 0 <_EIP>
Code: c01d345a <stext_lock+2562/4a88> 5: e9 ed 6b fb ff jmp c018a04c <ide_intr+10/e4>
Code: c01d345f <stext_lock+2567/4a88> a: f6 05 64 46 21 c0 01 testb $0x1,0xc0214664
Code: c01d3466 <stext_lock+256e/4a88> 11: 75 f7 jne c01d345f <stext_lock+2567/4a88>
Code: c01d3468 <stext_lock+2570/4a88> 13: e9 00 00 00 00 jmp c01d346d <stext_lock+2575/4a88>

Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
CPU: 0
EIP: 0010:[<c010aee0>]
EFLAGS: 00000002
eax: 00000160 ebx: 0000000b ecx: ffffd000 edx: c0222000
esi: c0246fb4 edi: c0246f78 ebp: c14f7c7c esp: c14f7c7c
ds: 0018 es: 0018 ss: 0018
Process dd (pid: 570, process nr: 43, stackpage=c14f7000)
Stack: c14f7cc0 c0189957 0000000b c7fec480 c14f6000 c6649480 c6649480 c0246fb4
00000000 c0246fb4 00000000 00000216 00000039 c14f7cf0 c14f7cf0 00000046
c0246f78 c14f7cdc c0189c2c c7fec480 c14f7cd8 00000000 00000246 00000086
Call Trace: [<c0189957>] [<c0189c2c>] [<c0189d33>] [<c018742c>] [<c012d02e>] [<c01314da>] [<c011692b>]
[<c012dd2d>] [<c0145121>] [<c0139967>] [<c0145cdd>] [<c01336e0>] [<c01217
[<c012c160>] [<c0108828>]
Code: 75 fa 8b 5d fc 89 ec 5d c3 8d 76 00 55 89 e5 57 56 53 8b 55

>>EIP: c010aee0 <disable_irq+34/40>
Trace: c0189957 <ide_do_request+2f7/550>
Trace: c0189c2c <do_hwgroup_request+48/5c>
Trace: c0189d33 <do_ide5_request+17/2c>
Trace: c018742c <unplug_device+48/5c>
Trace: c012d02e <__wait_on_buffer+c6/140>
Code: c010aee0 <disable_irq+34/40> 00000000 <_EIP>: <===
Code: c010aee0 <disable_irq+34/40> 0: 75 fa jne c010aedc <disable_irq+30/40> <===
Code: c010aee2 <disable_irq+36/40> 2: 8b 5d fc movl 0xfffffffc(%ebp),%ebx
Code: c010aee5 <disable_irq+39/40> 5: 89 ec movl %ebp,%esp
Code: c010aee7 <disable_irq+3b/40> 7: 5d popl %ebp
Code: c010aee8 <disable_irq+3c/40> 8: c3 ret
Code: c010aee9 <disable_irq+3d/40> 9: 8d 76 00 leal 0x0(%esi),%esi
Code: c010aeec <enable_irq+0/74> c: 55 pushl %ebp
Code: c010aeed <enable_irq+1/74> d: 89 e5 movl %esp,%ebp
Code: c010aeef <enable_irq+3/74> f: 57 pushl %edi
Code: c010aef0 <enable_irq+4/74> 10: 56 pushl %esi
Code: c010aef1 <enable_irq+5/74> 11: 53 pushl %ebx
Code: c010aef2 <enable_irq+6/74> 12: 8b 55 00 movl 0x0(%ebp),%edx

CPU: 0
EIP: 0010:[<c01121b9>]
EFLAGS: 00000002
eax: 00000000 ebx: c7ff8220 ecx: 00000001 edx: 00000000
esi: 00000001 edi: 00000000 ebp: c14f7b18 esp: c14f7b20
ds: 0018 es: 0018 ss: 0018
Process dd (pid: 570, process nr: 43, stackpage=c14f7000)
Stack: c14f7b74 c010a87a c7ff8220 00000001 00000001 00000001 00000000 c14f7b74
00000000 00000018 c14f0018 ffffffff c0108998 00000010 00000246 c010ae04
00000010 00000246 00000020 c7ff8220 c0222020 c14f7b94 c0113483 00000001
Call Trace: [<c010a87a>] [<c0108998>] [<c010ae04>] [<c0113483>] [<c010af9b>] [<c0109a18>] [<c011d971>]
[<c011dc0a>] [<c0109383>] [<c01d48c0>] [<c01089b6>] [<c010aee0>] [<c01899
[<c018742c>] [<c012d02e>] [<c01314da>] [<c011692b>] [<c012dd2d>] [<c01451
[<c01336e0>] [<c01217ec>] [<c0115ef0>] [<c012c274>] [<c012c160>] [<c01088
Code: eb fd 90 eb fe 89 f6 55 89 e5 e8 cc ff ff ff 89 ec 5d c3 55

>>EIP: c01121b9 <stop_this_cpu+25/2c>
Trace: c010a87a <stop_cpu_interrupt+1a/20>
Trace: c0108998 <nmi+0/30>
Trace: c010ae04 <handle_IRQ_event+40/88>
Trace: c0113483 <do_edge_ioapic_IRQ+73/a4>
Trace: c010af9b <do_IRQ+3b/5c>
Code: c01121b9 <stop_this_cpu+25/2c> 00000000 <_EIP>: <===
Code: c01121b9 <stop_this_cpu+25/2c> 0: eb fd jmp c01121b8 <stop_this_cpu+24/2c> <===
Code: c01121bb <stop_this_cpu+27/2c> 2: 90 nop
Code: c01121bc <stop_this_cpu+28/2c> 3: eb fe jmp c01121bc <stop_this_cpu+28/2c>
Code: c01121be <stop_this_cpu+2a/2c> 5: 89 f6 movl %esi,%esi
Code: c01121c0 <smp_stop_cpu_interrupt+0/c> 7: 55 pushl %ebp
Code: c01121c1 <smp_stop_cpu_interrupt+1/c> 8: 89 e5 movl %esp,%ebp
Code: c01121c3 <smp_stop_cpu_interrupt+3/c> a: e8 cc ff ff ff call c0112194 <stop_this_cpu+0/2c>
Code: c01121c8 <smp_stop_cpu_interrupt+8/c> f: 89 ec movl %ebp,%esp
Code: c01121ca <smp_stop_cpu_interrupt+a/c> 11: 5d popl %ebp
Code: c01121cb <smp_stop_cpu_interrupt+b/c> 12: c3 ret
Code: c01121cc <smp_call_function_interrupt+0/3c> 13: 55 pushl %ebp

Aiee, killing interrupt handler
Unable to handle kernel NULL pointer dereference at virtual address 00000000

Then there is a series of the null pointer oopses

------=_NextPart_000_008A_01BF0A52.8AD564A0
Content-Type: text/plain;
name="pdc-decoded-oops-2.2.12.txt"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
filename="pdc-decoded-oops-2.2.12.txt"

CPU: 1
EIP: 0010:[<c01d3458>]
EFLAGS: 00000002
eax: c018a03c ebx: c7fec480 ecx: c0246b4c edx: 0000000a
esi: 24000001 edi: 00000082 ebp: c7ffbf18 esp: c7ffbf0c
ds: 0018 es: 0018 ss: 0018
Process swapper (pid: 0, process nr: 1, stackpage=c7ffb000)
Stack: 00000004 c0197828 c0246b10 c7ffbf38 c010ae17 0000000a c7fec480 c7ffbf74
00000140 c7ff82a0 0000000a c7ffbf58 c0113517 0000000a c7ffbf74 c7ff82a0
c7ffbf74 c7ffa000 c0239740 c7ffbf6c c010af9b 0000000a c7ffbf74 c7ffa000
Call Trace: [<c0197828>] [<c010ae17>] [<c0113517>] [<c010af9b>] [<c0109a18>] [<c0120018>] [<c0106fa5>]
[<c01116fc>] [<c01116e4>] [<c01072a6>]
Code: 75 fb e9 ed 6b fb ff f6 05 64 46 21 c0 01 75 f7 e9 d7 6d fb

>>EIP: c01d3458 <stext_lock+2560/4a88>
Trace: c0197828 <ide_dma_intr+0/90>
Trace: c010ae17 <handle_IRQ_event+53/88>
Trace: c0113517 <do_level_ioapic_IRQ+63/a4>
Trace: c010af9b <do_IRQ+3b/5c>
Trace: c0109a18 <common_interrupt+18/20>
Code: c01d3458 <stext_lock+2560/4a88> 00000000 <_EIP>: <===
Code: c01d3458 <stext_lock+2560/4a88> 0: 75 fb jne c01d3455 <stext_lock+255d/4a88> <===
Code: c01d345a <stext_lock+2562/4a88> 2: e9 ed 6b fb ff jmp c018a04c <ide_intr+10/e4>
Code: c01d345f <stext_lock+2567/4a88> 7: f6 05 64 46 21 c0 01 testb $0x1,0xc0214664
Code: c01d3466 <stext_lock+256e/4a88> e: 75 f7 jne c01d345f <stext_lock+2567/4a88>
Code: c01d3468 <stext_lock+2570/4a88> 10: e9 d7 6d fb 00 jmp c118a244 <END_OF_CODE+f42160/????>

Aiee, killing interrupt handler
Kernel panic: Attempted to kill the idle task!
CPU: 0
EIP: 0010:[<c010aedc>]
EFLAGS: 00000002
eax: 00000140 ebx: 0000000a ecx: ffffd000 edx: c0222000
esi: c0246cc4 edi: c0246c88 ebp: c1337c7c esp: c1337c7c
ds: 0018 es: 0018 ss: 0018
Process dd (pid: 571, process nr: 43, stackpage=c1337000)
Stack: c1337cc0 c0189957 0000000a c7fec480 c1336000 c1677f20 c1677f20 c0246cc4
00000000 c0246cc4 00000000 00000283 00000022 c1337cf0 c1337cf0 00000046
c0246c88 c1337cdc c0189c2c c7fec480 c1337cd8 00000000 00000246 00000086
Call Trace: [<c0189957>] [<c0189c2c>] [<c0189cdb>] [<c018742c>] [<c012d02e>] [<c01314da>] [<c011692b>]
[<c012dd2d>] [<c0145121>] [<c0139967>] [<c0145cdd>] [<c01336e0>] [<c01217
[<c0117751>] [<c012c274>] [<c012c160>] [<c0108828>]
Code: f6 04 10 01 75 fa 8b 5d fc 89 ec 5d c3 8d 76 00 55 89 e5 57

>>EIP: c010aedc <disable_irq+30/40>
Trace: c0189957 <ide_do_request+2f7/550>
Trace: c0189c2c <do_hwgroup_request+48/5c>
Trace: c0189cdb <do_ide3_request+17/2c>
Trace: c018742c <unplug_device+48/5c>
Trace: c012d02e <__wait_on_buffer+c6/140>
Code: c010aedc <disable_irq+30/40> 00000000 <_EIP>: <===
Code: c010aedc <disable_irq+30/40> 0: f6 04 10 01 testb $0x1,(%eax,%edx,1) <===
Code: c010aee0 <disable_irq+34/40> 4: 75 fa jne 0 <_EIP>
Code: c010aee2 <disable_irq+36/40> 6: 8b 5d fc movl 0xfffffffc(%ebp),%ebx
Code: c010aee5 <disable_irq+39/40> 9: 89 ec movl %ebp,%esp
Code: c010aee7 <disable_irq+3b/40> b: 5d popl %ebp
Code: c010aee8 <disable_irq+3c/40> c: c3 ret
Code: c010aee9 <disable_irq+3d/40> d: 8d 76 00 leal 0x0(%esi),%esi
Code: c010aeec <enable_irq+0/74> 10: 55 pushl %ebp
Code: c010aeed <enable_irq+1/74> 11: 89 e5 movl %esp,%ebp
Code: c010aeef <enable_irq+3/74> 13: 57 pushl %edi

CPU: 0
EIP: 0010:[<c01121b9>]
EFLAGS: 00000002
eax: 00000000 ebx: 00000000 ecx: 00000028 edx: 00000000
esi: 00000000 edi: c1336000 ebp: c1337b94 esp: c1337b9c
ds: 0018 es: 0018 ss: 0018
Process dd (pid: 571, process nr: 43, stackpage=c1337000)
Stack: c1337bf4 c010a87a 00000000 00000028 ffffffac 00000000 c1336000 c1337bf4
00000030 c1330018 c1330018 ffffffff c0108998 00000010 00000246 c011d971
00000010 00000246 c09a58a0 00000100 c1336000 c1336000 c1337c14 c011dc0a
Call Trace: [<c010a87a>] [<c0108998>] [<c011d971>] [<c011dc0a>] [<c0109383>] [<c01d48c0>] [<c01089b6>]
[<c010aedc>] [<c0189957>] [<c0189c2c>] [<c0189cdb>] [<c018742c>] [<c012d0
[<c012dd2d>] [<c0145121>] [<c0139967>] [<c0145cdd>] [<c01336e0>] [<c01217
[<c0117751>] [<c012c274>] [<c012c160>] [<c0108828>]
Code: eb fd 90 eb fe 89 f6 55 89 e5 e8 cc ff ff ff 89 ec 5d c3 55

>>EIP: c01121b9 <stop_this_cpu+25/2c>
Trace: c010a87a <stop_cpu_interrupt+1a/20>
Trace: c0108998 <nmi+0/30>
Trace: c011d971 <exit_notify+24d/264>
Trace: c011dc0a <do_exit+282/2c8>
Trace: c0109383 <do_nmi+73/8c>
Code: c01121b9 <stop_this_cpu+25/2c> 00000000 <_EIP>: <===
Code: c01121b9 <stop_this_cpu+25/2c> 0: eb fd jmp c01121b8 <stop_this_cpu+24/2c> <===
Code: c01121bb <stop_this_cpu+27/2c> 2: 90 nop
Code: c01121bc <stop_this_cpu+28/2c> 3: eb fe jmp c01121bc <stop_this_cpu+28/2c>
Code: c01121be <stop_this_cpu+2a/2c> 5: 89 f6 movl %esi,%esi
Code: c01121c0 <smp_stop_cpu_interrupt+0/c> 7: 55 pushl %ebp
Code: c01121c1 <smp_stop_cpu_interrupt+1/c> 8: 89 e5 movl %esp,%ebp
Code: c01121c3 <smp_stop_cpu_interrupt+3/c> a: e8 cc ff ff ff call c0112194 <stop_this_cpu+0/2c>
Code: c01121c8 <smp_stop_cpu_interrupt+8/c> f: 89 ec movl %ebp,%esp
Code: c01121ca <smp_stop_cpu_interrupt+a/c> 11: 5d popl %ebp
Code: c01121cb <smp_stop_cpu_interrupt+b/c> 12: c3 ret
Code: c01121cc <smp_call_function_interrupt+0/3c> 13: 55 pushl %ebp

------=_NextPart_000_008A_01BF0A52.8AD564A0--
