IDE + SMP lockups solved (for me: dual P3, promise66, 37GB IBM drives, RAID5)

thx@rivalnet.de
Sat, 2 Oct 1999 23:45:48 +0200


Like some other people, I also experienced lockups with EIDE drives and SMP.

If I understand the current situation right, there are two IDE+SMP problems right now (Kernel 2.2.12):

(1) lockups with two IDE controllers on the same interrupt
(2) general lockups, even with all controllers (or drives) on different interrupts

Supposedly (2) only happens with two or more disk drives, eventually only if on two different controllers. For (1), obviously, one needs two drives on two controllers, both on the same interrupt (as with promise controllers).

I experienced both problems with 4 EIDE drives, all on promise controllers.
I use Ingo Molnarīs raid patches. But I experienced the problems without the raid patches, too.

An easy method to reproduce (1) seems to be "hdparm -t <first drive> & hdparm -t <second drive>" (at least this reproduces it reliably for me).

I could reproduce (2) by starting the raid5 reconstruction process (mkraid). After a few minutes, the machine reliably was dead (obviously since raid5 reconstruction bangs hard on all drives simultaneously; anyway: reliable bugs are good bugs).

Amazingly, the solution for both problems seems to be Tom Livingstonīs patch. I.e. it not only resolves (1) (which can also be resolved by not using two controllers on the same interrupt), but it also solves (2). For me, at least.

Unfortunately, my current kernel is not really representative or clean (I need a number of patches).
It is:
- based on 2.2.13pre14
- patched with Andre Hedricks IDE patch (19990925, for promise UDMA66)
- patched with Ingo Molnarīs raid5 patch
- patched with Andries Brouwerīs large disk patch (for 2.3.14, considered as "probably unusable" by him, but amazingly patches well for 2.2.13pre14, and even works, however needs >>append = "hde=4650,255,63...<< to be added to lilo.conf for the IBM drives)
- patched with Tom Livingstons IDE-SMP ide.c patch

However, as far as I can see, the IDE+SMP lockup problems donīt seem to be specific to this configuration.

Until now, the machine works well. It continuously getīs stressed by a test program, banging on the raid partition formed of the four IBM 37GB drives, with 30 processes read/write concurrently. For many hours now. Still no problem. As it seems, Tomīs patch solves the lockup problems.

More remarks:

Alan:
Tomīs patch might eventually be something for 2.2.13 final. Itīs risky, but if ifdefīd for SMP only, it at least doesnīt affect non-SMP people, and it may make SMP users with several drives (or controllers?) very happy.
Or, alternatively: a big fat kernel boot warning if SMP and several drives are detected.
For large drives: Andriesī patch is large, risky, and obviously not for general consumption. Probably another fat ugly boot warning might be useful for people trying to get very large/37GB drives working.

Andre:
I vote to include Tomīs patch into your Unified IDE. It at least makes me (SMP-wise) very happy.

Tom:
the server is still banging on itīs drives, without crashing. Thanks a lot for your miraculous patch!

If anyone else needs the patches required for a bleeding edge configuration similar to mine: mail me and Iīll send them.

About the system configuration:
kernel dmesg (more info available if needed):

Linux version 2.2.13pre14 (root@rival) (gcc driver version egcs-2.91.66 19990314/Linux (egcs-1.1.2 release) executing gcc version 2.7.2.3) #6 SMP Sat Oct 2 16:35:05 MEST 1999
Intel MultiProcessor Specification v1.1
Virtual Wire compatibility mode.
OEM ID: OEM00000 Product ID: PROD00000000 APIC at: 0xFEE00000
Processor #0 Pentium(tm) Pro APIC version 17
Processor #1 Pentium(tm) Pro APIC version 17
I/O APIC #2 Version 17 at 0xFEC00000.
Processors: 2
mapped APIC to ffffe000 (fee00000)
mapped IOAPIC to ffffd000 (fec00000)
Detected 448804430 Hz processor.
ide_setup: hde=4560,255,63
ide_setup: hdf=4560,255,63
ide_setup: hdi=4560,255,63
ide_setup: hdj=4560,255,63
Console: colour VGA+ 80x50
Calibrating delay loop... 447.28 BogoMIPS
Memory: 776436k/786368k available (1252k kernel code, 420k reserved, 8188k data, 72k init)
Pentium-III serial number disabled.
Checking 386/387 coupling... OK, FPU using exception 16 error reporting.
Checking 'hlt' instruction... OK.
POSIX conformance testing by UNIFIX
mtrr: v1.35a (19990819) Richard Gooch (rgooch@atnf.csiro.au)
Pentium-III serial number disabled.
per-CPU timeslice cutoff: 100.17 usecs.
CPU0: Intel Pentium III (Katmai) stepping 03
calibrating APIC timer ...
..... CPU clock speed is 448.8218 MHz.
..... system bus clock speed is 99.7380 MHz.
Booting processor 1 eip 2000
Calibrating delay loop... 447.28 BogoMIPS
Pentium-III serial number disabled.
OK.
CPU1: Intel Pentium III (Katmai) stepping 03
Total of 2 processors activated (894.57 BogoMIPS).
enabling symmetric IO mode... ...done.
ENABLING IO-APIC IRQs
init IO_APIC IRQs
IO-APIC pin 0, 18, 19, 20, 21, 22, 23 not connected.
number of MP IRQ sources: 21.
number of IO-APIC registers: 24.
testing the IO APIC.......................
.... register #00: 02000000
....... : physical APIC id: 02
.... register #01: 00170011
....... : max redirection entries: 0017
....... : IO APIC version: 0011
.... register #02: 00000000
....... : arbitration: 00
.... IRQ redirection table:
NR Log Phy Mask Trig IRR Pol Stat Dest Deli Vect:
00 000 00 1 0 0 0 0 0 0 00
01 000 00 0 0 0 0 0 1 1 59
02 0FF 0F 0 0 0 0 0 1 1 51
03 000 00 0 0 0 0 0 1 1 61
04 000 00 0 0 0 0 0 1 1 69
05 000 00 0 0 0 0 0 1 1 71
06 000 00 0 0 0 0 0 1 1 79
07 000 00 0 0 0 0 0 1 1 81
08 000 00 0 0 0 0 0 1 1 89
09 000 00 0 0 0 0 0 1 1 91
0a 000 00 0 0 0 0 0 1 1 99
0b 000 00 0 0 0 0 0 1 1 A1
0c 000 00 0 0 0 0 0 1 1 A9
0d 000 00 1 0 0 0 0 0 0 00
0e 000 00 0 0 0 0 0 1 1 B1
0f 000 00 0 0 0 0 0 1 1 B9
10 0FF 0F 1 1 0 1 0 1 1 C1
11 0FF 0F 1 1 0 1 0 1 1 C9
12 000 00 1 0 0 0 0 0 0 00
13 000 00 1 0 0 0 0 0 0 00
14 000 00 1 0 0 0 0 0 0 00
15 000 00 1 0 0 0 0 0 0 00
16 000 00 1 0 0 0 0 0 0 00
17 000 00 1 0 0 0 0 0 0 00
IRQ to pin mappings:
IRQ0 -> 2
IRQ1 -> 1
IRQ3 -> 3
IRQ4 -> 4
IRQ5 -> 5
IRQ6 -> 6
IRQ7 -> 7
IRQ8 -> 8
IRQ9 -> 9
IRQ10 -> 10
IRQ11 -> 11
IRQ12 -> 12
IRQ13 -> 13
IRQ14 -> 14
IRQ15 -> 15
IRQ16 -> 16
IRQ17 -> 17
.................................... done.
PCI: PCI BIOS revision 2.10 entry at 0xfb1f0
PCI: Using configuration type 1
PCI: Probing PCI hardware
PCI->APIC IRQ transform: (B0,I8,P0) -> 16
PCI->APIC IRQ transform: (B0,I9,P0) -> 17
PCI->APIC IRQ transform: (B0,I12,P0) -> 16
PCI->APIC IRQ transform: (B1,I0,P0) -> 16
Linux NET4.0 for Linux 2.2
Based upon Swansea University Computer Society NET3.039
NET4: Unix domain sockets 1.0 for Linux NET4.0.
NET4: Linux TCP/IP 1.0 for NET4.0
IP Protocols: ICMP, UDP, TCP
Initializing RT netlink socket
Starting kswapd v 1.5
Detected PS/2 Mouse Port.
Serial driver version 4.27 with no serial options enabled
ttyS00 at 0x03f8 (irq = 4) is a 16550A
pty: 256 Unix98 ptys configured
Real Time Clock Driver v1.09
Uniform Multi-Platform E-IDE driver Revision: 6.20
PIIX4: IDE controller on PCI bus 00 dev 39
PIIX4: not 100% native mode: will probe irqs later
ide0: BM-DMA at 0xf000-0xf007, BIOS settings: hda:pio, hdb:pio
ide1: BM-DMA at 0xf008-0xf00f, BIOS settings: hdc:pio, hdd:pio
PDC20262: IDE controller on PCI bus 00 dev 40
PDC20262: not 100% native mode: will probe irqs later
PDC20262: ROM enabled at 0xe7000000
PDC20262: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
ide2: BM-DMA at 0xb400-0xb407, BIOS settings: hde:pio, hdf:pio
ide3: BM-DMA at 0xb408-0xb40f, BIOS settings: hdg:pio, hdh:pio
PDC20262: IDE controller on PCI bus 00 dev 48
PDC20262: not 100% native mode: will probe irqs later
PDC20262: ROM enabled at 0xe8000000
PDC20262: (U)DMA Burst Bit ENABLED Primary PCI Mode Secondary PCI Mode.
ide4: BM-DMA at 0xc800-0xc807, BIOS settings: hdi:pio, hdj:pio
ide5: BM-DMA at 0xc808-0xc80f, BIOS settings: hdk:pio, hdl:pio
hda: IBM-DJNA-352500, ATA DISK drive
hde: IBM-DPTA-353750, ATA DISK drive
hdf: IBM-DPTA-353750, ATA DISK drive
hdi: IBM-DPTA-353750, ATA DISK drive
hdj: IBM-DPTA-353750, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
ide2 at 0xa400-0xa407,0xa802 on irq 16
ide4 at 0xb800-0xb807,0xbc02 on irq 17
hda: IBM-DJNA-352500, 24405MB w/1966kB Cache, CHS=3111/255/63
hde: IBM-DPTA-353750, 35772MB w/1961kB Cache, CHS=4560/255/63
hdf: IBM-DPTA-353750, 35772MB w/1961kB Cache, CHS=4560/255/63
hdi: IBM-DPTA-353750, 35772MB w/1961kB Cache, CHS=4560/255/63
hdj: IBM-DPTA-353750, 35772MB w/1961kB Cache, CHS=4560/255/63
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
md driver 0.90.0 MAX_MD_DEVS=256, MAX_REAL=12
translucent personality registered
linear personality registered
raid0 personality registered
raid1 personality registered
raid5 personality registered
raid5: measuring checksumming speed
raid5: MMX detected, trying high-speed MMX checksum routines
pII_mmx : 1004.697 MB/sec
p5_mmx : 1062.609 MB/sec
8regs : 771.906 MB/sec
32regs : 471.678 MB/sec
using fastest function: p5_mmx (1062.609 MB/sec)
scsi : 0 hosts.
scsi : detected total.
eth0: Intel EtherExpress Pro 10/100 at 0xcc00, 00:90:27:75:B0:5E, IRQ 16.
Receiver lock-up bug exists -- enabling work-around.
Board assembly 721383-006, Physical connectors present: RJ45
Primary interface chip i82555 PHY #1.
General self-test: passed.
Serial sub-system self-test: passed.
Internal registers self-test: passed.
ROM checksum self-test: passed (0x04f4518b).
md.c: sizeof(mdp_super_t) = 4096
Partition check:
hda: hda1 hda2 hda3
hde: hde1
hdf: hdf1
hdi: hdi1
hdj: hdj1
VFS: Mounted root (ext2 filesystem) readonly.
Freeing unused kernel memory: 72k freed
Adding Swap: 1028152k swap-space (priority -1)

PS: The machine is still running. Letīs bang on it a little more.

--
the online community service for gamers & friends -  http://www.rivalnet.com
* unterstützt über 50 PC-Spiele im Multiplayer-Modus
* Dateien senden & empfangen bis 500 MB am Stück
* Newsgroups, Mail, Chat & mehr

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/