2.2.13pre15 SMP+IDE test summary

thx@rivalnet.de
Wed, 6 Oct 1999 15:43:22 +0200


This is the summary of my SMP+IDE tests, done with several kernels. The important ones however were all done with the same 2.2.13pre15 kernel binary.

The following with 2.2.13pre15 SMP kernel (for applied patches see below):

what didn´t work:
(1) dual P3 machine: kernel up for 10 mins, rm of two large files, result: NULL dereference in unlink() (see other message)
(2) dual P3 machine: stress test on all five drives (on three controllers), result: hard lockup after four hours (no oops etc.)

what works (I moved the 2 promise controllers and the 4 raid hds to a single CPU machine):
(3) dual P3 machine: stress test on onboard-controller-hda only: works for 15 hours, still running
(4) single P3 machine: exactly the same test as (2), but also works for 15 hours, still running

As mentioned, (1) to (4) with the same (SMP) kernel binary.

One more pre15 test:
2.2.13pre15 with Unified IDE 2.2.13pre14-19991003 (two rejects in ide.c, one ok, one probably harmless):
(5) dual P3 machine: NULL deref after 6 hours (i.e. this pre15 kernel survived longest)

Earlier test to check if the dual P3 hardware works reliable with non-SMP kernel:
(6) 2.2.13pre14+raid+large-disk+spinlock-patch-nonSMP: worked for 15 hours, then I stopped it

Earlier tests (all on the dual P3 machine):
(7) 2.2.13pre14+UnifIDE-13pre12-19990925+raid+large-disk+spinlock-patch: worked for 6 hours, then NULL oops
(8) 2.2.13pre14+raid+large-disk: hard lockup after 30 mins
(9) 2.2.13pre14+raid+large-disk+spinlock-patch: hard lockup after 2.5h

I can think of these possible reasons for the SMP problems:

(A) SMP race(s) in IDE driver in original 2.2.13pre15
(B) SMP-deadlock in raid-2.2.11-patch
(C) problem with large disk patch (unlikely)
(D) hardware issue (seems unlikely since all works well with non-SMP kernel or machine)

Anyway, I´ll finally use the single P3 machine for the raid and export it to the SMP machine via NFS.

Probably Red Hat or someone else with enough resources should do some intensive SMP+IDE+RAID testing. Red Hat ships raid anyway, and people are tempted to use IDE drives (think of the "i" in "raid"). If they do so on a SMP machine, this may be a problem.

Further remarks:
- the test program writes pseudo random data with multiple processes and verifies during read-back. In all tests there was no single data error.
- unlink(): I later added a remove() to the test program (after read+verify), but the unlink-oops didn´t occur again with 2.2.13pre15. The oops may not be directly unlink()-related (eventually to process startup/finish, I had a similar oops with an earlier kernel >=2.2.12+UnifiedIDE with many processes terminating, while unlink()s were in progress). This seems not related to multiple drives (assuming that mduponch@cisco.com only uses one hd)
- the oopses are attached to my earlier messages, if necessary I can send more info about hardware etc.

2.2.13pre15 kernel patches:
- raid 2.2.11
- large disk patch (for raid5 IBM 37GB drives)
- small cheat in pci.h to make kernel believe the promise66 controllers are promise33
- tasks.h: 4000
- compiled with gcc 2.7.2.3
- max files increased via proc (but there never were many files concurrently open)
- no unified ide patch for (1) to (4)

hardware:
- SMP, dual p3-450
- 2 x promise UDMA66
- 25GB drive as hda, on onboard controller as single master
- 37GB drives at hde, hdf, hdi, hdj

--
the online community service for gamers & friends -  http://www.rivalnet.com
* unterstützt über 50 PC-Spiele im Multiplayer-Modus
* Dateien senden & empfangen bis 500 MB am Stück
* Newsgroups, Mail, Chat & mehr

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/