Re: 2.1.91pre2 stable?

F Harvell (fharvell@fts.net)
Mon, 30 Mar 1998 20:23:58 -0500


On Sun, 29 Mar 1998 12:14:53 PST, Linus Torvalds wrote:
>
> On Sun, 29 Mar 1998, Bill Broadhurst wrote:
> >
> > What I use to test this is a complete rebuild of the entire X11 suite
> > which takes over an hour on this machine. I run it on a standard VT
> > without anything else running (except top but it crashes without top
> > running) and no X.
> >
> > If the processes are shorter the system seems fine unless I really
> > load the machine. I can reproduce the same thing by repeatedly
> > running kernel builds on 6 VT's at the same time but it takes longer
> > that way. I just ran 30 consecutive builds on two VT's without any
> > trouble.
> >
> > This has been happening since 2.1.88 at least. I don't remember
> > seeing it before that but I went from around .72 or so right to .88 in
> > a single jump.
> >
> > How do I find out what causes this?
>
> Could you please pinpoint which release this started in, that would
> certainly help a _lot_. By using a reasonable binary-search kind of
> algorithm it shouldn't take all that long if you can indeed reproduce it
> fairly easily and reliably.
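
The binary search suggested above could be sketched roughly as follows.
This is only an illustration: first_bad_release and is_bad are
hypothetical names, and is_bad stands for "build and boot that release,
run the reproduction workload, and report whether it hung". It assumes
the oldest release in the list is good, the newest is bad, and the
problem does not come and go between releases, which may not hold for
an intermittent hang.

```python
from typing import Callable, Sequence


def first_bad_release(releases: Sequence[str],
                      is_bad: Callable[[str], bool]) -> str:
    """Return the earliest release for which is_bad() is true.

    releases must be ordered oldest to newest. releases[0] is assumed
    to be known good and releases[-1] known bad; neither is re-tested.
    """
    if len(releases) < 2:
        raise ValueError("need at least one good and one bad release")

    lo = 0                   # index of the newest release known to be good
    hi = len(releases) - 1   # index of the oldest release known to be bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(releases[mid]):
            hi = mid         # failure reproduces: culprit is at mid or earlier
        else:
            lo = mid         # no failure: culprit is later than mid
    return releases[hi]
```

For the range mentioned in the quoted message (good around 2.1.72, bad
at 2.1.88), this needs four test kernels rather than fifteen. Because
the hang is not guaranteed to show up on every run, a "good" verdict
should only be recorded after the workload has run long enough to be
convincing.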

For me, the system became unusable at 2.1.85. Before that kernel
(I am currently running 2.1.84) I had seen processes stuck in the
"D" state, but only rarely. With the 2.1.85 kernel, just running my
nightly dump would leave processes stuck. The problem appears to be
exacerbated by the performance increases provided by Ingo's IO-APIC
code starting in 2.1.85. Note: the first kernel in which I saw a
stuck "D" process was 2.1.36, during an mkisofs run. (Apparently it
got stuck while incorporating a large file, approximately 30MB if
memory serves. When I removed the file, I was able to get the
mkisofs run to complete.)

The stuck processes appear to occur most frequently when there is
significant SCSI I/O within the process. I am able to readily
recreate the problem by performing a level 0 dump to SCSI tape of a
partition and doing an fsck on another unmounted partition. Either
the fsck or the dump will hang. Note that over time, especially with
the 2.1.85 and 2.1.86 kernels, I have seen other processes hang as
well. It is especially bad when update (bdflush) hangs: the system
goes into a slow spiral and eventually crashes.
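
For anyone trying to reproduce this, a small script can list the
processes currently in the "D" (uninterruptible sleep) state by reading
/proc/<pid>/stat. The following is a minimal sketch; parse_stat and
stuck_processes are hypothetical helper names, not part of any existing
tool. A process sitting in "D" for a moment during normal I/O is
expected, so a single sample proves nothing; a process is only suspect
if it shows up in sample after sample.

```python
import os


def parse_stat(text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or ')', so split on the last ')' rather than on whitespace.
    """
    left, right = text.rsplit(")", 1)
    pid_str, comm = left.split("(", 1)
    state = right.split()[0]
    return int(pid_str), comm, state


def stuck_processes(proc_root="/proc"):
    """Return (pid, comm) for every process currently in state 'D'."""
    stuck = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not as expected
        if state == "D":
            stuck.append((pid, comm))
    return stuck
```

Calling stuck_processes() every few seconds while the dump and fsck
are running, and noting which PIDs never leave the list, gives a rough
record of what hung and when.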

I tried the 2.1.91 kernel on Saturday, hoping the new SCSI spinlock
code would help, and still experienced the problem. I tried to
rebuild the kernel with the __asm__ "cpuid" hack but, on reboot, the
system hung hard. It took me over an hour to fsck my partitions.
I'm skating the razor's edge when the system crashes as, even with
the 2.1.84 kernel, there is a very good likelihood that the boot fsck
on an unclean partition will hang. It took 4 reboots to make it
through last time! I didn't have the time (or heart ;) to try to
reboot the modified kernel again.

My system is on the internet and I am more than willing to let
someone more knowledgeable than I onto the box to investigate. BTW, my
system has the tape drive on a BusLogic 946 and two Micropolis 8.7G
drives in 10 md striped partitions on a second BusLogic 958.

Some other possibly relevant information: I have received messages
from others saying that a) upgrading the Adaptec driver helped their
problem (for a change, the Adaptec appears to be the better card ;)
and b) turning off read-ahead on the drives helped. I have not tried
either, as a) I have BusLogic cards, and b) I'm not really sure how to
turn off read-ahead.

If there is anything I can do to help, please let me know. I _really_
want to start using Ingo's code.
