Re: new processes very slow on otherwise responsive system (2.1.127 and 128)

Michael H. Warfield (mhw@wittsend.com)
Sun, 15 Nov 1998 11:18:38 -0500 (EST)


Andrea Arcangeli enscribed thusly:
> On Sun, 15 Nov 1998, Simon Kirby wrote:

> >I just tried 2.1.129pre1 + arca-19 (less the lp stuff that got rejected),
> >and I was able to replicate this problem with the kernel compiled to UP.

> You mean that SMP is safe and UP is buggy? I am using SMP over UP machine.
> Do I need to recompile UP?

YES! Definitely recompile UP.

Check out the various related threads (there are now about 8 or so).
Everyone is reporting this problem on SMP disabled builds and several people
have reported "fixing" the problem by going to an SMP build on a
uniprocessor. So far, not a single person has reported this problem on an
SMP build, even on uniprocessor systems. Right now, it looks like I
can confirm that an SMP build on a uniprocessor is behaving but the
non-SMP build is not.

Ever since 2.1.127pre2 (UP, which was stable for me) and 2.1.127pre3
(which could not be compiled UP because gcc segfaulted in sched.c), I have
been experiencing what some are now calling the 127/128 flu after uptimes
that ran anywhere from 2 hours to 22 hours. The symptoms are always the
same. Some sort of activity which would normally cause a brief but
pronounced spike in the load average instead causes the load average to
go berserk (>15). Top shows processes clocking CPU time, but 95% or more
of the time is "system". You have to wait 10-20 minutes to start a new
shell.

I've been playing process of elimination for days (ever since
2.1.127pre7) trying to isolate the source of the problem. It finally
seems to boil down to something in the SMP disabled build possibly related
to the scheduler (common theory) or interrupts (Alan Cox's thought).
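
For anyone chasing this the same way, a number for the "system" share
that top reports makes it easier to compare one build against another.
The following is only a minimal sketch, not something posted in these
threads. It assumes the aggregate "cpu" line of /proc/stat lists user,
nice, system and idle ticks in that order (any further fields are
ignored), and the function names are made up for illustration.

```python
def parse_cpu_line(stat_text):
    """Return (user, nice, system, idle) ticks from the aggregate 'cpu' line."""
    for line in stat_text.splitlines():
        fields = line.split()
        # The aggregate line is labelled exactly "cpu"; per-CPU lines are "cpu0", ...
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(x) for x in fields[1:5])
            return user, nice, system, idle
    raise ValueError("no aggregate 'cpu' line found")


def system_share(before_text, after_text):
    """Fraction of elapsed CPU ticks spent in system mode between two samples."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    deltas = [a - b for a, b in zip(after, before)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return deltas[2] / total  # index 2 is the system column


if __name__ == "__main__":
    import time

    # Take two samples five seconds apart (interval chosen arbitrarily).
    with open("/proc/stat") as f:
        first = f.read()
    time.sleep(5)
    with open("/proc/stat") as f:
        second = f.read()
    print("system share: %.0f%%" % (100 * system_share(first, second)))
```

A share near 0.95 during a stall would match the symptom described
above; a healthy idle box should sit far below that.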

I think I can narrow down WHEN it got introduced
to somewhere between 2.1.127pre2 and 2.1.127pre7. As I noted above,
2.1.127pre3 could not be compiled with SMP disabled because gcc would
segfault in sched.c even though there was no code difference in sched.c
between pre2 and pre3. That made me think it was something bizarre in
a macro from a header somewhere. I didn't get pre4, pre5, or pre6 but
pre7 once again compiled with SMP disabled (but maybe not correctly?).
That's when the "flu" showed up. It's been in every SMP-disabled build
I've done since.
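
The narrowing above can be written down as a small sketch. This is only
an illustration of the reasoning, under the assumption that a version is
either known good, known bad, or unknown (never fetched, or would not
build). The function name and the result table are hypothetical.

```python
def suspect_window(versions, results):
    """Return (last_good, first_bad) from ordered versions and known results.

    results maps a version to True (behaves) or False (shows the problem).
    Versions missing from it were never tested or could not be built.
    """
    first_bad_index = None
    for i, version in enumerate(versions):
        if results.get(version) is False:
            first_bad_index = i
            break
    if first_bad_index is None:
        raise ValueError("no known-bad version")

    last_good = None
    for version in versions[:first_bad_index]:
        if results.get(version) is True:
            last_good = version
    if last_good is None:
        raise ValueError("no known-good version before the first bad one")
    return last_good, versions[first_bad_index]


if __name__ == "__main__":
    versions = ["2.1.127pre%d" % n for n in range(2, 8)]
    # pre3 would not build UP and pre4..pre6 were never tried: all unknown.
    results = {"2.1.127pre2": True, "2.1.127pre7": False}
    print(suspect_window(versions, results))
```

Any UP result for pre4, pre5 or pre6 added to the table would shrink
the window.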

On the system that has been crashing most frequently, I finally
built a 2.1.128 kernel with SMP enabled. It has now been up almost 24
hours. Longest up time since 2.1.127pre3. I would hate to make a complete
declaration of success yet, but I'm getting more and more optimistic...
I have even deliberately run the load average over 10 and it has recovered
within seconds every time...
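
Since the most visible symptom is how long a new process takes to
start, one way to watch for recovery is to time a trivial child
process. This is a minimal sketch and not a tool from this thread. The
function name is hypothetical, and the default child (the Python
interpreter running an empty statement) is an arbitrary choice.

```python
import subprocess
import sys
import time


def spawn_latency(argv=None):
    """Seconds taken to start a trivial child process and wait for it to exit."""
    if argv is None:
        # Assumption: any short-lived command will do; this one is always present.
        argv = [sys.executable, "-c", "pass"]
    start = time.monotonic()
    subprocess.run(argv, check=True)
    return time.monotonic() - start


if __name__ == "__main__":
    # Print one measurement per second for a short while.
    for _ in range(5):
        print("new process took %.3f s" % spawn_latency())
        time.sleep(1)
```

On a healthy system the figure should stay well under a second; during
a stall like the one described earlier it would climb into minutes.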

My personal WAG (Wild Ass Guess) is that something got changed
from 2.1.127pre2 to 2.1.127pre3 that blew gcc up when compiling with
SMP commented out. It got "fixed" between 2.1.127pre3 and 2.1.127pre7
either deliberately or incidentally. I suspect that it didn't get
fixed "right". It compiled once again, but something is still broken.

> >It just happened 30 seconds ago, actually, and lasted for about 3 minutes.
> >During this time, I filled every console I have with current IP values,
> >which I will be looking at shortly to see if I can see where the loop is.

> Note in arca-19 you can also press CTRL+scroll lock and the PC value will
> be the same of wchan in ps lax. This way even if you can' t execute `ps'
> you can get where stalled processes are stalling.

> >BTW...I triggered this with:
> >
> >find usr -type f -print | tmp/fork-test 2
>
> Very well.
>
> >("fork-test" posted on linux-kernel not so long ago)
>
> I just saved it infact.
>
> >And it started to freeze up shortly after 11800 children.

> Good.

> Andrea Arcangeli

Mike

-- 
 Michael H. Warfield    |  (770) 985-6132   |  mhw@WittsEnd.com
  (The Mad Wizard)      |  (770) 925-8248   |  http://www.wittsend.com/mhw/
  NIC whois:  MHW9      |  An optimist believes we live in the best of all
 PGP Key: 0xDF1DD471    |  possible worlds.  A pessimist is sure of it!
