Debugging internal looping in kernel

Paul Gortmaker (gpg109@rsphy6.anu.edu.au)
Thu, 13 Jun 1996 16:45:25 +1000 (EST)


I have been trying to find what appears to be an internal kernel loop
that happens quite rarely. I think I first saw it around pre-11, and
it still happens with v2.00. What happens is while running X and
NetScrape over CSLIP, suddenly the mouse will stop tracking, and the
load will instantaneously shoot up to about 2 or 2.5. This always
happens when using the RHS scroll bar during or shortly after a
www page (usually from www.linux.ncm.com) comes in.

After about 30 secs, you will see the mouse pointer move slowly
around a bit, corresponding to your input from about 30 seconds ago,
and then things crawl to a stop completely.

After anywhere from 1 to 3 minutes, whatever is looping stops, and
the system always comes back to normal. No kernel printk's -- nothing.
While it is in this state, you can ping the machine, and even connect
to it (telnet gives you the connected to message and that is it) so the
ethercard is still taking interrupts, and net_bh() is run. However,
no disk activity seems to take place while it is in this state.

[this is 2.00, a.out -m486, gcc-2.7.2, SCSI root fs via BusLogic, /home on
IDE disk, no swap presently in use, XF86_S3 v3.1.2 16bpp, NetScrape 2.0]

This seems to happen about once per 2 days, and only while using NetScrape
over CSLIP. It has only happened about 5 times so far in total. The first
few times, I tried to switch out of X so I could hit <Alt><Scroll> and
get the EIP, but it would only switch out after the looping stopped,
leaving you with the EIP of sys_idle :-(

Not to be outdone, I hacked calc_load() to set a global flag once
the load was > 1, and hacked keyboard_interrupt() to call show_ptregs()
if this flag was set, effectively allowing me to do the <Alt><Scroll>
under X just by hitting any key once the load was > 1. I also added
"jiffies" to the output of show_ptregs() so I could assign a timescale
to events.

With this in place, I tapped a key once every second or so the next
today when it happened, and was able to get a dump of a few EIPs. Here
is a snippet (I lost some because I guess syslog couldn't wake up and
store them to disk -- I will be ready with a 0.5MB "dmesg" buffer
next time B-) Note that 0x111b68 to c67 is update_wall_time_one_tick.

-------------------------------------------------------------
EIP: 0010:[<00111c48>] EFLAGS: 00000246
EAX: 0002bf20 EBX: 00000000 ECX: 01299d58 EDX: 00000000
ESI: 01299d58 EDI: 00000021 EBP: 0174befc DS: 0018 ES: 0018 FS: 002b GS: 002b
Timer Count: 21739696

EIP: 0010:[<00111b69>] EFLAGS: 00000206
EAX: 0002bf20 EBX: 0000001d ECX: 01299d58 EDX: 00000000
ESI: 01299d58 EDI: 00000021 EBP: 0174bf08 DS: 0018 ES: 0018 FS: 002b GS: 002b
Timer Count: 21739706

EIP: 0010:[<00111c06>] EFLAGS: 00000202
EAX: 00030d40 EBX: 00000000 ECX: 01299d58 EDX: 00000000
ESI: 01299d58 EDI: 00000021 EBP: 0174befc DS: 0018 ES: 0018 FS: 002b GS: 002b
Timer Count: 21739716
-------------------------------------------------------------

There are lots of these, and all 10 jiffies apart, which is suspect, as
I was not generating keyboard events every 0.1 seconds. Most of the EIP
are in update_wall_time_one_tick as seen below.

-------------------------------------------------------------
foobar# grep EIP /tmp/loop-log | wc
59 531 4189
foobar# grep EIP /tmp/loop-log | grep 00111[bc] | wc
46 414 3266
foobar#
-------------------------------------------------------------

I am not sure that update_wall_time is the problem, instead of a symptom.
If something bad happened that caused lost_ticks to become rather large
(as opposed to the expected value of 1) then update_wall_time_one_tick
will be called a whole *bunch* of times in a row.

Looking at the asm, you see that it is called with ticks in %ebx, and
it remains in %ebx until getting clobbered by the 3rd instruction of
update_wall_time_one_tick. Applying this information to the logs for
EIP >= 0x111b68 & < 0x111b6a, and I see lost_ticks (%ebx) values of 0x1d,
0x1b, 0x0d, 0x17, 0x0f, 0x07, 0x71, and 0x67. I'd say 0x71 is a lot.

I'd suspect that something is leaving interrupts off for a long time,
and as soon as they are enabled, the timer interrupt happens (since
IRQ 0 has higher priority than keybd on IRQ 1). Then my keyboard interrupt
gets serviced, and show_ptregs() catches the timer_bh trying to clean up
the mess that it found with lots of lost ticks. Thus I suspect my attempt
at catching the culprit with show_ptregs() was doomed to fail.

Hrrm, debugging intermittent brain-seizures is such fun. :-(

Paul.