Hang in fs/locks.c: posix_locks_deadlock() & posix_lock_file()

(ulmo@q.net)
25 Apr 1996 06:38:17 -0400


Problem: hung system (the only way out is hardware reset)
One possible fix: fix the FIXME (see way below near end)
Kernels: Starting in 1.3.84, still in 1.3.95
(1.3.70, 1.3.80 & 1.3.83 do not exhibit the same problem)
Reproducable: Yes
Environment: a) libc.so.5.3.9 or libc.so.5.2.18,
b) sendmail binary from Slackware 3.0.0 (presumably
compiled with libc-5.0.9 with either kernel 1.2.13 or
kernel 1.3.??)
c) heavy load (enough to cause swapping) such as X,
netscape, emacs -f gnus, xemacs -f w3, make zlilo,
ctlinnd renumber ''
d) Email messages coming in
Symtoms: System runs fine for a number of minutes, until system usage
goes up, and at some point it suddenly freezes:
e) Mouse doesn't respond
f) keyboard only responds to NUM LOCK and the special SCROLL
LOCK features
g) TCP sessions made to bound addresses while system in this state
will be connected but no further activity
h) Sendmail processes pile up before the hang
Clues: Note, I just learned of the ALT functions, so didn't do *nearly* as
much diagnostics as I would have had I known earlier [now rectified
as I reproduced it]):
i) right_ALT-SCROLL_LOCK on 1.3.94: (whitespace flexible)
1) EIP: 0010:[<0012ec24>] EFLAGS:00000206
EAX: 01136414 EBX:006b66d8 ECX:01136414 EDX:006a7018
ESI: 004a2f70 EDI:01136414 EBP:00000001 DS:0018 ES:0018 FS:002b GS:002b
continued pressing, only the following would change from
previous dump:

2) EIP ...ec41 EFLAGS ...287
3) EIP ..38 EFLAGS ...246
EAX..ce3f0c
4) EIP ..2d EFLAGS ...206
EAX..ce3f0c
examining the full range of EIP revealed:
(base 0010:[<0012ec__>]) ...14,1a,1e,24,26,2d,38,3b,41
Looking up 12ec?? in System.map reveals closest functions:

000000000012e940 t posix_locks_deadlock
000000000012e990 t flock_lock_file
000000000012eb7c t posix_lock_file <----------------
000000000012eea8 t locks_alloc_lock

j) right_CTRL-SCROLL_LOCK: Current process: sendmail -58 R
(the Slackware 3.0.0 version -- sendmail 8.6.12 compiled
presumably w/ libc5.0.9/etc)
k) right_SHIFT-SCROLL_LOCK: only total net buffers incremented
about as fast as packets would normally wander in (??), no other
activity noticed

l) Recompiling into sendmail-8.7.5 with a 5.2.18/5.3.9 libc
(forget which) seems to stop the crashes, so far (haven't
seen one yet).

m) when examining the system (after rebooting), system had some
of this error in the syslog:

Apr 24 14:23:41 Q facility=daemon(30) priority=info(6) ftpd[9513]: USER deeptht
Apr 24 14:23:40 Q facility=kern(0) priority=warn(4) priority=warning(4) kernel:
fcntl_setlk() called by process 9513 with broken flock() emulation
Apr 24 14:23:41 Q last message repeated 2 times

... repeated a number of times for many ftp sessions (note
my modified syslog which shows facility & priority).
Recompiling to wu-ftpd-2.4.2-beta-8 fixed that (but I can't
seem to get filename.extension conversions working despite
rebuiling the ~ftp/bin programs with "LDFLAGS=-static ...").

n) 1.3.84, the patch that introduces this bug in this particular
environment, is thankfully rather short; egrep '^---' patch-1.3.84
reveals only 19 lines all included here:

--- v1.3.83/linux/Documentation/digiboard.txt Thu Jan 1 02:00:00 1970
--- v1.3.83/linux/Makefile Wed Apr 3 16:06:54 1996
--- v1.3.83/linux/arch/alpha/defconfig Tue Apr 2 13:32:19 1996
--- v1.3.83/linux/arch/alpha/kernel/bios32.c Sun Mar 31 00:13:16 1996
--- v1.3.83/linux/drivers/char/pcxx.c Wed Apr 3 16:06:55 1996
--- v1.3.83/linux/drivers/net/3c509.c Tue Apr 2 13:32:20 1996
--- v1.3.83/linux/fs/locks.c Fri Feb 9 17:01:55 1996
--- v1.3.83/linux/include/asm-alpha/atomic.h Sun Mar 24 12:09:36 1996
--- v1.3.83/linux/include/asm-alpha/hwrpb.h Wed Apr 3 16:06:56 1996
--- v1.3.83/linux/include/asm-alpha/posix_types.h Wed Mar 27 08:19:29 1996
--- v1.3.83/linux/include/asm-i386/atomic.h Wed Mar 20 14:38:42 1996
--- v1.3.83/linux/include/asm-i386/system.h Wed Apr 3 16:06:57 1996
--- v1.3.83/linux/include/linux/skbuff.h Mon Mar 25 10:22:51 1996
--- v1.3.83/linux/kernel/sched.c Wed Apr 3 16:06:57 1996
--- v1.3.83/linux/mm/kmalloc.c Mon Mar 18 07:58:04 1996
--- v1.3.83/linux/mm/memory.c Wed Apr 3 16:06:57 1996
--- v1.3.83/linux/net/core/skbuff.c Tue Apr 2 13:32:24 1996
--- v1.3.83/linux/net/ipv4/tcp_input.c Wed Apr 3 16:06:57 1996
--- v1.3.83/linux/net/netsyms.c Sun Mar 31 00:13:19 1996

o) Brainstorming:

o.1) Since I only learned of right_ALT-SCROLL_LOCK immediately
before reproducing the error in 1.3.94 (where it was in
function posix_lock_file, fs/locks.c), and the patch
that introduced the bug was 1.3.84, presumably its
change to fs/locks.c is related, but this is still a
presumption; there are 10 kernel patches inbetween; an
egrep for fs/locks.c in those ten intervening patches
reveals .85, .86, and .89 which also make some
modifications; the crash (with the same symptoms) may
or may not be due to the same problem because of this,
but the likelyhood of it being different is low enough
that I'll bark down this path for now (egrep found:)

patch-1.3.85:diff -u --recursive --new-file v1.3.84/linux/fs/locks.c linux/fs/locks.c
patch-1.3.85:--- v1.3.84/linux/fs/locks.c Fri Apr 5 13:35:28 1996
patch-1.3.85:+++ linux/fs/locks.c Mon Apr 8 12:13:18 1996
patch-1.3.86:diff -u --recursive --new-file v1.3.85/linux/fs/locks.c linux/fs/locks.c
patch-1.3.86:--- v1.3.85/linux/fs/locks.c Mon Apr 8 19:01:45 1996
patch-1.3.86:+++ linux/fs/locks.c Wed Apr 10 08:41:08 1996
patch-1.3.89:diff -u --recursive --new-file v1.3.88/linux/fs/locks.c linux/fs/locks.c
patch-1.3.89:--- v1.3.88/linux/fs/locks.c Wed Apr 10 17:02:25 1996
patch-1.3.89:+++ linux/fs/locks.c Sun Apr 14 14:48:53 1996

o.2) The crashes happened roughly with the pattern of an email
for something coming in; I'm not sure exactly what
destination the mail had (my logs show 383 mailing list
to local USENET sendmail runs for 12 hours Apr 23rd
10am ET - 10pm, and in the same period 73 invocations
of my personal mail reception script mess which finds
an appropriate MH folder to deliver the email to;
they're all inefficient shell scripts -- this is
probably typical traffic). At that rate, that's 94
seconds between emails, which is about how long it
would take for the system to crash after loading it up
with activity. This is one possibility of the crashes.

o.3) I only noticed these problems after upgrading to various
system packages listed in the update
Documentation/Changes (such as ld.so, libc, binutils,
etc.) They did not surface immediately, however I did
become pretty good at reproducing the errors once I
tried to. Turning off most of the dragons (including
cron, sendmail, inn, a few other things) allowed me to
recompile the kernel 1.3.70 so I had a more solid base
from which to peg down what's going on. If sendmail
was the only culprit, this makes a lot of sense, as
does an interaction between sendmail and other
programs, however I find it curious that a thrashing
system (X w/netscape, kernel compile, etc.) increased
the chances of reproducing this quicker. A calm system
would do it too, after time.

o.4) This could be a gotcha of having incompatable libraries.
I consider it a bad kernel bug that it crashes the
kernel. I would not consider it as bad a kernel bug if
it merely crashed a process.

Further actions: I will attempt to reproduce this one more time
by installing the old Slackware 3.0.0 sendmail &
1.3.84, reproducing the error & tracking that, & then
trying 1.3.84 without its fs/locks.c patch. I don't
know when I'll get around to this, so I post this
now.

I lied. I tried it now. It definately is reproducable (just add
Slackware 3.0.0 sendmail binary), but while using what I thought to be
the "safe" 1.3.83, it hung again (I had 100 sendmail processes going
at once), but this time it was *temporary*. See below.

Results of further testing:

Methods I use to invoke the error quicker than ever include starting X
in the background, switching to a non-X virtual console & setting the
DISPLAY environment & starting a whole lot of big processes from there
(not sure if heavy load is even necessary to reproduce this particular
bug, but being on a text console is necessary to be able to see what's
going on in text mode with the SCROLL_LOCK combinations, and it
doesn't swap out the virtcon either), and then invoking a 100 or so
emails to myself at about the same time.

########################################################
#2 hang examined: 1.3.95:

rCTRL-SHIFTLOCK: current: sendmail -39 R

rALT-SHIFTLOCK: EIP: 0000012cdfc EFLAGS 00000202
1106018 00e3e414 01106018 01782258
5ebf20 bfffda30 014becc0 0018 0018 002b 002b

EIP range ... cd dc ec fc (only other changing EFLAGS)

000000000012cdc4 t posix_locks_deadlock <---------------
000000000012ce14 t flock_lock_file

########################################################
#3 hang examined: 1.3.84:

current process: sendmail -80 R

EIP 12c6{65...a3}:
000000000012c65c t posix_locks_deadlock <---------------
000000000012c6ac t flock_lock_file

EFLAGS 206, 202, 206, etc.
EAX was 00000 (!)

fsck on reboot found:
/var/spool/mqueue/dfXAA666T deleted/unused inode
^ (not sure about the ``T'', my writing is bad)

########################################################
#4 hang examined: 1.3.83:

current process: sendmail -61 R

eventually came back! I didn't have time to press rALT-SCROLL_LOCK :(

(afterwards it's in 1095e4 a lot ... sys_idle :) oh ok never mind :) )

Please note that 1.3.83 didn't hang permanently, whereas 1.3.84 & up did.

########################################################
#5:
My last attempt will be 1.3.84 without the fs/locks.c patch applied
(so, in other words, the 1.3.83 fs/locks.c):

I'm having a lot of trouble hanging this. I did get this message
printed a lot of times:

gdbm fatal: couldn't init cache.

I was sitting in these EIPs a lot while this was happening
and my machine was swapping to the moon & back:

1095e4 (109588-f7 sys_idle)
110063 (10fdf8-110067 schedule)
110045 (same as ^)
10ff6b (10fdf8-110068 schedule)
10ff7f (same as ^)

12:35am up 23 min, 3 users, load average: 0.29, 12.82, 20.33
161 processes: 160 sleeping, 1 running, 0 zombie, 0 stopped

Q:~# ps uax|egrep -i sendmail|wc
117 1291 8212

The sendmail processes still haven't gone away.

I killed the old sendmails & moved the new one back into place &
restarted the new sendmail. Six hours later, the machine is running
fully & reliably despite having gone through torture testing before
with the plagued sendmail & libc combination; fs/locks.c is clearly
a key to what's at fault here.

########################################################

Things I haven't tried:

(1) Reproducing the bug with libc5.0.9
(2) Compiling sendmail under libc5.0.9 & reproducing the bug as above
(3) Finding the spot in sendmail where it hangs
(4) Fiddling with fs/locks.c (this would be fun but I don't have the time)

Looking at (4), I decided to prove how much of a lyer I can be and
look at the file, and lo! Just what it says in the comments ...
"FIXME":

/* This function tests for deadlock condition before putting a process to
* sleep. The detection scheme is recursive... we may need a test to
* make it exit if the function gets stuck due to bad lock data. 4.4
* BSD uses a maximum depth of 50 for this.
*
* FIXME:
* IMHO this function is dangerous, deep recursion may result in kernel stack
* corruption. Perhaps we need to limit depth here.
* Dmitry Gorodchanin 09/02/96
*/
static int posix_locks_deadlock(struct task_struct *my_task,
struct task_struct *blocked_task)
{
[...]

Is there a way to autodetect "kernel stack corruption"?? Don't
processors these days have interrupt hooks, supposedly for routines
that will allocate more stack space & watch the thread carefully to
see if it is getting out of control? The last processor I programmed
was the 6502, which had a static stack space of $100 (256 bytes), but
I started to write routines that would do stack paging to make it
bigger on my multitasking system (imagine).

In partial summary:

Definately this is a buggy thing that seems to be a problem
pre-1.3.84, and just crashes the system post-1.3.83 (simpler :) )
Please read above for details (regarding libraries, functions,
programs & special combinations thereof).

There, I'm going to leave it at that & let someone else take it up
from here. I'm really quite tired of this (spent many dozen hours on
this) so want to get back to some real work & play, even though I was
happy to diagnose this one thing as far as I did for the benefit of
all as usual. I'll definately try out whatever ideas people have that
seem like the one (or two or more) true fix(es).

Bradley Allen <Ulmo@Q.Net>

P.S. This message was updated & resent, since the first one seems to
have found its way to vger's bit bucket. (Lesson: don't send long
subject lines via zmailer systems. Hmm, zmailer? Would that fix my
sendmail problems too? Or is zmailer prejudiced about messages
debugging its competition?)