reproducible bug in 2.2.19 & 2.4.x

From: Balazs Scheidler (bazsi@balabit.hu)
Date: Fri Sep 28 2001 - 06:01:38 EST


Hi,

We encountered SIGSEGV problems in our massively multithreaded application,
which we tracked down to a kernel issue. At least it seems to be a kernel
issue. I wrote a simple test program attached to this mail, which can be
used to reproduce this SIGSEGV.

How to reproduce it (this is what our application does):
  - each new TCP session starts a new thread
  - sessions sometimes break, causing a SIGPIPE to be delivered
  - each thread logs messages using syslog() (libc changes the SIGPIPE
    disposition before actually sending the message, so that a SIGPIPE
    cannot terminate the application)
  - SIGPIPE is set to SIG_IGN (it doesn't seem to matter whether I use
    SIG_IGN or an empty signal handler function)
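
The save/ignore/restore pattern described above for syslog can be
illustrated with a minimal Python sketch. This is an analogy only, not
libc's actual code: the function names are hypothetical, and Python only
allows signal.signal() in the main thread, so it shows the pattern for a
single thread rather than the concurrent case that crashes.

```python
import signal
import socket


def send_with_sigpipe_ignored(sock, data):
    """Send data with SIGPIPE temporarily ignored, then restore it.

    Hypothetical helper: mirrors the "change SIGPIPE settings around the
    send" behaviour the post attributes to libc's syslog().
    Must be called from the main thread (Python restriction).
    """
    # Save the previous disposition and ignore SIGPIPE for the send.
    old = signal.signal(signal.SIGPIPE, signal.SIG_IGN)
    try:
        sock.sendall(data)
        return True
    except ConnectionError:
        # With SIGPIPE ignored, a broken connection shows up as an
        # error return (EPIPE or ECONNRESET) instead of killing us.
        return False
    finally:
        # Put the caller's disposition back, whatever it was.
        signal.signal(signal.SIGPIPE, old)


def demo():
    """Write to a socket whose peer is already closed."""
    a, b = socket.socketpair()
    b.close()  # peer goes away, like a broken session
    before = signal.getsignal(signal.SIGPIPE)
    ok = send_with_sigpipe_ignored(a, b"log line\n")
    after = signal.getsignal(signal.SIGPIPE)
    a.close()
    return ok, before == after
```

Calling demo() from the main thread should report a failed send and an
unchanged SIGPIPE disposition afterwards. In the real application many
threads go through this window at the same time.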

The test program attached (stressthreads.c) opens a socket, binds to a port,
and starts listening for new connections. In each new thread, it sends a
syslog message, writes 1MB in 1024 byte chunks to the connection, closes the
connection and exits.
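
For readers without the attachment, the structure of the server can be
sketched roughly as follows. This is a Python approximation of what
stressthreads.c is described as doing, not the attached C program; the
function names and the max_sessions parameter are hypothetical, and a
Python version is not expected to reproduce the crash, since the
interpreter handles threads and signals differently from plain C.

```python
import socket
import syslog
import threading

CHUNK = b"x" * 1024        # one 1024-byte chunk
CHUNKS_PER_SESSION = 1024  # 1024 chunks * 1024 bytes = 1 MB


def make_listener(host="0.0.0.0", port=10000):
    """Open, bind and listen, as the post describes (0.0.0.0:10000)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))
    s.listen(128)
    return s


def handle_session(conn):
    """Per-thread work: log once, write 1 MB in chunks, close."""
    syslog.syslog("new session")
    try:
        for _ in range(CHUNKS_PER_SESSION):
            conn.sendall(CHUNK)
    except ConnectionError:
        # The client closed early; in C this is where SIGPIPE/EPIPE hits.
        pass
    finally:
        conn.close()


def serve(listener, max_sessions=None):
    """Accept connections and start one thread per session.

    max_sessions is a hypothetical knob so the loop can be stopped;
    None means serve forever.
    """
    handled = 0
    while max_sessions is None or handled < max_sessions:
        conn, _addr = listener.accept()
        threading.Thread(target=handle_session, args=(conn,),
                         daemon=True).start()
        handled += 1
```

A run would look like serve(make_listener()), with the client script
pointed at port 10000 of that host.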

Also attached is a Python script (test-zorp.py) that starts 100 threads,
each of which connects to the given IP/port, reads 1024 bytes of data, and
then closes the connection. This effectively causes a SIGPIPE in each
thread of stressthreads during its write call. To run the test, compile
stressthreads.c on one computer and run it; it will start listening on
0.0.0.0:10000. Then edit the end of test-zorp.py and set the IP address
of that first host in the script.

Start the script, and the other end will crash.
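
The client side described above can be sketched as follows. This is not
the attached test-zorp.py, only a minimal reconstruction of the behaviour
described for it; the function names, the results list and the timeout
are assumptions made for illustration.

```python
import socket
import threading


def one_session(host, port, results):
    """Connect, read up to 1024 bytes, then close the connection.

    Closing while the server is still writing is what makes the
    server-side write fail (SIGPIPE/EPIPE).
    """
    try:
        with socket.create_connection((host, port), timeout=10) as s:
            data = s.recv(1024)
            results.append(len(data))
    except OSError:
        results.append(-1)  # connection failed or was reset


def run_clients(host, port, n_threads=100):
    """Start n_threads sessions in parallel and wait for all of them."""
    results = []
    threads = [
        threading.Thread(target=one_session, args=(host, port, results))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Something like run_clients("192.0.2.1", 10000) would drive the test,
where 192.0.2.1 is a placeholder for the address of the machine running
stressthreads.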

The program segfaults within a second when run on our PIII 800 SMP
test system. The segfault did not occur on any of our non-SMP systems
(not even on the same PIII with a non-SMP kernel).

I added a backtrace function to my test program to show where it
aborts, and here's the result:

Signal (11) received, stackdump follows; eax='ffffffe0', ebx='0000000c', ecx='be5ff96c', edx='00000400', eip='00000001'
retaddr=0x1, ebp=0xbe5ff944
retaddr=0x804892a, ebp=0xbe5ffd74
retaddr=0x4001bc9f, ebp=0xbe5ffe34

The program _always_ aborted at eip=0x1.

The program didn't abort when I removed the syslog() function call from
the thread.

The results are the same for a 2.4.5 kernel (so I assume it affects later
kernel versions as well).

Ideas, solutions, any help welcome.

PS: I'm not subscribed, so please CC me the replies.

-- 
Bazsi
PGP info: KeyID 9AF8D0A9 Fingerprint CD27 CFB0 802C 0944 9CFD 804E C82C 8EB1


