I have made some experiments on what seems to be a die hard kernel bug.
I run the following program on my 2.1.78 linux alpha:
#include <stdio.h>
#include <syslog.h>
int main()
{
openlog("uk", LOG_PID | LOG_PERROR, LOG_USER);
syslog(LOG_ERR, "coucou");
return 0;
}
(I run it as a normal user)
Then the kernel dumps this (on the console and in the messages):
Jan 13 14:40:51 orion kernel: 00003f4e24>] (fffffc00003f4efc)
Jan 13 14:40:51 orion kernel: uk: Exception at [<fffffc00003b15bc>] (fffffc00003b15c4)
Jan 13 14:40:51 orion kernel: uk: Exception at [<fffffc00003f4e24>] (fffffc00003f4efc)
Jan 13 14:40:51 orion kernel: uk: Exception at [<fffffc00003b15bc>] (fffffc00003b15c4)
Jan 13 14:40:51 orion kernel: uk: Exception at [<fffffc00003f4e24>] (fffffc00003f4efc)
[... many lines ...]
Until the ^C I typed to stop the program find its way to the process.
There are some variations in the adresses, but they all point to
the write_chan() function, in drivers/char/n_tty.c, or to __copy_user,
in arch/alpha/lib/copy_user.S. What find its way to the stderr is a text
looking like this:
<11>Jan 13 14:40:12 uk[5317]: coucou± ¨FUU¨FUUĐAUUAUU>UU˙˙˙˙˙˙˙˙˙˙˙˙Ř+ P;U
... and many such buggy characters that scratch my xterm. If I redirect
the stderr to a file, I get a 21712 bytes long file but only these messages
and no quasi-crash:
Jan 13 15:09:16 orion kernel: uk: Exception at [<fffffc00003f4e40>] (fffffc00003f4efc)
Jan 13 15:09:16 orion kernel: uk: Exception at [<fffffc00003f4e40>] (fffffc00003f4efc)
Jan 13 15:09:16 orion uk[5568]: coucou
from /System.map:
fffffc00003b1408 t write_chan
fffffc00003b16f0 t normal_poll
(at 0xfffffc00003b15bc there is a "ldq_u t5,0(s2)" instruction)
and:
fffffc00003f4dc0 T __copy_user
fffffc00003f4f74 T __do_clear_user
(at 0xfffffc00003f4e24 there is a "ldq_u t2,0(t6)" instruction)
I ran also a strace on the program and here is what I get:
[... standard ld.so initialisation ...]
getxpid() = 4957
writev(2, [{"\0<11>Jan 13 13:08:47 uk[4957]: "..., 37}, {"\n", 1}], 2) = 24574
--- SIGINT (Interrupt) ---
+++ killed by SIGINT +++
Shouldn't writev return a number between -1 and 38 ? What does this 24574
means ?
If I strace the redirected to a file version, I get the following:
getxpid() = 5568
writev(2, [{"\0<11>Jan 13 15:09:16 uk[5568]: "..., 37}, {"\n", 1}], 2) = 21712
sigaction(SIGPIPE, {0x155556faca8, [], 0}, {SIG_DFL}) = 0
socket(PF_UNIX, SOCK_DGRAM, 0) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
connect(4, {sun_family=AF_UNIX, sun_path="/dev/log"}, 16) = -1 EPROTOTYPE (Proto
col wrong type for socket)
close(4) = 0
socket(PF_UNIX, SOCK_STREAM, 0) = 4
fcntl(4, F_SETFD, FD_CLOEXEC) = 0
connect(4, {sun_family=AF_UNIX, sun_path="/dev/log"}, 16) = 0
send(4, "<11>Jan 13 15:09:16 uk[5568]: co"..., 37, 0) = 37
sigaction(SIGPIPE, {SIG_DFL}, NULL) = 0
exit(0) = ?
We see the 21712 number, corresponding to the size of the file I get. But
what in the heck happens ????
It would be really great if someone more competent than me would consider
this, and hopefully fix it.
--Thomas Pornin
PS: I have tried 2.1.x kernel since version 2.1.65 on my alpha and I always
got those "Exception at" loops on time to time, which is annoying. I guess
one of my daemons (probably the portmapper) executes sometimes some syslog()
that triggers the bug.