SHM -isn't- broken, something else is

From: david (david@killerlabs.com)
Date: Mon Mar 20 2000 - 20:36:26 EST


Ok, I'm getting nowhere fast. It isn't shm, but having shm enabled seemed to
exacerbate the real problem.

The problem is that processes get stuck in D state and unfortunately I can't
see what they are waiting on. ps hangs at the first broken process
when I run 'ps axwl'. sysrq shows all sorts of different EIPs, some of them
at 00000246, which is an odd place.
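When ps itself hangs, one possible workaround is to read the state letter straight out of /proc/<pid>/stat. The sketch below is only an illustration: the helper names stat_state and list_procs_in_state are made up here, and whether reading the stat file avoids the hang on an affected kernel is an assumption, not something verified.

```c
#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: return the state letter from one /proc/<pid>/stat
 * line ("pid (comm) state ppid ..."), or '?' if the line is malformed.
 * The command name may itself contain ')' so search from the right. */
char stat_state(const char *line)
{
    const char *p = strrchr(line, ')');
    if (p == NULL || p[1] != ' ' || p[2] == '\0')
        return '?';
    return p[2];
}

/* Hypothetical helper: print the pid of every process whose state letter
 * equals `wanted` (e.g. 'D'). Returns the number found, or -1 if /proc
 * cannot be opened. */
int list_procs_in_state(char wanted)
{
    DIR *d = opendir("/proc");
    if (d == NULL)
        return -1;

    struct dirent *e;
    int found = 0;
    while ((e = readdir(d)) != NULL) {
        if (!isdigit((unsigned char)e->d_name[0]))
            continue;                     /* not a pid directory */

        char path[288], line[512];
        snprintf(path, sizeof path, "/proc/%s/stat", e->d_name);
        FILE *f = fopen(path, "r");
        if (f == NULL)
            continue;                     /* process exited meanwhile */
        if (fgets(line, sizeof line, f) != NULL &&
            stat_state(line) == wanted) {
            printf("%s %c\n", e->d_name, wanted);
            found++;
        }
        fclose(f);
    }
    closedir(d);
    return found;
}
```

Calling list_procs_in_state('D') from a small main() would list the pids currently in uninterruptible sleep, without touching any other per-process file.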

It happens even with apache and proftpd stopped. I haven't found
the magic key yet but here's a clue. Up for 30 minutes, nothing happening.
I log in, run pine, and start to send this mail. *bam* in a few seconds my
watchdogd kicks off the kernel's watchdog reboot.

Note, composing the email is fine. Saving it to /tmp/pico.123456 is
what appears to have triggered it.
  
(side note, here's watchdogd.c)

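#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(void)
{
    /* Opening /dev/watchdog arms the kernel watchdog timer. */
    int fd = open("/dev/watchdog", O_WRONLY);
    if (fd == -1) {
        perror("watchdog");
        exit(1);
    }
    while (1) {
        sync();
        /* Snapshot the process list; if ps hangs, the watchdog fires. */
        system("/bin/ps awlx >/root/psawlx");
        /* Any write resets the watchdog timer. */
        write(fd, "\0", 1);
        sleep(20);
    }
}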

Note, /root/psawlx shows _nothing_ out of the ordinary. No D state, just S
and the one R of the ps process.
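One caveat with a snapshot file like this is that the last write before a lockup may never reach the disk, since sync() runs before ps rather than after it. A minimal sketch of one way to make each record durable follows, assuming the goal is simply to append a line and force it out with fsync(); the function name append_durably is hypothetical and not part of watchdogd.c.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: append `msg` to `path` and force it to disk before
 * returning. Returns 0 on success, -1 if any step fails. */
int append_durably(const char *path, const char *msg)
{
    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
    if (fd == -1)
        return -1;

    size_t len = strlen(msg);
    int rc = 0;
    if (write(fd, msg, len) != (ssize_t)len)
        rc = -1;                 /* short or failed write */
    if (fsync(fd) == -1)
        rc = -1;                 /* data may not be on disk */
    if (close(fd) == -1)
        rc = -1;
    return rc;
}
```

Something like append_durably("/root/trace.log", "about to save message\n") before each suspicious step would leave a trail that shows how far things got before the reboot; the path and message here are placeholders.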

As I send -this- email attempt, I am going to strace it and see if I can
catch something of a clue

how odd, note how "strace -f -p 213" exited when it shouldn't have

[...]
write(9, "QUIT\r\n", 6) = 6
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
read(10, "221 james.kalifornia.com closing"..., 4096) = 45
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
close(9) = 0
close(10) = 0
rt_sigaction(SIGALRM, {SIG_IGN}, {0x8158630, [], SA_RESTART|0x4000000}, 8) = 0
time(NULL) = 953579769
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x8157fc0, [], SA_RESTART|0x4000000}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({600, 0}, unknown pid: 259

James:~# ps auxw|grep pine
david 213 0.1 1.2 5012 2376 pts/0 T 11:13 0:00 pine

(awxl)
000 501 213 163 0 0 5012 2376 syscal T pts/0 0:00 pine

the machine is still running at this point

ok, it worked after -CONT, let's try another email.

perhaps it is the size of the body.

ok, it locked, watchdog snagged it and it's rebooting. here's the session,
not the same as before, note where things ended. now this could be
funkiness with strace timing or it could really be a broken kernel.

write(9, "QUIT\r\n", 6) = 6
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
rt_sigprocmask(SIG_BLOCK, [ALRM], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], [ALRM], 8) = 0
read(10, unknown pid: 364

my guess is something bad is happening and the process died or was killed in
some fashion by the kernel and it's leaving some other kernel function in a
bad state.

-d

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/


