1.3.63 clears my "Aiee: scheduling in interrupt" problem

Eric Schenk (schenk@cs.toronto.edu)
Fri, 16 Feb 1996 10:28:49 -0500


Brian Dowling (bdowling@tanelorn.ccs.neu.edu):
> For the record, I also no longer have this Aiee problem with 1.3.63.
> I've had my box up with this version for only about 8 hours, but no
> problems so far.

Just to confirm, I also no longer have this problem with 1.3.63
or 1.3.64. On the other hand, I just had 1.3.64 lock up solid:
even the caps lock and num lock keys on the keyboard stopped
responding, and there were no messages in the system logs.

> Does anyone know what the cause of this was? I was trying to probe into
> it at the time I noticed 1.3.63 was available, and I had started to
> make some debugging kernel mods to 1.3.63 before I compiled it --
> then once I had it up, I realized it wasn't a problem anymore. :)

As near as I was able to determine in the short time I spent trying
to track this down, it had something to do with the SOCK_PACKET
interface. This would explain the problems people had with both diald
and tcpdump. It may also explain the problems with samba, but I'm
not sure about that. Now that the problem has been fixed I won't be
looking into it much further, since I can just point people who have
mailed me about the problem at a more recent kernel...

Anyway, the main point of my message is to answer an implied
question about diald below:

> Curiously, one of the things I was trying to do was strace on diald,
> the crash, however, happened so fast and spewed so many errors that
> I couldn't trap any useful information. Now that 1.3.63 is stable, I
> can do this.
> For some unknown reason, when I telnet 'localhost', diald is
> processing all kinds of 'stuff' when my connections are active.
> I can see it reading it all on a socket. I haven't had time to
> look into this further yet, but that doesn't seem right, at least
> not for localhost connections.
> I have a localhost route to 'lo' (actually it's a -net 127.0.0.0,
> but either does the same).
> Perhaps this is just some /proc file, but a quick glance
> over diald.sources doesn't reveal what it could be.

What you are seeing is a result of diald using a SOCK_PACKET socket
to monitor the packets that cross the network interface it is
monitoring. On older versions of the Linux kernel (pre 1.3.X,
for some X > 40 or so; sorry, I don't remember when the change happened)
it was not possible to bind a SOCK_PACKET socket to a particular
network device. This meant that diald got handed EVERY packet
that was sent or received by the system. Diald then had to determine
which interface the packet came across, and react accordingly.
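
To illustrate the kind of filtering described above, here is a minimal
sketch in C. It is not diald's actual code. It relies on the fact that
recvfrom() on a SOCK_PACKET socket fills the sa_data field of the source
address with the name of the device the packet crossed. The function
names packet_is_from and read_packet_for are made up for this example.

```c
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Return 1 if the source address filled in by recvfrom() on a
 * SOCK_PACKET socket names the interface `ifname`, else 0.
 * (Hypothetical helper, not taken from diald.) */
int packet_is_from(const struct sockaddr *from, const char *ifname)
{
    assert(from != NULL && ifname != NULL);
    /* sa_data holds the NUL-padded device name, e.g. "ppp0" or "lo". */
    return strncmp(from->sa_data, ifname, sizeof(from->sa_data)) == 0;
}

/* Read from an unbound SOCK_PACKET socket until a packet from `ifname`
 * arrives; packets seen on other devices are discarded.
 * Returns the packet length, or -1 if recvfrom() fails. */
ssize_t read_packet_for(int fd, const char *ifname, void *buf, size_t len)
{
    for (;;) {
        struct sockaddr from;
        socklen_t fromlen = sizeof(from);
        ssize_t n = recvfrom(fd, buf, len, 0, &from, &fromlen);

        if (n < 0)
            return -1;
        if (packet_is_from(&from, ifname))
            return n;
    }
}
```

Every packet still gets copied to user space before it can be thrown
away, which is why telnet traffic to localhost shows up in an strace
of the monitoring process.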

As of diald-0.12, diald will attempt to bind its SOCK_PACKET
socket to a particular fixed interface so that this does not happen.
If you are using an old kernel (say 1.2.13) this will have no effect.
On recent 1.3.X kernels it should cut out diald receiving copies
of every packet. Now that I think about it, I should instrument
the code a bit and check to see that it is in fact working. :-)
I haven't really done much testing with the 1.3.X kernels, as they
tend to lock my machine after about 8 hours with no further symptoms,
and I don't have time right now to try to track down the causes.
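
For anyone who wants to experiment with the binding described above,
here is a rough sketch in C of binding a SOCK_PACKET socket to one
device. It is an illustration, not the code from diald. The helper
names fill_packet_addr and open_bound_packet_socket are made up, and
the choice of ETH_P_ALL as the protocol and AF_INET as the address
family are assumptions for the example. Opening the socket needs
root privileges.

```c
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>

/* Put the device name into a generic sockaddr, which is the form
 * bind() expects for a SOCK_PACKET socket.
 * (Hypothetical helper; names longer than 13 characters are truncated.) */
void fill_packet_addr(struct sockaddr *sa, const char *ifname)
{
    assert(sa != NULL && ifname != NULL);
    memset(sa, 0, sizeof(*sa));
    sa->sa_family = AF_INET;    /* assumption made for this sketch */
    strncpy(sa->sa_data, ifname, sizeof(sa->sa_data) - 1);
}

/* Open a SOCK_PACKET socket and try to bind it to `ifname`.
 * Returns the descriptor, or -1 if socket() or bind() fails. */
int open_bound_packet_socket(const char *ifname)
{
    struct sockaddr sa;
    int fd = socket(AF_INET, SOCK_PACKET, htons(ETH_P_ALL));

    if (fd < 0)
        return -1;

    fill_packet_addr(&sa, ifname);
    if (bind(fd, &sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A quick way to check whether the bind is taking effect would be to call
open_bound_packet_socket("ppp0"), generate some loopback traffic, and
see whether any of it turns up on the descriptor.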

-- eric

---------------------------------------------------------------------------
Eric Schenk schenk@cs.toronto.edu
Department of Computer Science, University of Toronto