2.4.0-test6 network socket problems

From: J. Scott Kasten (jsk@tetracon-eng.net)
Date: Fri Oct 13 2000 - 14:26:46 EST


I'm working with test6 on an embedded QED MIPS arch in big endian mode. I
have run into some bizarre socket problems that appear to affect both udp
and tcp transport. Applications actively using sockets (examples, ftp,
tftp, others...) will unexpectedly stop receiving data on the socket, even
though data is present. The process will be forever sleeping on the read
even though data is queued up. To illustrate my point, I've dug deep into
the udp code (net/ipv4/udp.c) and the datagram core (net/core/datagram.c)
researching the simple tftp example.

After much debugging, here is what I know:

I have followed the packets in from the network driver all the way to
udp_rcv() in udp.c. I see it do the sk lookup and drop it off in
udp_queue_rcv_skb(). Everything is fine on that end.

On the process end, I've been watching in the udp_recvmsg() function, also
in udp.c. Under normal operation, I see it pick up the data from the
correct skb and return. When the rare condition that causes failure
occurs, skb_recv_datagram() returns a NULL and err is set to -ERESTARTSYS.
It is only when the process gets hung on that socket that I see this
happen. It never revisits this portion of the code again, however, until
the sender stops transmitting data from ACK timeouts, I see the packets
continue to pile up on the udp_rcv() side without incident.

I further looked at datagram.c to see what the skb_recv_datagram() was
doing. It was spinning through the do {} while() loop waiting for
wait_for_packet() to hand it something. It is in that routine that the
error code is generated. The signal_pending() function returns true
and sock_intr_errno() returns the -ERESTARTSYS code, which gets passed
back down the chain from here.

The structure of the tftp code that I'm working with is such that it does
a generic blocking read() on the socket file handle and uses an alarm to
wake up when the critical timeout is reached. Not the most glorious code,
but demonstrates a problem none-the-less. The read() never returns an
EINTR or EAGAIN or anything. It's just hung. I'm assuming that the
signal_pending() return comes from my alarm(), which means that the
process had already been sitting on that socket for a while not seeing the
data that is clearly already present. Thus, there may be two problems
here, the signal not returning, and data trapped in the skb.

I would appreciate it if anyone more familiar with this code could point
me better to what I should be looking at, or at least explain what should
be happening that isn't.

TIA,

-S-

--

J. Scott Kasten Email: jsk AT tetracon-eng DOT net

"In most cases, all an argument proves is that two people were present......"

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sun Oct 15 2000 - 21:00:25 EST