Re: 3c59x NIC overruns with multicast makes networking freeze

From: Andrew Morton (andrewm@uow.edu.au)
Date: Fri Sep 22 2000 - 23:23:45 EST


Arnaud Installe wrote:
>
> Hi,
>
> Has anyone else seen a lot of overruns while serving multicast on a pretty
> loaded (60%) network, with 3c59x cards ?

This is the first I've heard of it. Which sort of 3Com card?

> (BTW What exactly are
> "overruns" ? Are they buffer overflows on the NIC side or buffer
> overflows on the kernel side, because the software can't follow, or even
> something completely different ?)

The NIC has an 8 kbyte on-board FIFO, some of which (3-5 kbytes) is used
for buffering incoming data.

An Rx overrun occurs when that on-board FIFO has filled up and more
data needs to go into it. It shouldn't fill up. When it does, that can be
caused by excessive PCI traffic, or by the software not being able to
feed the NIC new Rx buffers fast enough, or by the driver leaving the Rx
engine in the stalled state for too long.
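
To see whether overruns are still accumulating, the counter can be read
straight out of /proc/net/dev. This is a minimal sketch, assuming the
"overruns" figure shown by ifconfig corresponds to the receive "fifo"
column there (the fifth number after the interface name). rx_fifo_count
is a made-up helper name, not part of any existing tool.

```shell
# rx_fifo_count FILE IFACE
# Print the receive "fifo" counter for IFACE from a file in /proc/net/dev
# format. Assumption: this is the column ifconfig reports as "overruns".
rx_fifo_count() {
    # Strip the "iface:" prefix (the byte count may be glued to the colon),
    # then print the 5th receive field: bytes packets errs drop fifo.
    sed -n "s/^ *$2: *//p" "$1" | awk '{ print $5 }'
}
```

Running "rx_fifo_count /proc/net/dev eth0" a few seconds apart shows
whether the count is still climbing. eth0 is only an example name.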

You should try the 2.2.17 driver. There's a copy at
http://www.uow.edu.au/~andrewm/linux/3c59x-2.2.17+.gz

If it still happens (and it probably will) there are some things you can
look at:

- Get a faster computer! How quick is it?

- Change the driver by doubling the value of RX_RING_SIZE and rebuilding
it. Keep the value a power of two. This is a bit desperate, but it will
give you some more elasticity when the software falls behind on
refilling Rx buffers.

- Make sure the NIC isn't sharing an interrupt with another device.
Look at /proc/interrupts and if it's shared, have a poke in the BIOS.

- It's also possible that some other device on your system has a slow
interrupt service routine and is interrupting the 3c59x ISR for a long
time while the Rx engine is stalled. This is a pretty unlikely scenario,
but disabling interrupts around the call to boomerang_rx() would be a
slightly interesting experiment.
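
The shared-interrupt check from the list above can be scripted. This is
a minimal sketch, assuming the NIC appears in /proc/interrupts under its
interface name and that devices sharing a line are listed
comma-separated on that line. irq_is_shared is a hypothetical helper.

```shell
# irq_is_shared FILE IFACE
# Succeeds (exit status 0) if IFACE appears to share its interrupt line
# with another device, judging by a comma in the device list on the
# matching line of FILE (normally /proc/interrupts).
irq_is_shared() {
    grep -w -- "$2" "$1" | grep -q ','
}
```

For example: irq_is_shared /proc/interrupts eth0 && echo "eth0 shares its IRQ"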

> We're seeing a lot of overruns on a server which has to handle a lot of
> connections. Those connections are handled by a Java program using the
> IBM Java JDK 1.1.8 on a Linux 2.2.16 kernel.
> After a while the network freezes. I can't ping nor ssh to the machine.
> Restarting the network scripts solves the problem.

There was a bug in the driver which was fixed 6-8 weeks ago. With that
bug, if the driver's Rx path completely runs out of memory and fails to
allocate a new buffer 16 times in a row, the receive path gets stuck and
the interface has to be taken down and brought back up. Moving to the
2.2.17 driver should eliminate this possibility.

You should also try

        echo "512 1024 1536" > /proc/sys/vm/freepages

to double the amount of memory which is reserved for drivers and such.
This change put a big smile on the face of a 2.2.12 user earlier this
week. (He had 512 megs and a reasonable but heavy workload. I think
we're setting the default too low.)
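
Rather than hard-coding the numbers, the doubled values can be derived
from whatever the file currently holds. This is a minimal sketch,
assuming the file contains three whitespace-separated integers as in the
echo above. double_freepages is a hypothetical helper, and it only
prints the new values; it does not write them.

```shell
# double_freepages FILE
# Print the three values found on the first line of FILE, each doubled.
double_freepages() {
    # Accept a last line without a trailing newline; fail on an empty file.
    read -r fp_min fp_low fp_high < "$1" || [ -n "$fp_min" ] || return 1
    echo "$((fp_min * 2)) $((fp_low * 2)) $((fp_high * 2))"
}
```

For a file reading "256 512 768" this prints "512 1024 1536", the values
used above. The output can then be echoed into /proc/sys/vm/freepages
as root.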

Please let me know the outcome.


