Bug with ARP - request source address on wrong subnet

From: Richard Underwood (richard@aspectgroup.co.uk)
Date: Fri Aug 15 2003 - 07:03:52 EST


Hi,

I have a problem with ARP on Linux 2.4.20 (RedHat 2.4.20-18.8 if it
matters) which I believe to be a bug. While I'm willing to upgrade the
kernel, it appears to be a generic problem.

Our web servers are load-balanced via a Foundry ServerIron using DSR
(direct server return), which means the return path of the packets doesn't
go through the ServerIron. To allow this to work, the Linux servers have the
ServerIron's valid IP address on a loopback interface, and the ServerIron
routes packets to them instead of doing the usual address rewriting.

The relevant interfaces look like this:

eth0      Link encap:Ethernet  HWaddr 00:04:75:CA:C4:EF
          inet addr:10.10.10.14  Bcast:10.10.10.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1623551911 errors:0 dropped:0 overruns:1 frame:0
          TX packets:1575017402 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          RX bytes:2905003530 (2770.4 Mb)  TX bytes:3337437145 (3182.8 Mb)
          Interrupt:10 Base address:0x8400

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:355748 errors:0 dropped:0 overruns:0 frame:0
          TX packets:355748 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:237452671 (226.4 Mb)  TX bytes:237452671 (226.4 Mb)

lo:0      Link encap:Local Loopback
          inet addr:212.xxx.yyy.9  Mask:255.255.255.255
          UP LOOPBACK RUNNING  MTU:16436  Metric:1

The default gateway is 10.10.10.1.

All this works very well - except we have problems with ARP. After
shutting down the web server for a while, the load balancer sees it come
back up, but the web server can't route packets outbound at all.

Looking into it, the following demonstrates the problem:

# arp -d 10.10.10.1
# ping -I 212.xxx.yyy.9 eff.org
PING eff.org (209.237.229.14) from 212.xxx.yyy.9 : 56(84) bytes of data.
^C
# arp -a | grep 10.10.10.1
? (10.10.10.1) at <incomplete> on eth0

On eth0, we see:

11:23:55.650514 0:4:75:ca:c4:ef Broadcast arp 42: arp who-has 10.10.10.1 tell 212.xxx.yyy.9
                         0001 0800 0604 0001 0004 75ca c4ef d4xx
                         yy09 0000 0000 0000 0a0a 0a01
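
For anyone who wants to check the capture by hand, here is a minimal Python
sketch that splits the hex dump above into the ARP fields defined by RFC 826
for Ethernet/IPv4. The helper names (split_arp, dotted) are my own for
illustration, and the masked xx/yy octets are deliberately left masked
rather than guessed.

```python
# Field layout of an Ethernet/IPv4 ARP payload (RFC 826): (name, bytes).
ARP_FIELDS = [
    ("hardware_type", 2),
    ("protocol_type", 2),
    ("hardware_len", 1),
    ("protocol_len", 1),
    ("opcode", 2),
    ("sender_mac", 6),
    ("sender_ip", 4),
    ("target_mac", 6),
    ("target_ip", 4),
]


def split_arp(hex_dump):
    """Split a hex dump into ARP fields, keeping each field as a hex string."""
    digits = "".join(hex_dump.split())
    if len(digits) != 2 * sum(size for _, size in ARP_FIELDS):
        raise ValueError("expected a 28-byte Ethernet/IPv4 ARP payload")
    fields, pos = {}, 0
    for name, size in ARP_FIELDS:
        fields[name] = digits[pos:pos + 2 * size]
        pos += 2 * size
    return fields


def dotted(ip_hex):
    """Render 8 hex digits as a dotted quad; masked octets stay as written."""
    parts = []
    for i in range(0, 8, 2):
        octet = ip_hex[i:i + 2]
        try:
            parts.append(str(int(octet, 16)))
        except ValueError:
            parts.append(octet)  # e.g. "xx" or "yy" from the masked capture
    return ".".join(parts)


# The payload exactly as it appears in the capture.
DUMP = """
0001 0800 0604 0001 0004 75ca c4ef d4xx
yy09 0000 0000 0000 0a0a 0a01
"""

arp = split_arp(DUMP)
print("opcode    :", arp["opcode"])             # 0001 is an ARP request
print("sender MAC:", arp["sender_mac"])
print("sender IP :", dotted(arp["sender_ip"]))
print("target IP :", dotted(arp["target_ip"]))
```

The sender IP field decodes to the masked 212 address on lo:0, while the
target IP is the gateway on the 10.10.10.0/24 network.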

The <incomplete> ARP entry remains, blocking all access via the
default gateway. If I omit the -I 212.xxx.yyy.9 option, the ARP request
originates from 10.10.10.14 instead and everything works fine.

The problem only occurs after a period of inactivity, and only if the
first ARP request is due to traffic from the 212.xxx.yyy.9 address. Because
the incomplete ARP entry remains, traffic that would normally cause valid
ARP requests doesn't generate new requests, causing a complete loss of
connectivity.

As I understand it, sending an ARP request whose sender IP address (the
address the reply is sent to) isn't on the local subnet simply doesn't make
sense. Section A.3 of RFC 985 also suggests such packets should be dropped
by the next hop.
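
To make the subnet argument concrete, here is a small Python sketch of the
kind of check I would expect to apply to the source of an ARP request. It is
only an illustration, not what the kernel or the gateway actually does. The
function name plausible_arp_source is made up, and 192.0.2.9 is a
documentation address (RFC 5737) standing in for the masked 212.xxx.yyy.9.

```python
import ipaddress

# Taken from the ifconfig output for eth0 and the default gateway.
ETH0 = ipaddress.ip_interface("10.10.10.14/255.255.255.0")
GATEWAY = ipaddress.ip_address("10.10.10.1")

# Hypothetical stand-in for the masked address configured on lo:0.
LOOPBACK_ALIAS = ipaddress.ip_address("192.0.2.9")


def plausible_arp_source(source, target, interface):
    """True if the ARP sender and target IPs both lie on the interface's subnet."""
    return source in interface.network and target in interface.network


# Request sourced from eth0's own address: both ends on 10.10.10.0/24.
print(plausible_arp_source(ETH0.ip, GATEWAY, ETH0))

# Request sourced from the loopback alias: sender is off-subnet.
print(plausible_arp_source(LOOPBACK_ALIAS, GATEWAY, ETH0))
```

The first call prints True and the second prints False, which matches the
two cases above: the request from 10.10.10.14 gets answered, the one from
the loopback alias does not.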

The temporary solution is to add static ARP entries for the next
hop, which I will do - however, I believe this is a bug with the Linux
implementation of ARP and should be fixed.
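
For reference, a rough Python sketch of how the static entry could be
scripted. It only builds and prints the net-tools "arp -s" command and does
not run it. The MAC address below is a placeholder from the documentation
range (00:00:5e:00:53:xx), not the real address of our gateway, and the
function name is hypothetical.

```python
GATEWAY_IP = "10.10.10.1"

# Placeholder only: substitute the gateway's real hardware address.
GATEWAY_MAC = "00:00:5e:00:53:01"


def static_arp_command(ip, mac):
    """Build the argument list for a permanent ARP entry (net-tools syntax)."""
    return ["arp", "-s", ip, mac]


command = static_arp_command(GATEWAY_IP, GATEWAY_MAC)
print(" ".join(command))
```

The printed command would have to be run as root, for example from a boot
script, so that the entry exists before any traffic from the loopback
address can trigger an ARP request.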

Thanks,

Richard
-
To unsubscribe from this list: send the line "unsubscribe linux-net" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html