routing problem b/w 2.2 and PM3

Soren Harward (soren@cinternet.net)
Thu, 1 Jul 1999 17:14:38 -0400


A few months ago, I upgraded our company webserver (www.cinternet.net)
to 2.2. Soon thereafter, I started having some nasty routing troubles
between it and our portmasters (we have two PM3's). The web
server and the PM's share a physical Ethernet segment and a Class C
(206.112.217.0). Our dialup pools are assigned from blocks within
the 206 network (100 thru 149 and 150 thru 199). We also have a
second class C (209.50.110.0) running on the same segment, of which
addresses 10 thru 199 are bound to eth0 aliases on the web server,
so the web server actually has addresses on 2 different class C's
thru the same Ethernet device and the same physical connection. All our
routing is static. If it sounds messed up, that's because it is
(I inherited it, I didn't design it).

The main symptom of the problem is the inability to establish a network
connection between a dialup system and one of the virtual hosts (the 209
network) on the web server. It's been very rare and impossible to
duplicate, but in the few times it's happenned to me, I've been able to
determine that the problem is not that the data isn't reaching the web
server, but that it's not making it back. In the times I've gotten it
to happen to me, I've telnetted circuitously into the web server, and
run 'netstat -a'. Sure enough, I can see that the connection between my
dialup workstation and the virtual host is open, but it's stuck at
SYN_RECV stage [0]. So the web server is getting the TCP/SYN okay, but
the SYN/ACK isn't making it back to the dialup computer, presumably
because the web server doesn't know how to route it back (it has to use
the PM3 as a gateway). The instant the dialup establishes a
connection with www.cinternet.net (206.112.217.10), then the problem
disappears, and as long as the dialup host stays connected, it never
has any more trouble getting data from the 209.50.110 network.

It's taken me almost 2 months to figure out exactly what's been going
on, since the problem is rare (ie, it has no discernable pattern among
dialups, specific usernames, times of day, or phases of the moon -- I
had my witch doctor check) and thus far I haven't been able to duplicate
it at will. Has anyone else seen a problem like this? Does large
numbers of "ip_rt_advice: route from 206.112.217.XXX dropped" messages
in the web server kernel message log have anything to do with it? I
thought about hacking static routes from the web server thru the
portmaster to the dialups, but we have about a dozen static-IP customers
(not including the admins) who could end up behind either PM3, so that
fix won't work for them, and they have reported similar problems.
Right now, my inclination is to enable OSPF between our PM3's and our
Cisco 2500, but I'm not sure that that will work because it doesn't do
anything for the web server. And all this worked before the 2.2
upgrade, and I have yet to have someone have a problem with accessing
the rest of the 209.50.110 network (addresses 200 to 249) which are
bound to our NT server (also on the same Ethernet segment and also with
an address in the 206.112.217.0 network -- same as the web server). So
the problem right now seems to be the web server alone, and a kernel
problem with the web server to narrow it down. I'm basically at my
wits' end with a handful of upset customers [1] who see this a "my site
being down again" and won't follow the workaround instructions I've
given them [2]. Can anybody help? Will OSPF solve the problem, or
is it indeed something wrong with the web server? Do I need to do
something as drastic as running OSPF on the web server too (yikes!)?

[0] SYN_RECV or whatever the proper label is

[1] If you'd like to explain routing and class C's to someone who
doesn't understand how to use an FTP program to upload their webpages,
be my guest.

[2] I've told them to do one of 2 things: use the proxy server we have
running (on a different machine, but on the same Ethernet segment as the
web server and no addresses in the 209 block), or go to our
www.cinternet.net webpage first before trying to go to any hosted site.

-- 
 Soren Harward <soren@cinternet.net> | Windows DOES come with a tool
 http://www.soren.cinternet.net/     | to restore a corrupt Registry.
 Internet Information Systems Admin  |
 Cinternet, Inc.  (513) 891-1228     | It's called FDISK.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/