I asked for some further feedback from Dave Madden (author of
the patch) and here is his response. It appears that he has
quite a bit of production experience with the patch. I believe
that James Laferriere was going to do some testing with the
patch as well.
Larry Blunk
Merit
Feedback from Dave Madden <dhm@paradigm.webvision.com>:
---------------------------------------------------------------
Well, it's been a while since I fiddled with this, but here's what I
remember. BTW, I have the patch installed in all our Linux systems
that have more than one interface (about 5 of them); they're operating
in an environment with 5 class C addresses subnetted into 20 or 30
subnets (most of these are 3- or 4-bit networks at the far end of a
frame relay PVC). All our routers (Ciscos and Proteons) and many of
our machines (SGI, Sun, and Linux) are running OSPF.
First of all, I should say that gated-3.5 and 3.6 are fairly difficult
to compile under Linux. I've gotten both of them running from stock
sources, but had to do an unnerving amount of header file hackery to
make the compile work. Also, I don't know how well any protocols other
than OSPF work. The easiest way to get gated running is to use the patches
from RedHat. They take care of most of the problems, but again, I
haven't looked at anything except OSPF.
The symptom that ultimately resulted in my kernel patch was that gated
would establish adjacencies with its neighbors, exchange routing
updates, and attempt to install the routes it received (all this
according to the gated log messages), but sometimes not all of the routes
would make it into the kernel table.
I chased this down through the kernel with printk() statements and
ended up at the loop my patch affects. The problem seemed to be that
gated would install host routes to neighboring routers, and then when
it attempted to install routes where those neighbors were the next
hop, the original logic would look for a way to reach the next-hop
address, find the host route, notice the GATEWAY flag, and return
without installing the new route. I presume the intent was to avoid
installing routes where the next-hop router was itself not directly
reachable, but the effect was to reject routes when the host route to
the next hop happened to appear in the routing hash table before the
network route to the next hop's network.
Incidentally, this may only become apparent when there are enough
routing changes (links up & down) that the kernel's routing table hash
chains get some exercise and the host route gets in front of the
network route.
I hope that makes sense; perhaps an example will help?
Consider a machine with an interface on 1.2.3.0/ffffff00. Its address
is 1.2.3.1, and it becomes neighbors with a router at 1.2.3.2. The
kernel routing table contains:
    1.2.3.0/ffffff00   interface-name   # network route through the interface
    1.2.3.2/ffffffff   router-name      # host route to neighbor; GATEWAY flag
                                        # is set.
Now, gated wants to add a route to network A.B.C.D, where the next hop
is 1.2.3.2. If the kernel's add-route code encounters the routes in
the order shown above, the route to A.B.C.D will be added. If,
however, the order of encounter is reversed, the add-route code will
reject the route because it thinks the router is not directly
connected. My patch causes the search for an appropriate direct
connection to continue to the end of the hash chain.
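
The ordering dependence described above can be modelled in a few lines.
This is only a sketch, not the kernel's add-route code: the Route class,
the helper names, and the flag value are hypothetical, chosen to mirror
the example, and the hash chain is reduced to a plain list.

```python
from dataclasses import dataclass

# Illustrative flag value; stands in for the GATEWAY flag in the example.
RTF_GATEWAY = 0x2


@dataclass
class Route:
    dest: int       # destination address as a 32-bit integer
    mask: int       # netmask as a 32-bit integer
    flags: int = 0  # RTF_GATEWAY set on the host route to the neighbor


def ip(dotted):
    """Convert a dotted-quad string to a 32-bit integer."""
    a, b, c, d = (int(part) for part in dotted.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d


def matches(route, addr):
    """True if addr falls inside the route's destination/mask."""
    return (addr & route.mask) == route.dest


def next_hop_ok_original(chain, gateway):
    """Model of the short-circuit: decide on the first matching route."""
    for route in chain:
        if matches(route, gateway):
            # A match with the GATEWAY flag is treated as "not directly
            # connected" and the search stops here.
            return not (route.flags & RTF_GATEWAY)
    return False


def next_hop_ok_patched(chain, gateway):
    """Model of the patched behavior: keep searching to the end of the
    chain for any matching route that is a direct connection."""
    for route in chain:
        if matches(route, gateway) and not (route.flags & RTF_GATEWAY):
            return True
    return False
```

With the two routes from the example, the original check accepts the
next hop 1.2.3.2 only when the network route comes first in the chain,
while the patched check accepts it in either order.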
* * *
Perhaps a better approach would be to not set the GATEWAY flag on host
routes; but that would likely have wider consequences than what I do,
which is to eliminate a short-circuit return from the add-route
routine.
If you have any other questions, please let me know. My news feed is
down at the moment, and I'm actually out of the country as well, so I
can't promise to respond immediately, but I would like to see this get
fixed in the mainline kernel so I don't have to keep patching whenever
I upgrade.
d.