Hello,
We are developing an advanced networking services loadable module and are
having problems porting it to work on 2.4.x kernels. The driver is supposed
to provide services such as fault tolerance, load balancing and link
aggregation over a team of network adapters. It works OK on 2.2.x kernels
but hangs on 2.4.x kernels.
To debug it, we stripped it down to a mere "intermediate" or "filter"
driver that binds to a base driver and passes everything through in both
directions (Rx, Tx, IOCTL, stats, etc.). After going through the basics of
modifying the driver to compile on 2.4.x kernels and fighting some nasty
deadlocks due to the new nature of the networking layer, we managed to get
it to run. The driver receives and transmits a few hundred thousand packets
(while a periodic timer expires 10 times a second and IOCTLs run
continuously), and then it causes an oops about not being able to handle a
page fault.
The function looks something like:
int iansHardStartXmit(struct sk_buff *skb, struct net_device *dev) {
    int res;
    struct net_device *base;

    spin_lock(&lock);
    base = get_base_driver_by_name(name);
    if (base != NULL) {
        res = base->hard_start_xmit(skb, base);
    }
    spin_unlock(&lock);
    return res;
}
We used kdb to track down the problem and got the following stack trace:
EBP EIP function(args)
0xc4cd1c54 0xd081e3e7 [e100]__kallsyms+0xb (0xc4b595a0, 0xc840f200)
    e100 __kallsyms 0xd081e3dc 0xd081e3dc 0xd0820dsc
0xd08244ba [ians]iansHardStartXmit+0xa6 (0xc4b595a0, 0xc4d9bc00)
    ians .text 0xd0824060 0xd0824414 0xd082452c
0xc01f9d1f qdisc_restart+0xcf (0xc4d9bc00)
    kernel .text 0xc0100000 0xc01f9c50 0xc01f9f14
*
*
*
This goes on and shows that this is an ICMP echo reply packet going down
through the IP stack to the filter driver (apparently 0xc4b595a0 is the skb,
0xc4d9bc00 is the *dev of the filter driver and 0xc840f200 is the *dev of
the base driver). The filter driver is supposed to call the
dev->hard_start_xmit of the base driver, but strangely it lands somewhere in
the data segment of the base driver (__kallsyms is a part of the symbol
table of the module according to insmod -m).
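As a cross-check on how we read the trace, each EIP should equal the start
of its symbol plus the printed offset. A minimal sketch, assuming the last
two addresses on each kdb detail line are the start and end of the symbol
(the struct and function names here are ours, for illustration only):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One frame of the trace, as we read it from the kdb output. */
struct frame {
    uint32_t eip;       /* address reported for the frame */
    uint32_t sym_start; /* assumed: start address of the symbol */
    uint32_t sym_end;   /* assumed: end address of the symbol */
    uint32_t offset;    /* the "+0x.." printed after the symbol name */
};

/* Returns 1 if eip == sym_start + offset and eip lies inside the symbol. */
static int frame_is_consistent(const struct frame *f)
{
    assert(f != NULL);
    return f->eip == f->sym_start + f->offset &&
           f->eip >= f->sym_start && f->eip < f->sym_end;
}
```

For the ians frame, 0xd0824414 + 0xa6 = 0xd08244ba, and for qdisc_restart,
0xc01f9c50 + 0xcf = 0xc01f9d1f, so both frames are consistent under that
assumption.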
Figuring the dev->hard_start_xmit pointer got trashed somehow, we added a
check to make sure the same pointer is always called, and the check
confirmed that the pointer never changes. Looking at the assembly code with
kdb, we could see that the call to the base driver is done by a
'call *%eax' instruction. kdb reports that eax=0xffffffff after the page
fault (origeax).
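For illustration, a check of this kind can be sketched in userspace roughly
as follows. This is not the actual driver code: the stand-in types and the
names saved_xmit, bind_base, checked_xmit and dummy_xmit are hypothetical,
and only the fields needed for the example are declared.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel types; only the fields used here. */
struct sk_buff { int len; };
struct net_device;
typedef int (*xmit_fn)(struct sk_buff *skb, struct net_device *dev);
struct net_device { xmit_fn hard_start_xmit; };

/* Hypothetical: pointer recorded when the filter binds to the base driver. */
static xmit_fn saved_xmit;

static void bind_base(struct net_device *base)
{
    assert(base != NULL);
    saved_xmit = base->hard_start_xmit;
}

/*
 * Calls the base driver only if its transmit pointer still matches the one
 * recorded at bind time; returns -1 without calling it otherwise.
 */
static int checked_xmit(struct sk_buff *skb, struct net_device *base)
{
    if (base == NULL || base->hard_start_xmit != saved_xmit)
        return -1;
    return base->hard_start_xmit(skb, base);
}

/* Dummy transmit routine standing in for the base driver. */
static int dummy_xmit(struct sk_buff *skb, struct net_device *dev)
{
    (void)dev;
    return skb->len;
}
```

A check like this only compares the pointer stored in the device structure.
It says nothing about the value that is actually in the register at the
time of the call.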
How is it possible that the pointer to the function keeps its value, but
the jump to that function lands somewhere else?
The entire function is protected by a spinlock, so we are not worried about
other threads corrupting our data.
We are using:
RedHat 6.2
gcc v2.91.66
modutils v2.3.11-1
kernel linux-2.4.0-test9
kdb v1.5-2.4.0-test9-pre9
Compaq ap500 dual p-III Xeon
Thanks,
Shmulik Hen
Software Engineer
Linux Advanced Networking Services
Network Communications Group, Israel (NCGj)
Intel Corporation Ltd.