On 1/18/2010 5:17 PM, Jarek Poplawski wrote:

Ok - one last update for a while... not sure what's next...

I put some printk's into sky2.c xmit logic - the packets are being sent to the card, and the i/o's are completing successfully. So it would seem either the switch is dropping the packets, or else the wifi router is. As tcpdump doesn't show the packets arriving on the wifi router, I'm leaning towards the switch.

I ran wireshark on the win7 box to see what is coming off the switch. I did notice one thing that's visible to the win7 box but is not showing up on the linux wireshark - before every successful dhcpoffer, there's an XID message broadcast from the device. I'm wondering why I don't see this on the linux side.

On Mon, Jan 18, 2010 at 11:08:14PM +0100, Jarek Poplawski wrote:
Well - no.... but I'm not sure that would show anything.
Btw, I wonder if you could test it skipping the (HP?) switch?
If so, then of course don't forget to try tcpdump on the router.
Jarek P.
Setup diagram:

Server -> gb switch -> (100mb) wifi router -> devices
             |
        Win7 PC (gb)

The problem does not occur (at least I haven't been able to recreate it) at 100Mb, and the wifi router doesn't do 1Gb. I drive the traffic from the win7 PC to the server. I've seen the loss when the only traffic going through the wifi router was ping & dhcp. I've also never seen any loss on a device directly attached to the 1Gb switch. I can drive load through the wifi router while driving load from the Win7 box, but don't see TX packet loss at all when not doing DHCP RELEASE/RENEW.
As there is no packet loss to devices not involved in the DHCP sequence through the same path, I'm not really sure that the Gb switch is implicated.
As I don't have a standalone sniffer, I'm thinking that it might be easier to instrument places where the TX packet could be dropped and see at least whether it's getting to the card.
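A minimal sketch of one userspace way to check for TX drops, as a complement to instrumenting the driver. It assumes a Linux host and the usual /proc/net/dev layout (two header lines, then eight receive and eight transmit counters per interface). The function names are hypothetical and nothing here comes from sky2.c. It only shows drops that the kernel itself counts, so a packet lost in the switch or the router would not appear.

```python
def parse_tx_counters(proc_net_dev_text):
    """Return {iface: {"tx_packets": int, "tx_errs": int, "tx_drop": int}}."""
    counters = {}
    # The first two lines of /proc/net/dev are column headers.
    for line in proc_net_dev_text.splitlines()[2:]:
        if ":" not in line:
            continue
        name, stats = line.split(":", 1)
        fields = stats.split()
        if len(fields) < 16:
            continue  # Unexpected layout; skip rather than guess.
        # Fields 0-7 are receive counters, 8-15 are transmit counters:
        # bytes, packets, errs, drop, fifo, colls, carrier, compressed.
        counters[name.strip()] = {
            "tx_packets": int(fields[9]),
            "tx_errs": int(fields[10]),
            "tx_drop": int(fields[11]),
        }
    return counters


def tx_delta(before, after):
    """Per-interface change in each TX counter between two snapshots."""
    return {
        iface: {key: after[iface][key] - before[iface][key] for key in after[iface]}
        for iface in after
        if iface in before
    }


def read_proc_net_dev(path="/proc/net/dev"):
    """Read the raw counter table (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Taking one snapshot with parse_tx_counters(read_proc_net_dev()) before a DHCP release/renew and one after, then passing both to tx_delta, would show whether tx_errs or tx_drop moved on the interface during the sequence.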
Given the circumstances of the TX drop, and that it was DHCP traffic while under load that caused the oops rectified with the two patches, I'm thinking that the packet loss is the current manifestation of whatever the underlying problem is. Given the extra hop required to break things, and given that a dhcp release/renew seems to trigger things, I keep coming back to arp logic as being somehow implicated.
If arp is somehow involved, then I'd expect to see manifestations under similar circumstances with other drivers. As the pskb_may_pull patch stopped the crash, perhaps other drivers do suffer packet loss and it's just not been widely noticed or attributed to the kernel - especially if the network topology is a factor. I do know people at large enterprises who have been complaining of what *could* be this same issue, however they're currently blaming their switch vendors. As most traffic is TCP, this is really only noticed by those few places deeply concerned with latency. It's likely something altogether different, but then again, maybe not.