The problem is that the function xen_netbk_count_skb_slots() returns two different counts for same type packets of same size (ICMP,3991). At the start of the test.. snip..wrong, which caused the RX ring to be erroneously declared full,
and the receive queue to be stopped. The problem shows up when two
guest running on the same server tries to communicates using large
The function name is xen_netbk_count_skb_slots() in net-next. ThisYes, I don't think that patchset was intended for prime time just yet.
appears to depend on the series in
<http://lists.xen.org/archives/html/xen-devel/2012-01/msg00982.html>.
Can this issue be reproduced without it?
It was based on 3.4, but the bug and work to fix this was done on top of
a 3.4 version of netback backported in a 3.0 kernel. Let me double check
whether there were some missing patches.
Just outside the context of this patch we separately count the fragint i, copy_off;The new version would be equivalent to:
count = DIV_ROUND_UP(
- offset_in_page(skb->data)+skb_headlen(skb), PAGE_SIZE);
+ offset_in_page(skb->data + skb_headlen(skb)), PAGE_SIZE);
count = offset_in_page(skb->data + skb_headlen(skb)) != 0;
which is not right, as netbk_gop_skb() will use one slot per page.
pages.
However I think you are right if skb->data covers > 1 page, since the
new version can only ever return 0 or 1. I expect this patch papers over
the underlying issue by not stopping often enough, rather than actually
fixing the underlying issue.
Ah, any thoughts? Have you guys seen this behavior as well?
The real problem is likely that you're not using the same condition toAgreed, it would be useful to see the argument for this patch presented
stop and wake the queue.
in that light. In particular the relationship between
xenvif_rx_schedulable() (used to wake queue) and
xen_netbk_must_stop_queue() (used to stop queue).
Do you have any debug patches to ... do open-heart surgery on the
rings of netback as its hitting the issues Adnan has found?
As it stands the description describes a setup which can repro the
problem but doesn't really analyse what actually happens, nor justify
the correctness of the fix.
Hm, Adnan - you dug in to this and you got tons of notes. Could you
describe what you saw that caused this?