Re: [PATCH net-next] ibmveth: v1 calculate correct gso_size and set gso_type

From: Brian King
Date: Wed Nov 09 2016 - 13:13:13 EST


On 11/06/2016 03:22 PM, Jonathan Maxwell wrote:
> On Thu, Nov 3, 2016 at 8:40 AM, Brian King <brking@xxxxxxxxxxxxxxxxxx> wrote:
>> On 10/27/2016 10:26 AM, Eric Dumazet wrote:
>>> On Wed, 2016-10-26 at 11:09 +1100, Jon Maxwell wrote:
>>>> We recently encountered a bug where a few customers using ibmveth on the
>>>> same LPAR hit an issue where a TCP session hung when large receive was
>>>> enabled. Closer analysis revealed that the session was stuck because one
>>>> side was advertising a zero window repeatedly.
>>>>
>>>> We narrowed this down to the fact that the ibmveth driver did not set
>>>> gso_size, which is translated by TCP into the MSS further up the stack.
>>>> The MSS is used to calculate the TCP window size, and as it was
>>>> abnormally large, the stack was calculating a zero window, even though
>>>> the socket's receive buffer was completely empty.
>>>>
>>>> We were able to reproduce this and worked with IBM to fix this. Thanks Tom
>>>> and Marcelo for all your help and review on this.
>>>>
>>>> The patch fixes both our internal reproduction tests and our customers' tests.
>>>>
>>>> Signed-off-by: Jon Maxwell <jmaxwell37@xxxxxxxxx>
>>>> ---
>>>> drivers/net/ethernet/ibm/ibmveth.c | 20 ++++++++++++++++++++
>>>> 1 file changed, 20 insertions(+)
>>>>
>>>> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
>>>> index 29c05d0..c51717e 100644
>>>> --- a/drivers/net/ethernet/ibm/ibmveth.c
>>>> +++ b/drivers/net/ethernet/ibm/ibmveth.c
>>>> @@ -1182,6 +1182,8 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>> int frames_processed = 0;
>>>> unsigned long lpar_rc;
>>>> struct iphdr *iph;
>>>> + bool large_packet = 0;
>>>> + u16 hdr_len = ETH_HLEN + sizeof(struct tcphdr);
>>>>
>>>> restart_poll:
>>>> while (frames_processed < budget) {
>>>> @@ -1236,10 +1238,28 @@ static int ibmveth_poll(struct napi_struct *napi, int budget)
>>>> iph->check = 0;
>>>> iph->check = ip_fast_csum((unsigned char *)iph, iph->ihl);
>>>> adapter->rx_large_packets++;
>>>> + large_packet = 1;
>>>> }
>>>> }
>>>> }
>>>>
>>>> + if (skb->len > netdev->mtu) {
>>>> + iph = (struct iphdr *)skb->data;
>>>> + if (be16_to_cpu(skb->protocol) == ETH_P_IP &&
>>>> + iph->protocol == IPPROTO_TCP) {
>>>> + hdr_len += sizeof(struct iphdr);
>>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
>>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> + } else if (be16_to_cpu(skb->protocol) == ETH_P_IPV6 &&
>>>> + iph->protocol == IPPROTO_TCP) {
>>>> + hdr_len += sizeof(struct ipv6hdr);
>>>> + skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;
>>>> + skb_shinfo(skb)->gso_size = netdev->mtu - hdr_len;
>>>> + }
>>>> + if (!large_packet)
>>>> + adapter->rx_large_packets++;
>>>> + }
>>>> +
>>>>
>>>
>>> This might break forwarding and PMTU discovery.
>>>
>>> You force gso_size to the device MTU, regardless of the real MSS used by
>>> the TCP sender.
>>>
>>> Don't you have the MSS provided in the RX descriptor, instead of guessing
>>> the value?
>>
>> We've had some further discussions on this with the Virtual I/O Server (VIOS)
>> development team. The large receive aggregation in the VIOS (AIX-based) is
>> actually being done by software in the VIOS. What they may be able to do,
>> when performing this aggregation, is look at the packet lengths of all the
>> packets being aggregated, take the largest packet size within the
>> aggregation unit minus the header length, and return that to the virtual
>> ethernet client, which we could then stuff into gso_size. They are currently
>> assessing how feasible this would be and whether it would impact other bits
>> of the code. However, assuming this does end up being an option,
>> would this address the concerns here or is that going to break something else I'm
>> not thinking of?
>
> I was discussing this with a colleague, and although this is better than
> what we have so far, we wonder if there could be a corner case where it
> ends up with a smaller value than the current MSS. For example, if the
> application sent a burst of small TCP packets with the PUSH bit set,
> they may not be coalesced by GRO.

Would that be a big problem though? Other than a performance degradation
in this specific case, do you see a functional issue with this approach?
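
For what it's worth, as I understand the original failure, the direction of
the error is what matters here. With gso_size unset, the MSS the receiver
ends up working with is on the order of the whole aggregated frame, and
since the advertised window is kept to a multiple of the MSS, it gets
rounded down to zero. Rough worked example (numbers made up): treat a ~60KB
aggregated skb as the MSS with, say, 48KB of free receive space; 48KB holds
zero whole 60KB "segments", so a zero window is advertised even though the
buffer is empty, which matches the hang described in the changelog. An MSS
estimate that is somewhat too small just means the window is advertised in
smaller units, so I'd expect a throughput cost at worst, not a stall.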

> The VIOS could probably be coded to detect that condition and use the
> previous MSS, but that may not necessarily be the current MSS.
>
> The ibmveth driver passes the MSS via gso_size to the VIOS, either as the
> 3rd argument of ibmveth_send() or via tcp_hdr(skb)->check, which is
> presumably overwritten when the VIOS does the TSO. Would it be possible
> to keep a copy of this value on the TX side in the VIOS before it is
> overwritten, then somehow pass it up to the RX side along with the frame
> and set gso_size to that value, which should be the current MSS?

That seems like it might be more difficult. Wouldn't that require the VIOS
to track the MSS on a per-connection basis, since the MSS might differ
based on the source/destination? I discussed this with VIOS development, and
they don't do any connection tracking where they'd be able to insert this.

Unless there is a functional issue with having the VIOS derive the MSS from
the size of the packets received off the wire, I think that might be our
best option...
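
Just to make that concrete, here is a rough sketch of what the receive path
could look like if the VIOS handed us an MSS value with each large-receive
buffer. This is only an illustration of the idea; ibmveth_rx_mss() and the
per-buffer field it would read are hypothetical, not existing VIOS or
driver interfaces:

	if (skb->len > netdev->mtu) {
		/* Hypothetical: MSS the VIOS reports for this aggregated
		 * buffer (largest segment seen on the wire, minus headers),
		 * rather than a value guessed from the device MTU.
		 */
		u16 mss = ibmveth_rx_mss(adapter);
		struct iphdr *iph = (struct iphdr *)skb->data;
		struct ipv6hdr *ip6h = (struct ipv6hdr *)skb->data;

		if (skb->protocol == htons(ETH_P_IP) &&
		    iph->protocol == IPPROTO_TCP)
			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV4;
		else if (skb->protocol == htons(ETH_P_IPV6) &&
			 ip6h->nexthdr == IPPROTO_TCP)
			skb_shinfo(skb)->gso_type = SKB_GSO_TCPV6;

		/* Only trust a non-zero MSS from the VIOS. */
		if (skb_shinfo(skb)->gso_type && mss)
			skb_shinfo(skb)->gso_size = mss;
	}

That way gso_size would reflect what the sender actually put on the wire
rather than a guess based on the MTU, which I'd hope addresses the
forwarding and PMTU discovery concern.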

Thanks,

Brian



>
> Regards
>
> Jon
>
>>
>> Unfortunately, I don't think we'd have a good way to get gso_segs set correctly,
>> as I don't see how that would get passed back up the interface.
>>
>> Thanks,
>>
>> Brian
>>
>>
>> --
>> Brian King
>> Power Linux I/O
>> IBM Linux Technology Center
>>
>


--
Brian King
Power Linux I/O
IBM Linux Technology Center