RE: [PATCH net-next] hv_netvsc: don't make assumptions on struct flow_keys layout

From: KY Srinivasan
Date: Fri Jan 08 2016 - 13:01:42 EST




> -----Original Message-----
> From: John Fastabend [mailto:john.fastabend@xxxxxxxxx]
> Sent: Thursday, January 7, 2016 10:17 PM
> To: KY Srinivasan <kys@xxxxxxxxxxxxx>; Vitaly Kuznetsov
> <vkuznets@xxxxxxxxxx>; Simon Xiao <sixiao@xxxxxxxxxxxxx>; Eric Dumazet
> <eric.dumazet@xxxxxxxxx>
> Cc: Tom Herbert <tom@xxxxxxxxxxxxxxx>; netdev@xxxxxxxxxxxxxxx;
> Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>; devel@xxxxxxxxxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; David Miller <davem@xxxxxxxxxxxxx>
> Subject: Re: [PATCH net-next] hv_netvsc: don't make assumptions on struct
> flow_keys layout
>
> On 16-01-07 07:49 PM, KY Srinivasan wrote:
> >
> >
> >> -----Original Message-----
> >> From: John Fastabend [mailto:john.fastabend@xxxxxxxxx]
> >> Sent: Thursday, January 7, 2016 5:02 PM
> >> To: Vitaly Kuznetsov <vkuznets@xxxxxxxxxx>; Simon Xiao
> >> <sixiao@xxxxxxxxxxxxx>; Eric Dumazet <eric.dumazet@xxxxxxxxx>
> >> Cc: Tom Herbert <tom@xxxxxxxxxxxxxxx>; netdev@xxxxxxxxxxxxxxx; KY
> >> Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> <haiyangz@xxxxxxxxxxxxx>;
> >> devel@xxxxxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; David Miller
> >> <davem@xxxxxxxxxxxxx>
> >> Subject: Re: [PATCH net-next] hv_netvsc: don't make assumptions on
> struct
> >> flow_keys layout
> >>
> >> On 16-01-07 05:28 AM, Vitaly Kuznetsov wrote:
> >>> Eric Dumazet <eric.dumazet@xxxxxxxxx> writes:
> >>>
> >>>> On Thu, 2016-01-07 at 10:33 +0100, Vitaly Kuznetsov wrote:
> >>>>> Recent changes to 'struct flow_keys' (e.g commit d34af823ff40 ("net:
> Add
> >>>>> VLAN ID to flow_keys")) introduced a performance regression in
> netvsc
> >>>>> driver. Is problem is, however, not the above mentioned commit but
> the
> >>>>> fact that netvsc_set_hash() function did some assumptions on the
> struct
> >>>>> flow_keys data layout and this is wrong. We need to extract the data
> we
> >>>>> need (src/dst addresses and ports) after the dissect.
> >>>>>
> >>>>> The issue could also be solved in a completely different way: as
> suggested
> >>>>> by Eric instead of our own homegrown netvsc_set_hash() we could
> use
> >>>>> skb_get_hash() which does more or less the same. Unfortunately,
> the
> >>>>> testing done by Simon showed that Hyper-V hosts are not happy with
> our
> >>>>> Jenkins hash, selecting the output queue with the current algorithm
> based
> >>>>> on Toeplitz hash works significantly better.
> >>>>
> >>
> >> Also can I ask the maybe naive question. It looks like the hypervisor
> >> is populating some table via a mailbox msg and this is used to select
> >> the queues I guess with some sort of weighting function?
> >>
> >> What happens if you just remove select_queue altogether? Or maybe
> just
> >> what is this 16 entry table doing? How does this work on my larger
> >> systems with 64+ cores can I only use 16 cores? Sorry I really have
> >> no experience with hyperV and this got me curious.
> >
> > We will limit the number of VRSS channels to the number of CPUs in
> > a NUMA node. If the number of CPUs in a NUMA node exceeds 8, we
> > will only open up 8 VRSS channels. On the host side currently traffic
> > spreading is done in software and we have found that limiting to 8 CPUs
> > gives us the best throughput. In Windows Server 2016, we will be
> > distributing traffic on the host in hardware; the heuristics in the guest
> > may change.
> >
> > Regards,
> >
> > K. Y
>
> I think a better way to do this would be to query the numa node when
> the interface comes online via dev_to_node() and then use cpu_to_node()
> or create/find some better variant to get a list of cpus on the numa
> node.
>
> At this point you can use the xps mapping interface
> netif_set_xps_queue() to get the right queue to cpu binding. If you want
> to cap it to max 8 queues that works as well. I don't think there is
> any value to have more tx queues than number of cpus in use.
>
> If you do it this way all the normal mechanisms to setup queue mappings
> will work for users who are doing some special configuration and the
> default will still be what you want.
>
> I guess I should go do this numa mapping for ixgbe and friends now that
> I mention it. Last perf numbers I had showed cross numa affinitizing
> was pretty bad.
>
> Thanks,
> John
John,

I am little confused. In the guest, we need to first open the sub-channels (VRSS queues) based on what the host is offering. While we cannot open more sub-channels than what the host is offering, the guest can certainly open fewer sub-channels. I was describing the heuristics for how many sub-channels the guest currently opens. This is based on the NUMA topology presented to the guest and the number of VCPUs provisioned for the guest. The binding of VCPUs to the channels occur at the point of opening these channels.

Regards,

K. Y