RE: [PATCH 3/3] hv_netvsc: Implement VF matching based on serial numbers
From: Haiyang Zhang
Date: Fri Dec 09 2016 - 17:08:51 EST
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@xxxxxxxxxxxxxxxxxx]
> Sent: Friday, December 9, 2016 4:45 PM
> To: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> Cc: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>; KY Srinivasan
> <kys@xxxxxxxxxxxxx>; olaf@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> bjorn.helgaas@xxxxxxxxx; apw@xxxxxxxxxxxxx; devel@xxxxxxxxxxxxxxxxxxxxxx;
> leann.ogasawara@xxxxxxxxxxxxx; jasowang@xxxxxxxxxx
> Subject: Re: [PATCH 3/3] hv_netvsc: Implement VF matching based on
> serial numbers
>
> On Fri, 9 Dec 2016 21:31:25 +0000
> Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> wrote:
>
> > > -----Original Message-----
> > > From: Stephen Hemminger [mailto:stephen@xxxxxxxxxxxxxxxxxx]
> > > Sent: Friday, December 9, 2016 3:30 PM
> > > To: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> > > Cc: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>; KY Srinivasan
> > > <kys@xxxxxxxxxxxxx>; olaf@xxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > > bjorn.helgaas@xxxxxxxxx; apw@xxxxxxxxxxxxx;
> devel@xxxxxxxxxxxxxxxxxxxxxx;
> > > leann.ogasawara@xxxxxxxxxxxxx; jasowang@xxxxxxxxxx
> > > Subject: Re: [PATCH 3/3] hv_netvsc: Implement VF matching based on
> > > serial numbers
> > >
> > > On Fri, 9 Dec 2016 20:09:49 +0000
> > > Haiyang Zhang <haiyangz@xxxxxxxxxxxxx> wrote:
> > >
> > > > > -----Original Message-----
> > > > > From: Stephen Hemminger [mailto:stephen@xxxxxxxxxxxxxxxxxx]
> > > > > Sent: Friday, December 9, 2016 1:21 PM
> > > > > To: Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx>
> > > > > Cc: KY Srinivasan <kys@xxxxxxxxxxxxx>; olaf@xxxxxxxxx; Haiyang
> Zhang
> > > > > <haiyangz@xxxxxxxxxxxxx>; linux-kernel@xxxxxxxxxxxxxxx;
> > > > > bjorn.helgaas@xxxxxxxxx; apw@xxxxxxxxxxxxx;
> > > devel@xxxxxxxxxxxxxxxxxxxxxx;
> > > > > leann.ogasawara@xxxxxxxxxxxxx; jasowang@xxxxxxxxxx
> > > > > Subject: Re: [PATCH 3/3] hv_netvsc: Implement VF matching based
> on
> > > > > serial numbers
> > > > >
> > > > > On Fri, 9 Dec 2016 08:31:22 +0100
> > > > > Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > > On Fri, Dec 09, 2016 at 12:05:53AM +0000, KY Srinivasan wrote:
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Greg KH [mailto:gregkh@xxxxxxxxxxxxxxxxxxx]
> > > > > > > > Sent: Thursday, December 8, 2016 7:56 AM
> > > > > > > > To: KY Srinivasan <kys@xxxxxxxxxxxxx>
> > > > > > > > Cc: linux-kernel@xxxxxxxxxxxxxxx;
> devel@xxxxxxxxxxxxxxxxxxxxxx;
> > > > > > > > olaf@xxxxxxxxx; apw@xxxxxxxxxxxxx; vkuznets@xxxxxxxxxx;
> > > > > > > > jasowang@xxxxxxxxxx; leann.ogasawara@xxxxxxxxxxxxx;
> > > > > > > > bjorn.helgaas@xxxxxxxxx; Haiyang Zhang
> <haiyangz@xxxxxxxxxxxxx>
> > > > > > > > Subject: Re: [PATCH 3/3] hv_netvsc: Implement VF matching
> > > based on
> > > > > serial
> > > > > > > > numbers
> > > > > > > >
> > > > > > > > On Thu, Dec 08, 2016 at 12:33:43AM -0800,
> > > > > kys@xxxxxxxxxxxxxxxxxxxxxx
> > > > > > > > wrote:
> > > > > > > > > From: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> > > > > > > > >
> > > > > > > > > We currently use MAC address to match VF and synthetic
> NICs.
> > > > > Hyper-V
> > > > > > > > > provides a serial number to both devices for this
> purpose.
> > > This
> > > > > patch
> > > > > > > > > implements the matching based on VF serial numbers. This
> is
> > > the
> > > > > way
> > > > > > > > > specified by the protocol and more reliable.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Haiyang Zhang <haiyangz@xxxxxxxxxxxxx>
> > > > > > > > > Signed-off-by: K. Y. Srinivasan <kys@xxxxxxxxxxxxx>
> > > > > > > > > ---
> > > > > > > > > drivers/net/hyperv/netvsc_drv.c | 55
> > > > > > > > ++++++++++++++++++++++++++++++++++++---
> > > > > > > > > 1 files changed, 51 insertions(+), 4 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/drivers/net/hyperv/netvsc_drv.c
> > > > > > > > b/drivers/net/hyperv/netvsc_drv.c
> > > > > > > > > index 9522763..c5778cf 100644
> > > > > > > > > --- a/drivers/net/hyperv/netvsc_drv.c
> > > > > > > > > +++ b/drivers/net/hyperv/netvsc_drv.c
> > > > > > > > > @@ -1165,9 +1165,10 @@ static void
> netvsc_free_netdev(struct
> > > > > > > > net_device *netdev)
> > > > > > > > > free_netdev(netdev);
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > -static struct net_device *get_netvsc_bymac(const u8
> *mac)
> > > > > > > > > +static struct net_device *get_netvsc_byvfser(u32 vfser)
> > > > > > > > > {
> > > > > > > > > struct net_device *dev;
> > > > > > > > > + struct net_device_context *ndev_ctx;
> > > > > > > > >
> > > > > > > > > ASSERT_RTNL();
> > > > > > > > >
> > > > > > > > > @@ -1175,7 +1176,8 @@ static void
> netvsc_free_netdev(struct
> > > > > net_device
> > > > > > > > *netdev)
> > > > > > > > > if (dev->netdev_ops != &device_ops)
> > > > > > > > > continue; /* not a netvsc device */
> > > > > > > > >
> > > > > > > > > - if (ether_addr_equal(mac, dev->perm_addr))
> > > > > > > > > + ndev_ctx = netdev_priv(dev);
> > > > > > > > > + if (ndev_ctx->vf_serial == vfser)
> > > > > > > > > return dev;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > @@ -1205,21 +1207,66 @@ static void
> > > netvsc_free_netdev(struct
> > > > > > > > net_device *netdev)
> > > > > > > > > return NULL;
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > > +static u32 netvsc_get_vfser(struct net_device
> *vf_netdev)
> > > > > > > > > +{
> > > > > > > > > + struct device *dev;
> > > > > > > > > + struct hv_device *hdev;
> > > > > > > > > + struct hv_pcibus_device *hbus = NULL;
> > > > > > > > > + struct list_head *iter;
> > > > > > > > > + struct hv_pci_dev *hpdev;
> > > > > > > > > + unsigned long flags;
> > > > > > > > > + u32 vfser = 0;
> > > > > > > > > + u32 count = 0;
> > > > > > > > > +
> > > > > > > > > + for (dev = &vf_netdev->dev; dev; dev = dev->parent)
> {
> > > > > > > >
> > > > > > > > You are going to walk the whole device tree backwards?
> That's
> > > > > crazy.
> > > > > > > > And foolish. And racy and broken (what happens if the
> tree
> > > > > changes
> > > > > > > > while you do this?) Where is the lock being grabbed while
> > > this
> > > > > happens?
> > > > > > > > What about reference counts? Do you see other drivers
> ever
> > > doing
> > > > > this
> > > > > > > > (if you do, point them out and I'll go yell at them too...)
> > > > > > >
> > > > > > > Greg,
> > > > > > >
> > > > > > > We are registering for netdev events. Coming into this
> function,
> > > the
> > > > > caller
> > > > > > > guarantees that the list of netdevs does not change - we
> assert
> > > this
> > > > > on entry:
> > > > > > > ASSERT_RTNL(). We are only walking up the device tree for
> the
> > > > > netdevs whose
> > > > > > > state change is being notified to us - the device tree being
> > > walked
> > > > > here is limited to
> > > > > > > netdevs under question.
> > > > > >
> > > > > > But a netdev is a child of some type of "real" device, and you
> are
> > > now
> > > > > > walking the tree of all devices up to the "root" parent device,
> > > which
> > > > > > means you will hit PCI bridges, USB controllers, and all sorts
> of
> > > fun
> > > > > > things if you are a child of those types of devices.
> > > > > >
> > > > > > And can't you tell if the netdev for this event, really is
> "your"
> > > > > > netdev? Or are you getting called this for "all" netdevs?
> Sorry,
> > > I
> > > > > > don't know this api, any pointers to it would be appreciated.
> > > > > >
> > > > > > > We have a reference to the device and we know the device is
> not
> > > > > going away. Is it not
> > > > > > > safe to dereference the parent pointer - after all the child
> has
> > > > > taken a reference on
> > > > > > > the parent as part of device_add() call.
> > > > > >
> > > > > > It might be, and might not be. There's a reason you don't see
> > > this
> > > > > > pattern anywhere in the kernel because of this...
> > > > > >
> > > > > > > > > + if (!dev_is_vmbus(dev))
> > > > > > > > > + continue;
> > > > > > > >
> > > > > > > > Ick.
> > > > > > > >
> > > > > > > > Why isn't your parent pointer a vmbus device all the time?
> > > How
> > > > > could
> > > > > > > > you get burried down in the device hierarchy when you are
> the
> > > > > driver for
> > > > > > > > a specific bus type in the first place? How could this
> > > function
> > > > > ever be
> > > > > > > > called for a device that is NOT of this type?
> > > > > > >
> > > > > > > We get notified when state changes on any of the netdev
> devices
> > > in
> > > > > the system.
> > > > > > > Not all netdevs in the system belong to vmbus. Consider for
> > > instance
> > > > > the
> > > > > > > emulated NIC that can be configured. This is an emulated PCI
> NIC.
> > > We
> > > > > are only
> > > > > > > interested in netdevs that correspond to the VF instance
> that we
> > > are
> > > > > interested in.
> > > > > >
> > > > > > Can you "know" this is your netdev by some other way than
> having
> > > to
> > > > > walk
> > > > > > the device tree? Name? local device type? Something else?
> This
> > > > > seems
> > > > > > like an odd api in that everyone would have to do gyrations
> like
> > > this
> > > > > in
> > > > > > order to determine if the netdev is "theirs" or not...
> > > > >
> > > > > The scenario is SR-IOV on Hyper-V. In the case of VF device, the
> > > host
> > > > > hands the
> > > > > guest OS a PCI device for the virtual function device. The VF
> device
> > > is
> > > > > placed
> > > > > on a special synthetic PCI bus (ie not part of the other buses
> on
> > > the
> > > > > system).
> > > > > The VF device also has a synthetic network interface (netvsc)
> which
> > > > > lives
> > > > > on VMBUS. This code is about managing the interaction between
> the
> > > two.
> > > > >
> > > > > The association between VF and synthetic NIC is done in response
> to
> > > the
> > > > > VF network device being registered. Initial version was based on
> MAC
> > > > > address
> > > > > which is the same. Later refinement used permanent MAC address
> to
> > > > > avoid bugs if MAC address changed. This version is to use
> serial
> > > number
> > > > > instead which is safer than MAC address.
> > > > >
> > > > > The code to walk up/down maybe not be needed to find serial
> number.
> > > > > Perhaps a more direct single set of conditions is possible?
> > > > >
> > > > > Something like:
> > > > >
> > > > > In pci-hyperv.c
> > > > >
> > > > > u32 hv_pcifront_get_serial(struct pci_bus *bus, unsigned int
> devfn)
> > > > > {
> > > > > struct hv_pcibus_device *hbus
> > > > > = container_of(bus->sysdata,
> > > > > struct hv_pcibus_device, sysdata);
> > > > > struct hf_pci_dev *hpdev;
> > > > > u32 serial;
> > > > >
> > > > > hpdev = get_pcichild_wslot(hbus, devfn_to_wslot(pdev->devfn));
> > > > > if (!hpdev)
> > > > > return 0;
> > > > >
> > > > > serial = hpdev->devs.ser;
> > > > > put_pcichild(hpdev, hv_pcidev_ref_by_slot);
> > > > > return serial;
> > > > > }
> > > > >
> > > > > In netvsc_drv.c
> > > > >
> > > > > static u32 netvsc_get_vfser(struct net_device *vf_netdev)
> > > > > {
> > > > > struct device *dev = vf_netdev->dev.parent;
> > > > > struct pci_dev *pdev;
> > > > > u32 wslot;
> > > > >
> > > > > if (!dev || !dev_is_pci(dev))
> > > > > return 0;
> > > > >
> > > > > pdev = container_of(dev, struct pci_device, dev);
> > > > >
> > > > > return hv_pcifront_get_serial(pdev->bus, pdev->devfn);
> > > > > }
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > P.S: it would be good to be able to get win_slot out through
> sysfs
> > > as
> > > > > well for systemd/udev.
> > > >
> > > > Stephen,
> > > >
> > > > Thanks for suggestion. Actually, in my earlier implementation of
> this
> > > > feature (VF serial based matching), I thought about export a
> function
> > > > from vPCI driver, then calling it from netvsc. So I don't need to
> > > > move structs between headers... But, it creates a dependency of
> netvsc
> > > > on vPCI driver's symbol. So, even if on a VM without SRIOV, we
> have to
> > > > load vPCI driver, which we don't want.
> > > >
> > > > Also, hv_vpci device is 3 parent layers above the vf_netdevice:
> > > > Here is the VF drv hierarchy --
> > > > Should we assume it's always 3 parents above vf_netdevice, or
> search
> > > for it?
> > > >
> > > > [ 368.185259] HZINFO:NETDEV_REGISTER:
> > > > [ 368.185261] HZINFO: dev:ffff88007c10d518, bus: (null),
> > > busName:(null), drvName:(null)
> > > > [ 368.185262] HZINFO: dev:ffff88007c10c0a0, bus:ffffffff81ce4b60,
> > > busName:pci, drvName:ixgbevf
> > > > [ 368.185263] HZINFO: dev:ffff8800355c0000, bus: (null),
> > > busName:(null), drvName:(null)
> > > > [ 368.185264] HZINFO: dev:ffff8800355c5428, bus:ffffffffa0008160,
> > > busName:vmbus, drvName:hv_pci
> > > > [ 368.185264] HZINFO: dev:ffff88007c49e268, bus:ffffffff81ce9800,
> > > busName:acpi, drvName:vmbus
> > > > [ 368.185265] HZINFO: dev:ffff88007c48ea68, bus:ffffffff81ce9800,
> > > busName:acpi, drvName:(null)
> > > > [ 368.185266] HZINFO: dev:ffff88007c48aa68, bus:ffffffff81ce9800,
> > > busName:acpi, drvName:(null)
> > > > [ 368.185266] HZINFO: dev:ffff88007c48a268, bus:ffffffff81ce9800,
> > > busName:acpi, drvName:(null)
> > > > [ 368.185267] HZINFO: dev:ffff88007c489a68, bus:ffffffff81ce9800,
> > > busName:acpi, drvName:(null)
> > > >
> > > > Thanks,
> > > > - Haiyang
> > >
> > > Since this is a synthetic bus, the topology should not change unless
> > > host side
> > > software changes. The vf_netdev device has to be PCI device, so that
> is
> > > going to
> > > be certain. After that there maybe intermediate up to hv_pci. The
> code
> > > in hyperv-pci
> > > already has similar stuff (ie for read_config).
> >
> > Other netdevice, like emulated NIC can also trigger this notification.
> > They are not vPCI.
> >
> > Thanks,
> > - Haiyang
>
> Emulated NIC is already excluded in start of netvc notifier handler.
>
> static int netvsc_netdev_event(struct notifier_block *this,
> unsigned long event, void *ptr)
> {
> struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
>
> /* Skip our own events */
> if (event_dev->netdev_ops == &device_ops)
> return NOTIFY_DONE;
>
Emulated device is not based on netvsc. It's the native Linux (dec100M?)
Driver. So this line doesn't exclude it. And how about other NIC type
may be added in the future?
Thanks,
- Haiyang