Re: [PATCH v2 01/17] IB/Verbs: Implement new callback query_transport() for each HW

From: Doug Ledford
Date: Fri Apr 10 2015 - 13:10:57 EST


On Fri, 2015-04-10 at 03:48 -0400, ira.weiny wrote:
> > > diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
> > > index 18c1ece..a9587c4 100644
> > > --- a/drivers/infiniband/core/device.c
> > > +++ b/drivers/infiniband/core/device.c
> > > @@ -76,6 +76,7 @@ static int ib_device_check_mandatory(struct ib_device *device)
> > > } mandatory_table[] = {
> > > IB_MANDATORY_FUNC(query_device),
> > > IB_MANDATORY_FUNC(query_port),
> > > + IB_MANDATORY_FUNC(query_transport),
> > > IB_MANDATORY_FUNC(query_pkey),
> > > IB_MANDATORY_FUNC(query_gid),
> > > IB_MANDATORY_FUNC(alloc_pd),
> >
> > I'm concerned about the performance implications of this. The size of
> > this patchset already points out just how many places in the code we
> > have to check for various aspects of the device transport in order to do
> > the right thing. Without going through the entire list to see how many
> > are on critical hot paths, I'm sure some of them are on at least
> > partially critical hot paths (like creation of new connections). I
> > would prefer to see this change be implemented via a device attribute,
> > not a functional call query. That adds a needless function call in
> > these paths.
>
> I like the idea of a query_transport but at the same time would like to see the
> use of "transport" reduced. A reduction in the use of this call could
> eliminate most performance concerns.
>
> So can we keep this abstraction if at the end of the series we limit its use?

The reason I don't like a query is because the transport type isn't
changing. It's a static device attribute. The only devices that *can*
change their transport are mlx4 or mlx5 devices, and they tear down and
deregister their current device and bring up a new one when they need to
change transports or link layers. So, this really isn't something we
should query, this should be part of our static device attributes.
Every other query in the list above is for something that changes. This
is not.

> >
> > > diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
> > > index f93eb8d..83370de 100644
> > > --- a/drivers/infiniband/core/verbs.c
> > > +++ b/drivers/infiniband/core/verbs.c
> > > @@ -133,14 +133,16 @@ enum rdma_link_layer rdma_port_get_link_layer(struct ib_device *device, u8 port_
> > > if (device->get_link_layer)
> > > return device->get_link_layer(device, port_num);
> > >
> > > - switch (rdma_node_get_transport(device->node_type)) {
> > > + switch (device->query_transport(device, port_num)) {
> > > case RDMA_TRANSPORT_IB:
> > > + case RDMA_TRANSPORT_IBOE:
> > > return IB_LINK_LAYER_INFINIBAND;
> >
> > If we are perserving ABI, then this looks wrong. Currently, IBOE
> > returnsi transport IB and link layer Ethernet. It should not return
> > link layer IB, it does not support IB link layer operations (such as MAD
> > access).
>
> I think the original code has the bug.
>
> IBoE devices currently return a transport of IB but they probably never get
> here because they support the get_link_layer callback used a few lines above.
> So this "bug" was probably never hit.
>
> >
> > > case RDMA_TRANSPORT_IWARP:
> > > case RDMA_TRANSPORT_USNIC:
> > > case RDMA_TRANSPORT_USNIC_UDP:
> > > return IB_LINK_LAYER_ETHERNET;
> > > default:
> > > + BUG();
> > > return IB_LINK_LAYER_UNSPECIFIED;
> > > }
> > > }
> >
> > [ snip ]
> >
> > > diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> > > index 65994a1..d54f91e 100644
> > > --- a/include/rdma/ib_verbs.h
> > > +++ b/include/rdma/ib_verbs.h
> > > @@ -75,10 +75,13 @@ enum rdma_node_type {
> > > };
> > >
> > > enum rdma_transport_type {
> > > + /* legacy for users */
> > > RDMA_TRANSPORT_IB,
> > > RDMA_TRANSPORT_IWARP,
> > > RDMA_TRANSPORT_USNIC,
> > > - RDMA_TRANSPORT_USNIC_UDP
> > > + RDMA_TRANSPORT_USNIC_UDP,
> > > + /* new transport */
> > > + RDMA_TRANSPORT_IBOE,
> > > };
> >
> > I'm also concerned about this. I would like to see this enum
> > essentially turned into a bitmap. One that is constructed in such a way
> > that we can always get the specific test we need with only one compare
> > against the overall value. In order to do so, we need to break it down
> > into the essential elements that are part of each of the transports.
> > So, for instance, we can define the two link layers we have so far, plus
> > reserve one for OPA which we know is coming:
> >
> > RDMA_LINK_LAYER_IB = 0x00000001,
> > RDMA_LINK_LAYER_ETH = 0x00000002,
> > RDMA_LINK_LAYER_OPA = 0x00000004,
> > RDMA_LINK_LAYER_MASK = 0x0000000f,
>
> I would reserve more bits here.

Sure. I didn't mean to imply that this was as large is these bit fields
would ever need to be, I just typed it up quickly at the time.

> >
> > We can then define the currently known high level transport types:
> >
> > RDMA_TRANSPORT_IB = 0x00000010,
> > RDMA_TRANSPORT_IWARP = 0x00000020,
> > RDMA_TRANSPORT_USNIC = 0x00000040,
> > RDMA_TRANSPORT_USNIC_UDP = 0x00000080,
> > RDMA_TRANSPORT_MASK = 0x000000f0,
>
> I would reserve more bits here.
>
> >
> > We could then define bits for the IB management types:
> >
> > RDMA_MGMT_IB = 0x00000100,
> > RDMA_MGMT_OPA = 0x00000200,
> > RDMA_MGMT_MASK = 0x00000f00,
>
> We at least need bits for SA / CM support.
>
> I said previously all device types support QP1 I was wrong... I forgot about
> USNIC devices. So the full management bit mask is.
>
>
> RDMA_MGMT_IB_MAD = 0x00000100,
> RDMA_MGMT_QP0 = 0x00000200,
> RDMA_MGMT_SA = 0x00000400,
> RDMA_MGMT_CM = 0x00000800,
> RDMA_MGMT_OPA_MAD = 0x00001000,
> RDMA_MGMT_MASK = 0x000fff00,
>
> With a couple of spares.
>
> The MAD stack is pretty agnostic to the types of MADs passing through it so we
> don't really need PM flags etc.
>
> >
> > Then we have space to define specific quirks:
> >
> > RDMA_SEPARATE_READ_SGE = 0x00001000,
> > RDMA_QUIRKS_MASK = 0xfffff000
>
> shift for spares...
>
> RDMA_SEPARATE_READ_SGE = 0x00100000,
> RDMA_QUIRKS_MASK = 0xfff00000
>
> >
> > Once those are defined, a few definitions for device drivers to use when
> > they initialize a device to set the bitmap to the right values:
> >
> > #define IS_IWARP (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IWARP |
> > RDMA_SEPARATE_READ_SGE)
> > #define IS_IB (RDMA_LINK_LAYER_IB | RDMA_TRANSPORT_IB | RDMA_MGMT_IB)
> > #define IS_IBOE (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_IB)
> > #define IS_USNIC (RDMA_LINK_LAYER_ETH | RDMA_TRANSPORT_USNIC)
> > #define IS_OPA (RDMA_LINK_LAYER_OPA | RDMA_TRANSPORT_IB | RDMA_MGMT_IB |
> > RDMA_MGMT_OPA)
> >
> > Then you need to define the tests:
> >
> > static inline bool
> > rdma_transport_is_iwarp(struct ib_device *dev, u8 port)
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_IWARP;
> > }
> >
> > /* Note: this intentionally covers IB, IBOE, and OPA...use
> > rdma_dev_is_ib if you want to only get physical IB devices */
> > static inline bool
> > rdma_transport_is_ib(struct ibdev *dev)
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_IB;
> > }
> >
> > rdma_port_is_ib(struct ib_device *dev, u8 port)
>
> I prefer
>
> rdma_port_link_layer_is_ib
> rdma_port_link_layer_is_eth

In my quick example, anything that started with rdma_transport* was
testing the high level transport attribute of any given port of any
given device, and anything that started with rdma_port* was testing the
link layer on a specific port of any given device.

>
> > {
> > return dev->port[port]->transport & RDMA_LINK_LAYER_IB;
> > }
> >
> > rdma_port_is_iboe(struct ib_device *dev, u8 port)
>
> I'm not sure what this means.
>
> rdma_port_req_iboe_addr seems more appropriate because what we really need to
> know is that this device requires an IBoE address format. (PKey is "fake",
> PathRecord is fabricated rather than queried, GRH is required in the AH
> conversion.)

TomAto, Tomato...port_is_ib implicity means port_req_iboe_addr.
>
> > {
> > return dev->port[port]->transport & IS_IBOE == IS_IBOE;
> > }
> >
> > rdma_port_is_usnic(struct ib_device *dev, u8 port)
>
> rdma_transport_is_usnic
>
> > {
> > return dev->port[port]->transport & RDMA_TRANSPORT_USNIC;
> > }
> >
> > rdma_port_is_opa(struct ib_device *dev, u8 port)
>
> rdma_port_link_layer_is_opa
>
> > {
> > return dev->port[port]->transport & RDMA_LINK_LAYER_OPA;
> > }
> >
> > rdma_port_is_iwarp(struct ib_device *dev, u8 port)
> > {
> > return rdma_transport_is_iwarp(dev, port);
>
> Why not call rdma_transport_is_iwarp?

As per my above statement, rdma_transport* tests were testing the high
level transport type, rdma_port* types were testing link layers. iWARP
has an Eth link layer, so technically port_is_iwarp makes no sense. But
since all the other types had a check too, I included port_is_iwarp just
to be complete, and if you are going to ask if a specific port is iwarp
as a link layer, it makes sense to say yes if the transport is iwarp,
not if the link layer is eth.

[ snip lots of stuff that is all correct ]

> > Patch 8/17:
> >
> > Here we can create a new test if we are using the bitmap I created
> > above:
> >
> > rdma_port_ipoib(struct ib_device *dev, u8 port)
> > {
> > return !(dev->port[port]->transport & RDMA_LINK_LAYER_ETH);
>
> I would prefer a link layer (or generic) bit mask rather than '&' of
> "transport" and "link layer". That just seems wrong.

I kept the name transport, but it's really a device attribute bitmap.
And of all the link layer types, eth is the only one for which IPoIB
makes no sense (even if it's possible to do). So, as long as the ETH
bit isn't set, we're good to go. But, calling it
dev->port[port]->attributes instead of transport would make it more
clear what it is.

> I guess rdma_dev_is_iboe is ok. But it seems like we are keying off the
> addresses not necessarily the devices.

They're one and the same. The addresses go with the device, the device
goes with the addresses. You never have one without the other. The
name of the check is not really important, just as long as it's clearly
documented. I get why you link the address variant, because it pops out
all the things that are special about IBoE addressing and calls out that
the issues need to be handled. However, saying requires_iboe_addr(),
while foreshadowing the work that needs done, doesn't actually document
the work that needs done. Whether we call is dev_is_iboe() or
requires_iboe_addr(), it would be good if the documentation spelled out
those specific requirements for reference sake.


--
Doug Ledford <dledford@xxxxxxxxxx>
GPG KeyID: 0E572FDD


Attachment: signature.asc
Description: This is a digitally signed message part