Re: RFC: MTU for serving NFS on Infiniband

From: Roland Dreier
Date: Fri Aug 27 2010 - 12:20:44 EST


> Infiniband device driver needs to be fixed to do SG and checksum offload.
> Otherwise it is insane to try and run large MTU over it. I even wonder if
> the dev_change_mtu() function should reject > PAGESIZE mtu for devices
> that don't do scatter/gather or at least raise a warning.

It's not possible to "fix" the driver to do checksum offload, since the
underlying hardware does not support it. Theoretically we could handle
SG but of course there's no point in that without checksum offload.

I think there is some confusion about what IPoIB is in this thread, so
let me try to give some basic background to help the discussion. There
are two "modes" that an IPoIB interface can operate in: datagram mode
and connected mode.
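
As a quick way to see which of the two modes an interface is in, here
is a minimal Python sketch. It assumes the per-interface "mode" and
"mtu" files that the Linux IPoIB driver exposes under /sys/class/net;
the function name and the sysfs_root parameter are made up for
illustration.

```python
from pathlib import Path

def ipoib_mode_and_mtu(ifname, sysfs_root="/sys/class/net"):
    """Return (mode, mtu) for an IPoIB interface.

    Assumption: the interface directory contains a 'mode' file holding
    'datagram' or 'connected' and an 'mtu' file holding a decimal number.
    """
    base = Path(sysfs_root) / ifname
    mode = (base / "mode").read_text().strip()
    mtu = int((base / "mtu").read_text().strip())
    return mode, mtu
```

On a machine with an IPoIB interface, calling
ipoib_mode_and_mtu("ib0") would return the current mode and MTU;
"ib0" is only an example name.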

In datagram mode, packets given to the IPoIB driver are sent as IB
unreliable datagram messages, which means each skb turns into one packet
on the wire -- very much like the ethernet case. In this mode, the MTU
is limited by the MTU on the IB side, which is typically either 2K or 4K
depending on the adapter and the switches involved. Modern IB adapters
do support checksum offload and large send offload for datagrams, so we
can and do enable SG and IP_CSUM.
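
To make the datagram-mode limit concrete, here is a rough sketch of
the arithmetic. It assumes the 4-byte IPoIB encapsulation header
defined in RFC 4391, which is not discussed above; the constant and
function names are hypothetical, not driver code.

```python
# Assumption: 4-byte IPoIB encapsulation header (RFC 4391).
IPOIB_ENCAP_LEN = 4

def datagram_ip_mtu(ib_mtu):
    """Largest IP packet that fits in one IB unreliable datagram."""
    if ib_mtu <= IPOIB_ENCAP_LEN:
        raise ValueError("IB MTU too small to carry an IPoIB header")
    return ib_mtu - IPOIB_ENCAP_LEN

# The two typical IB MTUs mentioned above.
for ib_mtu in (2048, 4096):
    print(ib_mtu, "->", datagram_ip_mtu(ib_mtu))
```

Under that assumption, a 2K IB MTU leaves 2044 bytes for the IP packet
and a 4K IB MTU leaves 4092.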

In connected mode, the IPoIB driver actually makes a reliable connection
to each peer. For reliable connections, IB adapters can actually send
messages up to 4GB, with the adapter handling all the segmentation and
transport level acks etc. -- the host system simply queues one work
request for each message of any size. These work requests do support
gather/scatter, but no existing adapter supports checksum offload for
messages on reliable connections.

However, since reliable connections support arbitrary sized messages, in
connected mode the IPoIB driver allows an MTU up to roughly the maximum
64K IP message size. (I don't think anyone has tried it with bigger
IPv6 jumbograms ;)
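
To illustrate what the adapter takes off the host in connected mode,
the following sketch counts how many wire packets one message turns
into. The 65520-byte message size is an assumed example of a "roughly
64K" MTU, per-packet transport headers are ignored, and the function
name is made up.

```python
def wire_packets(message_bytes, ib_path_mtu):
    """Packets the adapter emits for one reliable-connection message.

    The host queues a single work request either way; the adapter
    segments the message into ib_path_mtu-sized pieces.
    """
    if message_bytes <= 0 or ib_path_mtu <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division.
    return -(-message_bytes // ib_path_mtu)

# Assumed example: one 65520-byte message on 2K and 4K IB paths.
for path_mtu in (2048, 4096):
    print(path_mtu, "->", wire_packets(65520, path_mtu))
```

So a single work request covers what would be 32 separate packets on
a 2K path, or 16 on a 4K path.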

It does seem that, even with all the horrible memory allocation problems
caused by requiring huge linear skbs, connected mode does offer very
good performance for at least some real-world uses (although apparently
NFS is not one such use). In fact as far as I know, connected mode with
a huge MTU continues to outperform datagram mode even with LSO and LRO
(although I don't have any particularly recent numbers). So I don't
think we want to completely disallow such uses.

- R.