Re: [lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.

From: NeilBrown
Date: Sun Jun 03 2018 - 23:55:27 EST

On Sun, Jun 03 2018, Dilger, Andreas wrote:

> On Jun 1, 2018, at 17:19, NeilBrown <neilb@xxxxxxxx> wrote:
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>>> Would it make sense to land LNet and LNDs on their own first? Get
>>> the networking house in order first before layering on the file
>>> system?
>> I'd like to turn that question on its head:
>> Do we need LNet and LNDs? What value do they provide?
>> (this is a genuine question, not being sarcastic).
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality. It sits over sockets (TCP) and IB and
>> provides a uniform interface.
> LNet is originally based on a high-performance networking stack called
> Portals (v3), with additions for LNet
> routing to allow cross-network bridging.
> A critical part of LNet is that it is for RDMA and not packet-based
> messages. Everything in Lustre is structured around RDMA. Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).

Thanks! That will probably help me understand it more easily next time
I dive in.

> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.

That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed. I wonder if it could benefit from more explicit separation of
message sizes.
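
For illustration only, here is a rough userspace sketch of that kind of
split: one socket for small messages with Nagle disabled, plus separate
sockets for bulk send and receive. It is written in Python for brevity
and is not how LNet's socket LND or sunrpc actually implements anything.
The channel names and the 4096-byte cutoff are made up for the example.

```python
import socket

# Hypothetical cutoff between "small" and "bulk" messages; not an LNet value.
SMALL_MESSAGE_LIMIT = 4096


def make_channels():
    """Create three unconnected TCP sockets, one per (hypothetical) role."""
    channels = {
        name: socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        for name in ("small", "bulk_send", "bulk_recv")
    }
    # Disable Nagle on the small-message socket only, so short requests
    # are not held back waiting to be coalesced with later data.
    channels["small"].setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return channels


def channel_for_send(nbytes):
    """Pick the outgoing channel name for a message of nbytes bytes."""
    return "small" if nbytes <= SMALL_MESSAGE_LIMIT else "bulk_send"
```

A caller would connect all three sockets to the same peer and route each
outgoing message through channel_for_send(); the bulk sockets keep the
default Nagle behaviour, where coalescing does no harm.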

Thanks a lot for this background info!

> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth. While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>> That is almost a description of the xprt layer in sunrpc. sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
> No, that would totally kill the performance of Lustre.
>> How much would we need to enhance sunrpc/xprt for this to work? What
>> hooks would be needed to implement the routing as a separate layer.
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre. The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
> [*]
> [+]
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree. Certainly it would be easier for a larger community to
>> be participating in the work.
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
> However, _just_ using the LNet transport for (p)NFS might make sense. LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation
