Re: [lustre-devel] [PATCH] staging: lustre: delete the filesystem from the tree.

From: NeilBrown
Date: Sun Jun 03 2018 - 23:55:27 EST


On Sun, Jun 03 2018, Dilger, Andreas wrote:

> On Jun 1, 2018, at 17:19, NeilBrown <neilb@xxxxxxxx> wrote:
>>
>> On Fri, Jun 01 2018, Doug Oucharek wrote:
>>
>>> Would it makes sense to land LNet and LNDs on their own first? Get
>>> the networking house in order first before layering on the file
>>> system?
>>
>> I'd like to turn that question on it's head:
>> Do we need LNet and LNDs? What value do they provide?
>> (this is a genuine question, not being sarcastic).
>>
>> It is a while since I tried to understand LNet, and then it was a
>> fairly superficial look, but I think it is an abstraction layer
>> that provides packet-based send/receive with some numa-awareness
>> and routing functionality. It sits over sockets (TCP) and IB and
>> provides a uniform interface.
>
> LNet is originally based on a high-performance networking stack called
> Portals (v3, http://www.cs.sandia.gov/Portals/), with additions for LNet
> routing to allow cross-network bridging.
>
> A critical part of LNet is that it is for RDMA and not packet-based
> messages. Everything in Lustre is structured around RDMA. Of course,
> RDMA is not possible with TCP so it just does send/receive under the
> covers, though it can do zero copy data sends (and at one time zero-copy
> receives, but those changes were rejected by the kernel maintainers).
> It definitely does RDMA with IB, RoCE, OPA in the kernel, and other RDMA
> network types not in the kernel (e.g. Cray Gemini/Aries, Atos/Bull BXI,
> and previously older network types no longer supported).

Thanks! That will probably help me understand it more easily next time
I dive in.

>
> Even with TCP it has some improvements for performance, such as using
> separate sockets for send and receive of large messages, as well as
> a socket for small messages that has Nagle disabled so that it does
> not delay those packets for aggregation.

That sounds like something that could benefit NFS...
pNFS already partially does this by virtue of the fact that data often
goes to a different server than control, so a different socket is
needed. I wonder if it could benefit from more explicit separate of
message sizes.


Thanks a lot for this background info!
NeilBrown

>
> In addition to the RDMA support, there is also multi-rail support in
> the out-of-tree version that we haven't been allowed to land, which
> can aggregate network bandwidth. While there exists channel bonding
> for TCP connections, that does not exist for IB or other RDMA networks.
>
>> That is almost a description of the xprt layer in sunrpc. sunrpc
>> doesn't have routing, but it does have some numa awareness (for the
>> server side at least) and it definitely provides packet-based
>> send/receive over various transports - tcp, udp, local (unix domain),
>> and IB.
>> So: can we use sunrpc/xprt in place of LNet?
>
> No, that would totally kill the performance of Lustre.
>
>> How much would we need to enhance sunrpc/xprt for this to work? What
>> hooks would be needed to implement the routing as a separate layer.
>>
>> If LNet is, in some way, much better than sunrpc, then can we share that
>> superior functionality with our NFS friends by adding it to sunrpc?
>
> There was some discussion at NetApp about adding a Lustre/LNet transport
> for pNFS, but I don't think it ever got beyond the proposal stage:
>
> https://tools.ietf.org/html/draft-faibish-nfsv4-pnfs-lustre-layout-07
>
>> Maybe the answer to this is "no", but I think LNet would be hard to sell
>> without a clear statement of why that was the answer.
>
> There are other users outside of the kernel tree that use LNet in addition
> to just Lustre. The Cray "DVS" I/O forwarding service[*] uses LNet, and
> another experimental filesystem named Zest[+] also used LNet.
>
> [*] https://www.alcf.anl.gov/files/Sugiyama-Wallace-Thursday16B-slides.pdf
> [+] https://www.psc.edu/images/zest/zest-sc07-paper.pdf
>
>> One reason that I would like to see lustre stay in drivers/staging (so I
>> do not support Greg's patch) is that this sort of transition of Lustre
>> to using an improved sunrpc/xprt would be much easier if both were in
>> the same tree. Certainly it would be easier for a larger community to
>> be participating in the work.
>
> I don't think the proposal to encapsulate all of the Lustre protocol into
> pNFS made a lot of sense, since this would have only really been available
> on Linux, at which point it would be better to use the native Lustre client
> rather than funnel everything through pNFS.
>
> However, _just_ using the LNet transport for (p)NFS might make sense. LNet
> is largely independent from Lustre (it used to be a separate source tree)
> and is very efficient over the network.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Principal Architect
> Intel Corporation

Attachment: signature.asc
Description: PGP signature