Re: [ofa-general] Re: Demand paging for memory regions

From: Christoph Lameter
Date: Wed Feb 13 2008 - 13:52:20 EST


On Tue, 12 Feb 2008, Jason Gunthorpe wrote:

> But this isn't how IB or iwarp work at all. What you describe is a
> significant change to the general RDMA operation and requires changes to
> both sides of the connection and the wire protocol.

Yes it may require a separate connection between both sides where a
kind of VM notification protocol is established to tear these things down and
set them up again. That is if there is nothing in the RDMA protocol that
allows a notification to the other side that the mapping is being down
down.

> - In RDMA (iwarp and IB versions) the hardware page tables exist to
> linearize the local memory so the remote does not need to be aware
> of non-linearities in the physical address space. The main
> motivation for this is kernel bypass where the user space app wants
> to instruct the remote side to DMA into memory using user space
> addresses. Hardware provides the page tables to switch from
> incoming user space virtual addresses to physical addresess.

s/switch/translate I guess. That is good and those page tables could be
used for the notification scheme to enable reclaim. But they are optional
and are maintaining the driver state. The linearization could be
reconstructed from the kernel page tables on demand.

> Many kernel RDMA drivers (SCSI, NFS) only use the HW page tables
> for access control and enforcing the liftime of the mapping.

Well the mapping would have to be on demand to avoid the issues that we
currently have with pinning. The user API could stay the same. If the
driver tracks the mappings using the notifier then the VM can make sure
that the right things happen on exit etc etc.

> The page tables in the RDMA hardware exist primarily to support
> this, and not for other reasons. The pinning of pages is one part
> to support the HW page tables and one part to support the RDMA
> lifetime rules, the liftime rules are what cause problems for
> the VM.

So the driver software can tear down and establish page tables
entries at will? I do not see the problem. The RDMA hardware is one thing,
the way things are visible to the user another. If the driver can
establish and remove mappings as needed via RDMA then the user can have
the illusion of persistent RDMA memory. This is the same as virtual memory
providing the illusion of a process having lots of memory all for itself.


> - The wire protocol consists of packets that say 'Write XXX bytes to
> offset YY in Region RRR'. Creating a region produces the RRR label
> and currently pins the pages. So long as the RRR label is valid the
> remote side can issue write packets at any time without any
> further synchronization. There is no wire level events associated
> with creating RRR. You can pass RRR to the other machine in any
> fashion, even using carrier pigeons :)
> - The RDMA layer is very general (ala TCP), useful protocols (like SCSI)
> are built on top of it and they specify the lifetime rules and
> protocol for exchanging RRR.

Well yes of course. What is proposed here is an additional notification
mechanism (could even be via tcp/udp to simplify things) that would manage
the mappings at a higher level. The writes would not occur if the mapping
has not been established.

> This is your step 'A will then send a message to B notifying..'.
> It simply does not exist in the protocol specifications

Of course. You need to create an additional communication layer to get
that.

> What it boils down to is that to implement true removal of pages in a
> general way the kernel and HCA must either drop packets or stall
> incoming packets, both are big performance problems - and I can't see
> many users wanting this. Enterprise style people using SCSI, NFS, etc
> already have short pin periods and HPC MPI users probably won't care
> about the VM issues enough to warrent the performance overhead.

True maybe you cannot do this by simply staying within the protocol bounds
of RDMA that is based on page pinning if the RDMA protocol does not
support a notification to the other side that the mapping is going away.

If RDMA cannot do this then you would need additional ways of notifying
the remote side that pages/mappings are invalidated.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/