Re: [ofa-general] Re: Demand paging for memory regions

From: Christian Bell
Date: Tue Feb 12 2008 - 21:01:52 EST


On Tue, 12 Feb 2008, Christoph Lameter wrote:

> On Tue, 12 Feb 2008, Jason Gunthorpe wrote:
>
> > Well, certainly today the memfree IB devices store the page tables in
> > host memory so they are already designed to hang onto packets during
> > the page lookup over PCIE, adding in faulting makes this time
> > larger.
>
> You really do not need a page table to use it. What needs to be maintained
> is knowledge on both side about what pages are currently shared across
> RDMA. If the VM decides to reclaim a page then the notification is used to
> remove the remote entry. If the remote side then tries to access the page
> again then the page fault on the remote side will stall until the local
> page has been brought back. RDMA can proceed after both sides again agree
> on that page now being sharable.

HPC environments won't be amenable to a pessimistic approach of
synchronizing before every data transfer. RDMA is assumed to be a
low-level data movement mechanism that has no implied
synchronization. In some parallel programming models, it's not
uncommon to use RDMA to send 8-byte messages. It can be difficult to
make and hold guarantees about in-memory pages when many concurrent
RDMA operations are in flight (not uncommon in reasonably large
machines). Some of the in-memory page information could be shared
with some form of remote caching strategy but then it's a different
problem with its own scalability challenges.

I think there are real potential clients of the interface when an
optimistic approach is used. Part of the trick, however, has to do
with being able to re-start transfers instead of buffering the data
or making guarantees about delivery that could cause deadlock (as was
alluded to earlier in this thread). InfiniBand is constrained in
this regard since it requires message-ordering between endpoints (or
queue pairs). One could argue that this is still possible with IB,
at the cost of throwing more packets away when a referenced page is
not in memory. With this approach, the worst-case demand paging
scenario is met when the active working set of referenced pages is
larger than the amount of physical memory -- but HPC applications are
already bound by this anyway.
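
As a rough illustration of the restart idea, here is a toy Python model.
It is only a sketch under stated assumptions, not how any IB HCA or the
Quadrics hardware actually behaves: the Receiver class, the send_all()
helper, the 4096-byte page size and the window of 4 packets are all
hypothetical names and values chosen for the example. The receiver
throws away a packet whose target page is not resident (and, to keep
ordering, every later packet already in flight), and the sender simply
restarts from the first rejected sequence number instead of anyone
buffering data.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size for this toy model


@dataclass
class Receiver:
    """Hypothetical receiver that drops packets instead of buffering them."""
    resident: set = field(default_factory=set)  # page numbers currently in memory
    expected_seq: int = 0   # next in-order sequence number
    dropped: int = 0        # packets thrown away
    faults: int = 0         # page faults triggered by incoming packets

    def deliver(self, seq: int, addr: int) -> bool:
        """Return True if the packet is accepted, False if it is dropped."""
        if seq != self.expected_seq:
            # Out of order (an earlier packet was rejected): drop to keep ordering.
            self.dropped += 1
            return False
        page = addr // PAGE_SIZE
        if page not in self.resident:
            # Page not in memory: drop the packet and start paging it in.
            # Assumption: the page-in has completed by the time the sender retries.
            self.dropped += 1
            self.faults += 1
            self.resident.add(page)
            return False
        self.expected_seq += 1
        return True


def send_all(receiver: Receiver, addrs: list, window: int = 4) -> int:
    """Send one packet per address, restarting from the first rejected one.

    Returns the total number of packets put on the wire, including resends.
    """
    base = 0
    sent = 0
    while base < len(addrs):
        end = min(base + window, len(addrs))
        first_rejected = None
        for seq in range(base, end):
            sent += 1
            accepted = receiver.deliver(seq, addrs[seq])
            if not accepted and first_rejected is None:
                first_rejected = seq
        # Optimistic restart: resume at the first rejected packet, if any.
        base = first_rejected if first_rejected is not None else end
    return sent
```

For example, with Receiver(resident={0}) and target addresses
[0, 8, 4096, 4104], the third packet faults on page 1, the fourth is
discarded to preserve ordering, and both are resent: six packets on the
wire for four delivered. The cost of being optimistic shows up as
discarded packets rather than as a synchronization step before every
transfer.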

You'll find that Quadrics has the most experience in this area and
that their entire architecture is adapted to being optimistic about
demand paging in RDMA transfers -- they've been maintaining a patchset
to do this for years.

. . christian
