Re: [RESEND PATCH 2/3] nouveau: fix mixed normal and device private page migration

From: Ralph Campbell
Date: Mon Jun 22 2020 - 21:42:12 EST



On 6/22/20 5:30 PM, John Hubbard wrote:
On 2020-06-22 16:38, Ralph Campbell wrote:
The OpenCL function clEnqueueSVMMigrateMem(), without any flags, will
migrate memory in the given address range to device private memory. The
source pages might already have been migrated to device private memory.
In that case, the source struct page is not checked to see if it is a
device private page, so the GPU's physical address of local memory is
computed incorrectly, leading to data corruption.
Fix this by checking the source struct page and computing the correct
physical address.
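
A minimal host-side sketch of the trigger, assuming an OpenCL 2.x
context, queue, and SVM allocation already exist (queue, svm_buf, and
size are illustrative names, not part of the patch; error checking
omitted):

#include <CL/cl.h>

/* Migrate the same SVM range to the device twice with no flags. On the
 * second call the source pages are already device private, which is
 * the case this patch fixes.
 */
static void migrate_range_twice(cl_command_queue queue, void *svm_buf,
				size_t size)
{
	const void *ptrs[] = { svm_buf };
	size_t lens[] = { size };

	/* First call: anonymous pages move to device private memory. */
	clEnqueueSVMMigrateMem(queue, 1, ptrs, lens, 0, 0, NULL, NULL);

	/* Second call over the same range: the sources are now device
	 * private pages.
	 */
	clEnqueueSVMMigrateMem(queue, 1, ptrs, lens, 0, 0, NULL, NULL);
	clFinish(queue);
}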

Signed-off-by: Ralph Campbell <rcampbell@xxxxxxxxxx>
---
 drivers/gpu/drm/nouveau/nouveau_dmem.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_dmem.c b/drivers/gpu/drm/nouveau/nouveau_dmem.c
index cc9993837508..f6a806ba3caa 100644
--- a/drivers/gpu/drm/nouveau/nouveau_dmem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_dmem.c
@@ -540,6 +540,12 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 	if (!(src & MIGRATE_PFN_MIGRATE))
 		goto out;
 
+	if (spage && is_device_private_page(spage)) {
+		paddr = nouveau_dmem_page_addr(spage);
+		*dma_addr = DMA_MAPPING_ERROR;
+		goto done;
+	}
+
 	dpage = nouveau_dmem_page_alloc_locked(drm);
 	if (!dpage)
 		goto out;
@@ -560,6 +566,7 @@ static unsigned long nouveau_dmem_migrate_copy_one(struct nouveau_drm *drm,
 			goto out_free_page;
 	}
 
+done:
 	*pfn = NVIF_VMM_PFNMAP_V0_V | NVIF_VMM_PFNMAP_V0_VRAM |
 		((paddr >> PAGE_SHIFT) << NVIF_VMM_PFNMAP_V0_ADDR_SHIFT);
 	if (src & MIGRATE_PFN_WRITE)
@@ -615,6 +622,7 @@ nouveau_dmem_migrate_vma(struct nouveau_drm *drm,
 	struct migrate_vma args = {
 		.vma		= vma,
 		.start		= start,
+		.src_owner	= drm->dev,

Hi Ralph,

This .src_owner setting does look like a required fix, but it seems like
a completely separate fix from what is listed in this patch's commit
description, right? (It feels like a casualty of rearranging the patches.)


thanks,

It's a bit more complex: there is a catch-22 here with the change to
mm/migrate.c.
- Without this patch and without the mm/migrate.c change, a second call
  to clEnqueueSVMMigrateMem() for the same address range invalidates the
  GPU mapping to device private memory created by the first call.
- With this patch but without the mm/migrate.c change, the first call to
  clEnqueueSVMMigrateMem() fails to migrate normal anonymous memory to
  device private memory.
- Without this patch but with the mm/migrate.c change, a second call to
  clEnqueueSVMMigrateMem() crashes the kernel, because dma_map_page() is
  called with the device private PFN, which is not a valid CPU physical
  address.
- With both changes, a range of mixed anonymous and device private pages
  can be migrated to the GPU, and the GPU page tables are updated
  properly.