On Tue, 30 Mar 2021 14:55:01 +0200
Christian König <christian.koenig@xxxxxxx> wrote:
> On 30.03.21 at 14:04, Xi Ruoyao wrote:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> > > On 2021-03-29 21:36 +0200, Christian König wrote:
> > > > On 29.03.21 at 21:27, Xi Ruoyao wrote:
> > > > > Hi Christian,
> > > > >
> > > > > I don't think there is any constraint implemented to ensure `num_entries %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> > > > >
> > > > >         /* validate the parameters */
> > > > >         if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
> > > > >             size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> > > > >                 return -EINVAL;
> > > > >
> > > > >         /* snip */
> > > > >
> > > > >         saddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > >         eaddr /= AMDGPU_GPU_PAGE_SIZE;
> > > > >
> > > > >         /* snip */
> > > > >
> > > > >         mapping->start = saddr;
> > > > >         mapping->last = eaddr;
> > > > >
> > > > > If we really want to ensure (mapping->last - mapping->start + 1) %
> > > > > AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> > > > > "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".
> >
> > It should be "~PAGE_MASK", "PAGE_MASK" has an opposite convention of
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
> > > > Yeah, good point.
> > > >
> > > > > I tried it and it broke userspace: Xorg startup fails with EINVAL with
> > > > > this change.
> > > >
> > > > Well in theory it is possible that we always fill the GPUVM on a 4k
> > > > basis while the native page size of the CPU is larger. Let me double
> > > > check the code.
> > >
> > > On my platform, there are two issues both causing the VM corruption. One is
> > > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is too short and I can't
> find it.

it's a gitlab shortcut for
https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc

		Dan
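
The two mask conventions mentioned above are easy to mix up, so here is a minimal standalone sketch of the difference. It is not kernel code: the `DEMO_` macros and `demo_` helpers are hypothetical names that only mirror the kernel conventions (`PAGE_MASK` has the high bits set, `AMDGPU_GPU_PAGE_MASK` has the low bits set), and the 16 KiB CPU page size is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel macros (assumed 16 KiB CPU pages). */
#define DEMO_CPU_PAGE_SIZE 16384ULL
#define DEMO_CPU_PAGE_MASK (~(DEMO_CPU_PAGE_SIZE - 1)) /* PAGE_MASK style: high bits set */
#define DEMO_GPU_PAGE_SIZE 4096ULL
#define DEMO_GPU_PAGE_MASK (DEMO_GPU_PAGE_SIZE - 1)    /* AMDGPU_GPU_PAGE_MASK style: low bits set */

/* Same shape as the existing validation: true if v is GPU-page aligned. */
static inline bool demo_gpu_aligned(uint64_t v)
{
	return (v & DEMO_GPU_PAGE_MASK) == 0;
}

/* Stricter variant discussed in the thread: true if v is CPU-page aligned. */
static inline bool demo_cpu_aligned(uint64_t v)
{
	return (v & ~DEMO_CPU_PAGE_MASK) == 0;
}

/* What the check becomes if PAGE_MASK is used without the inversion. */
static inline bool demo_wrong_check(uint64_t v)
{
	return (v & DEMO_CPU_PAGE_MASK) == 0;
}

/* Number of GPU page entries covered by a GPU-page-aligned size. */
static inline uint64_t demo_gpu_pages(uint64_t size)
{
	assert(demo_gpu_aligned(size));
	return size / DEMO_GPU_PAGE_SIZE;
}
```

Under these assumptions a 4096-byte size passes `demo_gpu_aligned()` but not `demo_cpu_aligned()`, and `demo_gpu_pages(4096)` is 1, which is not a multiple of the 4 GPU pages that fit in one assumed CPU page. `demo_wrong_check()` accepts 4096 and rejects 16384, which is the opposite of what an alignment check should do.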
> > > Another is in Mesa from userspace: it uses
> > > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> > > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> > > userspace will just get an EINVAL, instead of a silent VM corruption. And
> > > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>  		}
>  		dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>  		dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> -		dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> +		dev_info->gart_page_size = dev_info->virtual_address_alignment;
>  		dev_info->cu_active_number = adev->gfx.cu_info.number;
>  		dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>  		dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
> Thanks,
> Christian.
>
> > -- 
> > Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx>
> > School of Aerospace Science and Technology, Xidian University
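
To make the effect of the proposed fragment concrete, the following is a small userspace sketch of the two reported fields before and after the change. It is an illustration only: `struct demo_dev_info`, `demo_fill_info()` and `DEMO_AMDGPU_GPU_PAGE_SIZE` are hypothetical names, not the driver's, and the CPU page size is passed in as a parameter rather than taken from the kernel's `PAGE_SIZE`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_AMDGPU_GPU_PAGE_SIZE 4096u /* mirrors AMDGPU_GPU_PAGE_SIZE */

/* Hypothetical subset of the info structure, limited to the two fields discussed. */
struct demo_dev_info {
	uint32_t virtual_address_alignment;
	uint32_t gart_page_size;
};

/*
 * Fill the fields the way the quoted diff does:
 * patched == false follows the removed line, patched == true the added line.
 */
static inline struct demo_dev_info demo_fill_info(uint32_t cpu_page_size, bool patched)
{
	struct demo_dev_info info;

	/* Page sizes are assumed to be non-zero powers of two. */
	assert(cpu_page_size != 0 && (cpu_page_size & (cpu_page_size - 1)) == 0);

	/* max(PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE), unchanged by the diff */
	info.virtual_address_alignment =
		cpu_page_size > DEMO_AMDGPU_GPU_PAGE_SIZE ? cpu_page_size
							  : DEMO_AMDGPU_GPU_PAGE_SIZE;

	info.gart_page_size = patched ? info.virtual_address_alignment
				      : DEMO_AMDGPU_GPU_PAGE_SIZE;
	return info;
}
```

With an assumed 16 KiB CPU page, the unpatched variant reports a `gart_page_size` of 4096 next to a `virtual_address_alignment` of 16384, while the patched variant reports 16384 for both. With 4 KiB CPU pages the two variants agree, so the change would only be visible on systems whose CPU page size exceeds the GPU page size.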