Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems

From: Dan Horák
Date: Tue Mar 30 2021 - 09:08:58 EST

Next message: Rob Herring: "Re: [PATCH v1 1/4] dt-bindings: memory: Add binding for MediaTek Common DRAM Controller"
Previous message: Jason Gunthorpe: "Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs"
In reply to: Xi Ruoyao: "Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems"
Next in thread: Christian König: "Re: [PATCH] drm/amdgpu: fix an underflow on non-4KB-page systems"

On Tue, 30 Mar 2021 14:55:01 +0200
Christian König <christian.koenig@xxxxxxx> wrote:

> Am 30.03.21 um 14:04 schrieb Xi Ruoyao:
> > On 2021-03-30 03:40 +0800, Xi Ruoyao wrote:
> >> On 2021-03-29 21:36 +0200, Christian König wrote:
> >>> Am 29.03.21 um 21:27 schrieb Xi Ruoyao:
> >>>> Hi Christian,
> >>>>
> >>>> I don't think there is any constraint implemented to ensure `num_entries %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0`. For example, in `amdgpu_vm_bo_map()`:
> >>>>
> >>>>          /* validate the parameters */
> >>>>          if (saddr & AMDGPU_GPU_PAGE_MASK || offset & AMDGPU_GPU_PAGE_MASK ||
> >>>>              size == 0 || size & AMDGPU_GPU_PAGE_MASK)
> >>>>                  return -EINVAL;
> >>>>
> >>>> /* snip */
> >>>>
> >>>>          saddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>>          eaddr /= AMDGPU_GPU_PAGE_SIZE;
> >>>>
> >>>> /* snip */
> >>>>
> >>>>          mapping->start = saddr;
> >>>>          mapping->last = eaddr;
> >>>>
> >>>>
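The arithmetic behind this point can be sketched in standalone userspace C (not kernel code). The constants below only mirror the kernel names and are redefined locally. The 64 KiB CPU page size, the helper name gpu_pages_mod_cpu_page(), and the inclusive end address eaddr = saddr + size - 1 (the snipped part of the function is not shown in the mail) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-ins for the kernel macros; the values are assumptions. */
#define GPU_PAGE_SIZE 4096ull   /* mirrors AMDGPU_GPU_PAGE_SIZE */
#define CPU_PAGE_SIZE 65536ull  /* PAGE_SIZE on a hypothetical 64 KiB-page system */
#define GPU_PAGES_IN_CPU_PAGE (CPU_PAGE_SIZE / GPU_PAGE_SIZE)

/*
 * Hypothetical helper: for a mapping of `size` bytes starting at `saddr`
 * (both multiples of the GPU page size), return
 * (last - start + 1) % GPU_PAGES_IN_CPU_PAGE, where start and last are
 * expressed in GPU pages as in the quoted snippet.
 */
uint64_t gpu_pages_mod_cpu_page(uint64_t saddr, uint64_t size)
{
    assert(size != 0);
    assert(saddr % GPU_PAGE_SIZE == 0 && size % GPU_PAGE_SIZE == 0);

    uint64_t eaddr = saddr + size - 1;        /* assumed inclusive end address */
    uint64_t start = saddr / GPU_PAGE_SIZE;
    uint64_t last = eaddr / GPU_PAGE_SIZE;

    return (last - start + 1) % GPU_PAGES_IN_CPU_PAGE;
}
```

Under these assumptions a single 4 KiB mapping passes a check at GPU-page granularity, yet it covers one GPU page out of the sixteen that make up a CPU page, so the remainder is nonzero.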
> >>>> If we really want to ensure (mapping->last - mapping->start + 1) %
> >>>> AMDGPU_GPU_PAGES_IN_CPU_PAGE == 0, then we should replace
> >>>> "AMDGPU_GPU_PAGE_MASK" in "validate the parameters" with "PAGE_MASK".
> > It should be "~PAGE_MASK": "PAGE_MASK" uses the opposite convention from
> > "AMDGPU_GPU_PAGE_MASK" :(.
> >
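The two mask conventions mentioned here can be illustrated with a small userspace sketch. In the kernel, AMDGPU_GPU_PAGE_MASK has the low (offset) bits set, while PAGE_MASK has the high (page-number) bits set, which is why the stricter check needs the inverted form. The macro and function names below are local stand-ins, and the 64 KiB CPU page size is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GPU_PAGE_SIZE 4096ull
#define GPU_PAGE_MASK (GPU_PAGE_SIZE - 1)     /* low bits set, like AMDGPU_GPU_PAGE_MASK */
#define CPU_PAGE_SIZE 65536ull                /* assumed 64 KiB CPU pages */
#define CPU_PAGE_MASK (~(CPU_PAGE_SIZE - 1))  /* high bits set, like PAGE_MASK */

/* Sanity check of the two conventions: bit 0 is set in one and clear in the other. */
void check_mask_conventions(void)
{
    assert((GPU_PAGE_MASK & 1) == 1);
    assert((CPU_PAGE_MASK & 1) == 0);
}

/* Mirrors the quoted check: reject anything not aligned to the GPU page size. */
bool rejects_gpu_granularity(uint64_t saddr, uint64_t offset, uint64_t size)
{
    return (saddr & GPU_PAGE_MASK) || (offset & GPU_PAGE_MASK) ||
           size == 0 || (size & GPU_PAGE_MASK);
}

/* Stricter variant: note the inversion needed to test the offset bits. */
bool rejects_cpu_granularity(uint64_t saddr, uint64_t offset, uint64_t size)
{
    return (saddr & ~CPU_PAGE_MASK) || (offset & ~CPU_PAGE_MASK) ||
           size == 0 || (size & ~CPU_PAGE_MASK);
}
```

With these definitions a 4 KiB request is accepted at GPU-page granularity and rejected at CPU-page granularity, which is the condition under which userspace would start seeing EINVAL.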
> >>> Yeah, good point.
> >>>
> >>>> I tried it and it broke userspace: Xorg startup fails with EINVAL with
> >>>> this change.
> >>> Well in theory it is possible that we always fill the GPUVM on a 4k
> >>> basis while the native page size of the CPU is larger. Let me double
> >>> check the code.
> > On my platform, there are two issues, both causing the VM corruption. One is
> > fixed by agd5f/linux@fe001e7.
>
> What is fe001e7? A commit id? If yes then that is too short and I can't
> find it.

It's a GitLab shortcut for
https://gitlab.freedesktop.org/agd5f/linux/-/commit/fe001e70a55d0378328612be1fdc3abfc68b9ccc

Dan
>
> > Another is in Mesa from userspace: it uses
> > `dev_info->gart_page_size` as the alignment, but the kernel AMDGPU driver
> > expects it to use `dev_info->virtual_address_alignment`.
>
> Mhm, looking at the kernel code I would rather say Mesa is correct and
> the kernel driver is broken.
>
> The gart_page_size is limited by the CPU page size, but the
> virtual_address_alignment isn't.
>
> > If we can make the change to fill the GPUVM on a 4k basis, we can fix this issue
> > and make virtual_address_alignment = 4K. Otherwise, we should fortify the
> > parameter validation, changing "AMDGPU_GPU_PAGE_MASK" to "~PAGE_MASK". Then the
> > userspace will just get an EINVAL, instead of a silent VM corruption. And
> > someone should tell Mesa developers to fix the code in this case.
>
> I rather see this as a kernel bug. Can you test if this code fragment
> fixes your issue:
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 64beb3399604..e1260b517e1b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -780,7 +780,7 @@ int amdgpu_info_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>                 }
>                 dev_info->virtual_address_alignment = max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE);
>                 dev_info->pte_fragment_size = (1 << adev->vm_manager.fragment_size) * AMDGPU_GPU_PAGE_SIZE;
> -               dev_info->gart_page_size = AMDGPU_GPU_PAGE_SIZE;
> +               dev_info->gart_page_size = dev_info->virtual_address_alignment;
>                 dev_info->cu_active_number = adev->gfx.cu_info.number;
>                 dev_info->cu_ao_mask = adev->gfx.cu_info.ao_cu_mask;
>                 dev_info->ce_ram_size = adev->gfx.ce_ram_size;
>
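To make the effect of the proposed change concrete, here is a rough userspace sketch of the two reported values before and after the patch. The struct dev_info_sketch and the function fill_dev_info() are hypothetical stand-ins for the fields touched in the diff. Only the max() of the CPU and GPU page sizes and the gart_page_size assignment are taken from the patch; the CPU page size is passed in as a parameter so that both 4 KiB and 64 KiB systems can be compared.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define GPU_PAGE_SIZE 4096u  /* mirrors AMDGPU_GPU_PAGE_SIZE */

/* Hypothetical subset of the device info reported to userspace. */
struct dev_info_sketch {
    uint32_t virtual_address_alignment;
    uint32_t gart_page_size;
};

/* Fill the two fields for a given CPU page size, with or without the patch. */
struct dev_info_sketch fill_dev_info(uint32_t cpu_page_size, bool patched)
{
    struct dev_info_sketch info;

    /* The page size is expected to be a nonzero power of two. */
    assert(cpu_page_size != 0 && (cpu_page_size & (cpu_page_size - 1)) == 0);

    /* Same as max((int)PAGE_SIZE, AMDGPU_GPU_PAGE_SIZE) in the diff. */
    info.virtual_address_alignment =
        cpu_page_size > GPU_PAGE_SIZE ? cpu_page_size : GPU_PAGE_SIZE;

    /* Unpatched: always the GPU page size. Patched: follows the alignment. */
    info.gart_page_size =
        patched ? info.virtual_address_alignment : GPU_PAGE_SIZE;

    return info;
}
```

On a 4 KiB-page system both variants report the same values, so the change would only be visible to userspace when the CPU page size is larger than the GPU page size.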
>
> Thanks,
> Christian.
>
> > --
> > Xi Ruoyao <xry111@xxxxxxxxxxxxxxxx>
> > School of Aerospace Science and Technology, Xidian University
> >
>
