Re: commit 7ffb791423c7 breaks steam game

From: Bert Karwatzki
Date: Wed Mar 26 2025 - 06:11:07 EST


On Wednesday, 26.03.2025 at 12:50 +1100, Balbir Singh wrote:
> On 3/26/25 10:43, Balbir Singh wrote:
> > On 3/26/25 10:21, Bert Karwatzki wrote:
> > > On Wednesday, 26.03.2025 at 09:45 +1100, Balbir Singh wrote:
> > > >
> > > >
> > > > The second region seems to be additional, I suspect that is HMM mapping from kgd2kfd_init_zone_device()
> > > >
> > > > Balbir Singh
> > > >
> > > Good guess! I inserted a printk into kgd2kfd_init_zone_device():
> > >
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > index d05d199b5e44..201220e2ac42 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_migrate.c
> > > @@ -1049,6 +1049,8 @@ int kgd2kfd_init_zone_device(struct amdgpu_device *adev)
> > > pgmap->range.end = res->end;
> > > pgmap->type = MEMORY_DEVICE_PRIVATE;
> > > }
> > > + dev_info(adev->dev, "%s: range.start = 0x%llx ranges.end = 0x%llx\n",
> > > + __func__, pgmap->range.start, pgmap->range.end);
> > >
> > > pgmap->nr_range = 1;
> > > pgmap->ops = &svm_migrate_pgmap_ops;
> > >
> > >
> > > and get this in the case without nokaslr:
> > >
> > > [ T367] amdgpu 0000:03:00.0: kfd_migrate: kgd2kfd_init_zone_device: range.start = 0xafe00000000 ranges.end = 0xaffffffffff
> > >
> > > and this in the case with nokaslr:
> > >
> > > [ T365] amdgpu 0000:03:00.0: kfd_migrate: kgd2kfd_init_zone_device: range.start = 0x3ffe00000000 ranges.end = 0x3fffffffffff
> > >
> >
> > So we should ignore the second region then for the purposes of this issue.
> >
> > I think this now boils down to
> >
> > Why is the dma_get_required_mask set to all of addressable memory (46 bits)
> > when we have nokaslr?
> >
>
> I think I know the root cause of the required_mask going up and hence the
> use of DMA32
>
> 1. HMM calls add_pages()
> 2. add_pages calls update_end_of_memory_vars()
> 3. This updates max_pfn and that causes required_mask to go up to 46 bits
>
> Do you have CONFIG_HSA_AMD_SVM enabled? Does turning it off fix the issue?
>
> The actual issue is the update of max_pfn.
>
> Balbir Singh
>

Yes, turning off CONFIG_HSA_AMD_SVM fixes the issue, and the strange memory
resource
afe00000000-affffffffff : 0000:03:00.0
is gone.
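
To illustrate the three steps quoted above, here is a rough userspace sketch
(not kernel code) of how a required mask can be derived from the highest
physical address. It assumes the mask is simply the smallest all-ones value
covering that address; fls64_sketch() and required_mask_sketch() are made-up
names for this example, and the real kernel computation works from max_pfn
and may differ in detail.

```c
#include <stdint.h>

/* Position of the highest set bit, counted from 1; 0 if x == 0. */
static unsigned int fls64_sketch(uint64_t x)
{
	unsigned int n = 0;

	while (x) {
		n++;
		x >>= 1;
	}
	return n;
}

/* Smallest all-ones mask that covers max_addr (assumed model, see above). */
static uint64_t required_mask_sketch(uint64_t max_addr)
{
	unsigned int bits = fls64_sketch(max_addr);

	if (bits == 0)
		return 0;
	/* (2^(bits-1)) * 2 - 1 avoids a shift by 64 when bits == 64. */
	return (UINT64_C(1) << (bits - 1)) * 2 - 1;
}
```

Fed with the two range ends from the dev_info() output, 0xaffffffffff gives a
44-bit mask (0xfffffffffff) and 0x3fffffffffff gives a 46-bit mask
(0x3fffffffffff), which would match the 46 bits seen with nokaslr.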

If one added a max_phys_addr argument to devm_request_free_mem_region()
(which returns the resource used in kgd2kfd_init_zone_device()), one could
keep the memory below the 44-bit limit with CONFIG_HSA_AMD_SVM enabled.
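
A minimal userspace sketch of what such a limit could mean, under stated
assumptions: pick_region_below() is a hypothetical helper, not an existing
kernel function, and it only computes the highest aligned range that fits
below the limit. It ignores conflicts with already claimed resources, which a
real implementation would have to handle by stepping further down.

```c
#include <stdint.h>

/*
 * Hypothetical helper: find the highest 'align'-aligned start such that
 * [start, start + size - 1] ends at or below max_phys_addr.
 * 'align' must be a power of two.
 * Returns 0 and stores the start in *start, or -1 if nothing fits.
 */
static int pick_region_below(uint64_t max_phys_addr, uint64_t size,
			     uint64_t align, uint64_t *start)
{
	if (size == 0 || align == 0 || (align & (align - 1)) != 0)
		return -1;
	if (size - 1 > max_phys_addr)
		return -1;

	/* Highest possible start, then round down to the alignment. */
	*start = (max_phys_addr - (size - 1)) & ~(align - 1);
	return 0;
}
```

For the 8 GiB region seen here (size 0x200000000) and a 44-bit limit of
0xfffffffffff, this would yield a start of 0xffe00000000 with a 128 MiB
alignment (the alignment value is an assumption made for the example).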

Bert Karwatzki