Re: commit 7ffb791423c7 breaks steam game
From: Balbir Singh
Date: Thu Mar 27 2025 - 18:03:55 EST
On 3/27/25 21:53, Ingo Molnar wrote:
>
> * Balbir Singh <balbirs@xxxxxxxxxx> wrote:
>
>>> Yes, turning off CONFIG_HSA_AMD_SVM fixes the issue, the strange memory
>>> resource
>>> afe00000000-affffffffff : 0000:03:00.0
>>> is gone.
>>>
>>> If one would add a max_phys_addr argument to devm_request_free_mem_region()
>>> (which returns the resource addr in kgd2kfd_init_zone_device()) one could keep
>>> the memory below the 44bit limit with CONFIG_HSA_AMD_SVM enabled.
>>>
>>
>> Thanks for reporting the result, does this patch work?
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 01ea7c6df303..14f42f8012ab 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -968,8 +968,9 @@ int add_pages(int nid, unsigned long start_pfn, unsigned long nr_pages,
>>  	WARN_ON_ONCE(ret);
>>
>>  	/* update max_pfn, max_low_pfn and high_memory */
>> -	update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> -				  nr_pages << PAGE_SHIFT);
>> +	if (!params->pgmap)
>> +		update_end_of_memory_vars(start_pfn << PAGE_SHIFT,
>> +					  nr_pages << PAGE_SHIFT);
>>
>>  	return ret;
>>  }
>>
>> It basically prevents max_pfn from moving when the inserted memory is
>> zone_device.
>>
>> FYI: It's a test patch and will still create issues if the amount of
>> present memory (physically) is very high, because the driver needs to
>> enable use_dma32 in that case.
>
> So this patch does the trick for Bert, and I'm wondering what the best
> fix here would be overall, because it's a tricky situation.
>
> Am I correct in assuming that with enough physical memory this bug
> would trigger, with and without nokaslr?
Enough physical memory here refers to the physical memory being larger
than the DMA-addressable bits of the device. So effectively anything
running into several tens of TiB of memory. Even today we assume the device
can address up to 10TiB (because max_pfn touches that limit when the
zone device path gets activated).
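
To make the arithmetic concrete, here is a rough userspace sketch of the
comparison I have in mind. It is a simplified model, not the kernel's
dma_addressing_limited() implementation; the sketch_* names and the 4 KiB
page size are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* All-ones mask covering the low `bits` address bits (bits in 1..64). */
static uint64_t sketch_dma_mask(unsigned int bits)
{
	return bits >= 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/*
 * Highest byte address implied by max_pfn. max_pfn is one past the last
 * valid page frame number, so it must be non-zero here.
 */
static uint64_t sketch_highest_addr(uint64_t max_pfn)
{
	return (max_pfn << SKETCH_PAGE_SHIFT) - 1;
}

/* True if memory up to max_pfn reaches beyond what the device can address. */
static bool sketch_addressing_limited(unsigned int dma_bits, uint64_t max_pfn)
{
	return sketch_highest_addr(max_pfn) > sketch_dma_mask(dma_bits);
}
```

With the resource reported earlier ending at 0xaffffffffff, max_pfn would be
0xb0000000 and a 44-bit device is still fine in this model; one page past
2^44 flips the result.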
>
> I *think* the best approach going forward would be to add the above
> quirk to the x86 memory setup code, but also issue a kernel warning at
> that point with all the relevant information included, so that the
> driver's use_dma32 bug can at least be indicated?
>
> That might also trigger for other systems, because if this scenario is
> so spurious, I doubt it's the only affected driver ...
>
I would like to use the patch to prevent device private memory from
bumping up max_pfn, but I am not sure what the overall impact of restricting
max_pfn to just end of memory is. I suspect it's OK, since max_pfn only
changes on memory hotplug and DEVICE_PRIVATE memory should not be bumping up
max_pfn.
For the warnings, one consideration is where to put them: whether they
need to go into the respective drivers, or trigger whenever we run into a
device that has limited addressing capability, effectively
dma_addressing_limited() returning true. I wonder if that is a heavy
hammer, but a WARN_ON_ONCE can inform the user/administrator that the
system will need bounce buffers and that this could impact performance.
We'd need advice from the DMA maintainers. I'm not sure if the DRM
subsystem or drivers want to do specific things for ttm_device_init().
Balbir Singh