Re: AMD graphics performance regression in 4.15 and later

From: Michel Dänzer
Date: Mon Apr 23 2018 - 06:24:02 EST


On 2018-04-20 09:40 PM, Felix Kuehling wrote:
> On 2018-04-20 10:47 AM, Michel Dänzer wrote:
>> On 2018-04-11 11:37 AM, Christian König wrote:
>>> Am 11.04.2018 um 06:00 schrieb Gabriel C:
>>>> 2018-04-09 11:42 GMT+02:00 Christian König
>>>> <ckoenig.leichtzumerken@xxxxxxxxx>:
>>>>> Am 07.04.2018 um 00:00 schrieb Jean-Marc Valin:
>>>>>> Hi Christian,
>>>>>>
>>>>>> Thanks for the info. FYI, I've also opened a Firefox bug for that at:
>>>>>> https://bugzilla.mozilla.org/show_bug.cgi?id=1448778
>>>>>> Feel free to comment since you have a better understanding of what's
>>>>>> going on.
>>>>>>
>>>>>> One last question: right now I'm running 4.15.0 with the "offending"
>>>>>> patch reverted. Is that safe to run or are there possible bad
>>>>>> interactions with other changes.
>>>>> That should work without problems.
>>>>>
>>>>> But I just had another idea as well: if you want, you could still test
>>>>> the new code path which will be used in 4.17.
>>>>>
>>>> While Firefox may do some strange things, this is not only about Firefox.
>>>>
>>>> With your patches my EPYC box is unusable with 4.15++ kernels.
>>>> The whole desktop is acting weird. This one is using
>>>> a Cape Verde PRO [Radeon HD 7750/8740 / R7 250E] GPU.
>>>>
>>>> Box is 2 * EPYC 7281 with 128 GB ECC RAM
>>>>
>>>> Also, a 14C Xeon box with a HD7700 is broken in the same way.
>>> The hardware is irrelevant for this. We need to know what software stack
>>> you use on top of it.
>>>
>>> E.g. desktop environment/Mesa and DDX version etc...
>>>
>>>> Everything breaks in X: scrolling, moving windows, flickering, etc.
>>>>
>>>>
>>>> reverting f4c809914a7c3e4a59cf543da6c2a15d0f75ee38 and
>>>> 648bc3574716400acc06f99915815f80d9563783
>>>> from a 4.15 kernel makes things work again.
>>>>
>>>>
>>>>> Backporting all the detection logic is too invasive, but you could just
>>>>> go into drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c and forcefully use the
>>>>> other code path.
>>>>>
>>>>> Just look out for "#ifdef CONFIG_SWIOTLB" checks and disable those.
>>>>>
>>>> Well, you can't really be serious about these suggestions, can you?
>>>>
>>>> Telling people to #if 0 random code is not a solution.
>>> That is for testing and not a permanent solution.
>>>
>>>> You broke existing working userland with your patches, so please at
>>>> least fix that for 4.16.
>>>>
>>>> I can help test code for 4.17/++ if you wish, but that is a
>>>> *different* story.
>>> Please test Alex's amd-staging-drm-next branch from
>>> git://people.freedesktop.org/~agd5f/linux.
>> I think we're still missing something here.
>>
>> I'm currently running 4.16.2 + the DRM subsystem changes which are going
>> into 4.17 (so I have the changes Christian is referring to) with a
>> Kaveri APU, and I'm seeing similar symptoms as described by Jean-Marc.
>> Some observations:
>>
>> Firefox, Thunderbird, or worst of all, gnome-shell, can freeze for on the
>> order of a minute, during which the kernel is spending most of one
>> core's cycles inside alloc_pages (__alloc_pages_nodemask to be more
>> precise), called from ttm_alloc_new_pages.
> Philip debugged a similar problem with a KFD memory stress test about
> two weeks ago, where the kernel was seemingly stuck in an infinite loop
> trying to allocate huge pages. I'm pasting his analysis for the record:
>
>> [...] it uses huge_flags GFP_TRANSHUGE to call alloc_pages(). This
>> seems to be a corner case inside __alloc_pages_slowpath(): it never exits
>> but goes to the retry path every time. It can reclaim pages and sets
>> did_some_progress (as a result, no_progress_loops is reset to 0 on every
>> loop and never reaches MAX_RECLAIM_RETRIES), but it cannot finish huge
>> page allocations under this specific memory pressure.
> As a workaround to unblock our release branch testing we removed
> transparent huge page allocation from ttm_get_pages. We're seeing this
> as far back as 4.13 on our release branch.

Thanks for sharing this. In the future, please raise issues like this on
the public mailing lists from the beginning.


> If we're really talking about the same problem, I don't think it's
> caused by recent page allocator changes, but rather exposed by recent
> TTM changes.

It sounds related, but probably not exactly the same problem. I already
had the TTM code using GFP_TRANSHUGE before I ran into the issue. Also,
__alloc_pages_slowpath eventually succeeds for me, it can just take up
to about a minute.

I'm currently testing using (GFP_TRANSHUGE_LIGHT | __GFP_NORETRY)
instead of GFP_TRANSHUGE in TTM.
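
For illustration only, the retry behaviour from Philip's analysis can be
modelled in userspace roughly as below. This is a toy sketch, not the
kernel's __alloc_pages_slowpath(): model_slowpath(), struct alloc_result
and all of its parameters are hypothetical names made up for the example,
and the value 16 for MAX_RECLAIM_RETRIES is an assumption. It only shows
why a counter that is reset on every bit of reclaim progress never reaches
its limit, and how a "no retry" caller would bail out after one attempt.

```c
#include <assert.h>
#include <stdbool.h>

/* Named after the kernel constant; the value is an assumption here. */
#define MAX_RECLAIM_RETRIES 16

/* Hypothetical result type for the model. */
struct alloc_result {
    bool success;    /* did the modelled allocation succeed? */
    int iterations;  /* how many passes through the loop were made */
};

/*
 * Toy model of an allocation retry loop (all names hypothetical).
 *
 * reclaim_progress: every reclaim pass frees some pages
 * succeeds_on_pass: pass on which the allocation succeeds (0 = never)
 * noretry:          caller gives up after the first failed attempt
 * max_iterations:   safety cap for the model only, not a kernel concept
 */
struct alloc_result model_slowpath(bool reclaim_progress, int succeeds_on_pass,
                                   bool noretry, int max_iterations)
{
    struct alloc_result r = { false, 0 };
    int no_progress_loops = 0;

    assert(max_iterations > 0);

    while (r.iterations < max_iterations) {
        r.iterations++;

        /* Try the allocation itself. */
        if (succeeds_on_pass > 0 && r.iterations >= succeeds_on_pass) {
            r.success = true;
            return r;
        }

        /* A "no retry" caller fails fast instead of looping. */
        if (noretry)
            return r;

        /* Any reclaim progress resets the counter, as in the analysis. */
        if (reclaim_progress)
            no_progress_loops = 0;
        else
            no_progress_loops++;

        /* Only reachable when reclaim stops making progress. */
        if (no_progress_loops > MAX_RECLAIM_RETRIES)
            return r;
    }

    return r; /* hit the model's safety cap without succeeding */
}
```

In this model, model_slowpath(true, 0, false, cap) runs until the safety
cap, which corresponds to the "never exits" case, while passing
noretry = true returns after a single pass. A non-zero succeeds_on_pass
corresponds to what I am seeing, where the allocation does succeed but only
after many passes.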


--
Earthling Michel Dänzer | http://www.amd.com
Libre software enthusiast | Mesa and X developer