Re: [PATCH 0/1] Fiji GPU audio register timeout when in BACO state

From: Takashi Iwai
Date: Wed Apr 29 2020 - 12:05:09 EST


On Wed, 29 Apr 2020 17:47:47 +0200,
Alex Deucher wrote:
>
> On Wed, Apr 29, 2020 at 11:27 AM Nicholas Johnson
> <nicholas.johnson-opensource@xxxxxxxxxxxxxx> wrote:
> >
> > On Wed, Apr 29, 2020 at 09:37:41AM +0200, Takashi Iwai wrote:
> > > On Tue, 28 Apr 2020 16:48:45 +0200,
> > > Nicholas Johnson wrote:
> > > >
> > > > > > > >
> > > > > > > > FWIW, I have a fiji board in a desktop system and it worked fine when
> > > > > > > > this code was enabled.
> > > > > > >
> > > > > > > Is the new DC code used for Fiji boards? IIRC, the audio component
> > > > > > > binding from amdgpu is enabled only for DC, and without the audio
> > > > > > > component binding the runtime PM won't be linked up, hence you can't
> > > > > > > power up GPU from the audio side access automatically.
> > > > > > >
> > > > > >
> > > > > > Yes, DC is enabled by default for all cards with runtime pm enabled.
> > > > >
> > > > > OK, thanks, I found that amdgpu got bound via component in the dmesg
> > > > > output, too:
> > > > >
> > > > > [ 21.294927] snd_hda_intel 0000:08:00.1: bound 0000:08:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
> > > > >
> > > > > This is the place soon after amdgpu driver gets initialized.
> > > > > Then we see later another initialization phase:
> > > > >
> > > > > [ 26.904127] rfkill: input handler enabled
> > > > > [ 37.264152] [drm] PCIE GART of 1024M enabled (table at 0x000000F400000000).
> > > > >
> > > > > here shows 10 seconds between them. Then, it complained something:
> > > > >
> > > > >
> > > > > [ 37.363287] [drm] UVD initialized successfully.
> > > > > [ 37.473340] [drm] VCE initialized successfully.
> > > > > [ 37.477942] amdgpu 0000:08:00.0: [drm] Cannot find any crtc or sizes
> > > >
> > > > The above would be me hitting WindowsKey+P to change screens, but with
> > > > no DisplayPort attached to Fiji, hence it unable to find crtc.
> > > >
> > > > >
> > > > > ... and go further, and hitting HD-audio error:
> > > > >
> > > > That would be me having attached the DisplayPort and done WindowsKey+P
> > > > again.
> > > >
> > > > > [ 38.936624] [drm] fb mappable at 0x4B0696000
> > > > > [ 38.936626] [drm] vram apper at 0x4B0000000
> > > > > [ 38.936626] [drm] size 33177600
> > > > > [ 38.936627] [drm] fb depth is 24
> > > > > [ 38.936627] [drm] pitch is 15360
> > > > > [ 38.936673] amdgpu 0000:08:00.0: fb1: amdgpudrmfb frame buffer device
> > > > > [ 40.092223] snd_hda_intel 0000:08:00.1: azx_get_response timeout, switching to polling mode: last cmd=0x00170500
> > > > >
> > > > > After this point, HD-audio communication was screwed up.
> > > > >
> > > > > This lastcmd in the above message is AC_SET_POWER_STATE verb for the
> > > > > root node to D0, so the very first command to power up the codec.
> > > > > The rest commands are also about the power up of each node, so the
> > > > > whole error indicate that the power up at runtime resume failed.
> > > > >
> > > > > So, this looks to me as if the device gets runtime-resumed at the bad
> > > > > moment?
> > > > It does. However, this is not going to be easy to pin down.
> > > >
> > > > I moved from Arch to Ubuntu, and it behaves differently. I cannot
> > > > trigger the bug in Ubuntu. Plus, it puts the GPUs asleep, even if
> > > > attached at boot, unlike Arch. I will continue to try to trigger it. But
> > > > even if this is a problem with the Linux distribution, it should not be
> > > > able to trigger a kernel mode bug, so we should persist with finding it.
> > >
> > > Sure, that's a bug to be fixed.
> > >
> > > This made me thinking what happens if we load the HD-audio driver very
> > > late. Could you try to blacklist snd-hda-intel module, then load it
> > > manually after plugging the DP monitor and activating it?
> > Attached dmesg-blacklist-*
> >
> > It is interesting. If I enable the monitor with the module unloaded, and
> > then load the module, I cannot trigger the bug, even if disabling the
> > monitor, waiting for GPU to sleep, and then waking again.
> >
> > Even if I wake monitor up, put to sleep again, and then insmod when
> > sleeping, it does not cause bug when waking again.
> >
> > Is there anything special about the first time the monitor is used?
> >
>
> What do you mean by used? Do you mean plugged in to the GPU or used
> in the GUI? It might be easier to debug this without a GUI involved.
> Can you try this at runlevel 3 or something equivalent for your
> distro?
>
> When the GPU is powered up, the driver gets an interrupt when a
> display is hotplugged and generates an event and userspace
> applications can listen for these events. When the GPU is powered
> down, there's no interrupt. I think most GUIs poll GPUs periodically
> to handle this case so they can detect a new display even when the GPU
> is off. Maybe we are getting some sort of race here. GUI queries GPU
> driver, causes GPU to wake up, checks attached displays, GPU driver
> resets runtime pm timer. GPU goes back to sleep. The detection
> updates the ELD data which causes the HDA driver to wake up. It
> assumes the hw is on and tries to query it. In the meantime, the GPU
> has already powered everything down again.

Well, but the code path there is the runtime PM resume of the audio
device and it means that GPU must have been runtime-resumed again
beforehand via the device link. So, it should have worked from the
beginning but in reality not -- that is, apparently some inconsistency
is found in the initial attempt of the runtime resume...


Takashi