Re: linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon

From: Mario Kleiner
Date: Thu Jan 21 2016 - 00:31:37 EST

On 01/21/2016 04:43 AM, Michel Dänzer wrote:
On 21.01.2016 05:32, Mario Kleiner wrote:

So the problem is that AMD's hardware frame counters reset to
zero during a modeset. The old DRM code dealt with drivers doing that by
keeping vblank irqs enabled during modesets and incrementing the vblank
count by one during each vblank irq. I think that's what
drm_vblank_pre_modeset() and drm_vblank_post_modeset() were meant for.

Right, looks like there's been a regression breaking this. I suspect the
problem is that vblank->last isn't getting updated from
drm_vblank_post_modeset. Not sure which change broke that though, or how
to fix it. Ville?

The whole logic has changed and the software counter updates are now driven all the time by the hw counter.
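
To illustrate why a hw counter that resets to zero is a problem once the software count follows the hw counter, here is a minimal userspace sketch. It is a toy model, not the DRM code: struct toy_vblank, toy_update_vblank_count() and the 24-bit HW_COUNTER_MASK are made up for illustration, and the real counter width is driver specific.

```c
#include <stdint.h>

/* Assumption: a 24-bit hardware frame counter; the real width depends on the driver. */
#define HW_COUNTER_MASK 0x00ffffffu

/* Hypothetical stand-in for the per-crtc vblank state. */
struct toy_vblank {
	uint32_t count; /* software vblank count seen by clients */
	uint32_t last;  /* hardware counter value at the previous update */
};

/*
 * Advance the software count by however far the hardware counter moved
 * since the last update, and return that distance.
 */
static uint32_t toy_update_vblank_count(struct toy_vblank *vbl, uint32_t hw_now)
{
	uint32_t diff = (hw_now - vbl->last) & HW_COUNTER_MASK;

	vbl->count += diff;
	vbl->last = hw_now;
	return diff;
}
```

Starting from count = last = 1000, an update with hw_now = 1001 gives a diff of 1. If a modeset then resets the hw counter and the next update sees hw_now = 2 while "last" still holds 1001, the masked difference is 16776217, so the software count takes a bogus jump of almost 2^24 instead of advancing by a couple of frames.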

BTW, I'm seeing a similar issue with drm_vblank_on/off as well, which
exposed the bug fixed by 209e4dbc ("drm/vblank: Use u32 consistently for
vblank counters"). I've been meaning to track that down since then; one
of these days hopefully, but if anybody has any ideas offhand...

I spent the last few hours reading through the drm and radeon code, and I think what should probably work is to replace the drm_vblank_pre/post_modeset calls in radeon/amdgpu with drm_vblank_off/on calls. These are apparently meant for drivers whose hw counters reset during modeset, and seem to reinitialize stuff properly and release clients' queued vblank events to avoid blocking - not tested so far, just looked at the code.

Once drm_vblank_off is called, drm_vblank_get will no-op and return an error, so clients can't enable vblank irqs during the modeset - pageflip ioctl and waitvblank ioctl would fail while a modeset happens - hopefully userspace handles this correctly everywhere.
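
The behaviour described above can be modelled in a few lines of userspace C. This is only a sketch under assumptions, not the kernel implementation: all toy_* names are hypothetical, -EINVAL is picked as an example error code, and resyncing "last" in toy_vblank_on() is my reading of "reinitialize stuff properly".

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-crtc vblank state. */
struct toy_crtc_vblank {
	bool off;       /* true between toy_vblank_off() and toy_vblank_on() */
	int refcount;   /* number of vblank references held by clients */
	uint32_t last;  /* hardware counter snapshot used for later updates */
};

/* Called at the start of a modeset: refuse new vblank references from now on. */
static void toy_vblank_off(struct toy_crtc_vblank *vbl)
{
	vbl->off = true;
}

/* Clients (pageflip, waitvblank) take a reference; this fails during a modeset. */
static int toy_vblank_get(struct toy_crtc_vblank *vbl)
{
	if (vbl->off)
		return -EINVAL; /* assumption: example error code for the toy model */
	vbl->refcount++;
	return 0;
}

/*
 * Called at the end of a modeset: resync the snapshot to the (possibly reset)
 * hardware counter before allowing vblank references again.
 */
static void toy_vblank_on(struct toy_crtc_vblank *vbl, uint32_t hw_now)
{
	vbl->last = hw_now;
	vbl->off = false;
}
```

In this model a toy_vblank_get() issued between toy_vblank_off() and toy_vblank_on() fails without touching the refcount, and the first update after the modeset starts from the fresh hw counter value rather than the stale pre-modeset one.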

It would also cause radeon's power management to not sync its actions to vblank if it got invoked during a modeset, but that seems to be handled by a 200 msec timeout and would hopefully only cause visual glitches - or invisible glitches while the crtc is blanked during modeset?

There could be another tiny race with the new "vblank counter bumping" logic from commit 5b5561b ("drm/radeon: Fixup hw vblank counters/ts ..."). It would trigger if drm_update_vblank_counter() were called multiple times in quick succession within the "radeon_crtc->lb_vblank_lead_lines" scanlines before the start of real vblank, iff at the same time a modeset happened and set radeon_crtc->lb_vblank_lead_lines to a smaller value due to a change in horizontal mode resolution. That needs a modeset to a higher horizontal resolution to happen just when the scanout is in exactly the right 5 or so scanlines, while some client is calling drm_vblank_get() to enable vblank irqs at the same time, but it would cause the same hang if it happened - not that likely to happen often, but still not nice, also Murphy's law... If we could switch to drm_vblank_off/on instead of drm_vblank_pre/post_modeset, we could remove that race as well by forbidding any vblank irq related activity during a modeset.

I'll hack up a patch for demonstration now.