Re: linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon

From: Mario Kleiner
Date: Mon Jan 25 2016 - 14:30:30 EST




On 01/25/2016 07:51 PM, Daniel Vetter wrote:
On Mon, Jan 25, 2016 at 05:38:30PM +0100, Mario Kleiner wrote:
Readding Daniel, which somehow got dropped from the cc.

On 01/25/2016 03:53 PM, Ville Syrjälä wrote:
On Mon, Jan 25, 2016 at 02:44:53PM +0100, Mario Kleiner wrote:


On 01/25/2016 02:23 PM, Ville Syrjälä wrote:
On Mon, Jan 25, 2016 at 02:16:45PM +0100, Mario Kleiner wrote:


On 01/25/2016 05:15 AM, Michel Dänzer wrote:
On 23.01.2016 00:18, Ville Syrjälä wrote:
On Fri, Jan 22, 2016 at 12:06:00PM +0900, Michel Dänzer wrote:

[ Trimming KDE folks from Cc ]

On 21.01.2016 19:09, Daniel Vetter wrote:
On Thu, Jan 21, 2016 at 05:36:46PM +0900, Michel Dänzer wrote:
On 21.01.2016 16:58, Daniel Vetter wrote:

Can you please point me at the vblank on/off jump bug please?

AFAIR I originally reported it in response to
http://lists.freedesktop.org/archives/dri-devel/2015-August/087841.html
, but I can't find that in the archives, so maybe that was just on IRC.
See
http://lists.freedesktop.org/archives/dri-devel/2016-January/099122.html
. Basically, I ran into the bug fixed by your patch because the counter
jumped forward on every DPMS off, so it hit the 32-bit boundary after
just a few days.

Ok, so just uncovered the overflow bug.

Not sure what you mean by "just", but to be clear: The drm_vblank_on/off
counter jumping bug (similar to the bug this thread is about), which
exposed the overflow bug, is still alive and kicking in 4.5. It seems
to happen when turning off the CRTC:

[drm:drm_update_vblank_count] updating vblank count on crtc 0: current=218104694, diff=0, hw=916 hw_last=916
[drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
[drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 0x7 p(2199,-45)@ 7304.307354 -> 7304.308006 [e 0 us, 0 rep]
[drm:radeon_get_vblank_counter_kms] crtc 0: dist from vblank start 3
[drm:drm_update_vblank_count] updating vblank count on crtc 0: current=218104694, diff=16776301, hw=1 hw_last=916

Not sure what bug we're talking about here, but here the hw counter
clearly jumps backwards.

[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:drm_update_vblank_count] updating vblank count on crtc 1: current=0, diff=0, hw=0 hw_last=0
[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:drm_update_vblank_count] updating vblank count on crtc 2: current=0, diff=0, hw=0 hw_last=0
[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:radeon_get_vblank_counter_kms] Query failed! stat 3
[drm:drm_update_vblank_count] updating vblank count on crtc 3: current=0, diff=0, hw=0 hw_last=0
[drm:radeon_get_vblank_counter_kms] Query failed! stat 1
[drm:drm_calc_vbltimestamp_from_scanoutpos] crtc 0 : v 0x1 p(0,0)@ 7304.317140 -> 7304.317140 [e 0 us, 0 rep]
[drm:radeon_get_vblank_counter_kms] Query failed! stat 1
[drm:drm_update_vblank_count] updating vblank count on crtc 0: current=234880995, diff=16777215, hw=0 hw_last=1

Same here.

At least one of the jumps is expected, because this is around turning
off the CRTC for DPMS off. Don't know yet why there are two jumps back
though.


These things just don't happen on i915 because drm_vblank_off() and
drm_vblank_on() are always called around the times when the hw counter
might get reset. Or at least that's how it should be.

Which is of course the idea of Daniel's patch (which is what I'm getting
the above with) or Mario's patch as well, but clearly something's still
wrong. It's certainly possible that it's something in the driver, but
since calling drm_vblank_pre/post_modeset from the same places seems to
work fine (ignoring the regression discussed in this thread)... Do
drm_vblank_on/off require something else to handle this correctly?



I suspect it is because vblank_disable_and_save calls
drm_update_vblank_count() unconditionally, even if vblank irqs are
already off.

So on a manual display disable -> reenable you get something like

At disable:

Call to dpms-off --> atombios_crtc_dpms(DPMS_OFF) --> drm_vblank_off ->
vblank_disable_and_save -> irqs off, drm_update_vblank_count() computes
final count.

Then the crtc is shut down and its hw counter resets to zero.

At reenable:

Modesetting -> drm_crtc_helper_set_mode -> crtc_funcs->prepare(crtc) ->
atombios_crtc_prepare() -> atombios_crtc_dpms(DPMS_OFF) ->
drm_vblank_off -> vblank_disable_and_save -> A pointless
drm_update_vblank_count() while the hw counter is already reset to zero
--> Unwanted counter jump.


The problem doesn't happen on a pure modeset to a different video
resolution/refresh rate, as then we only have one call into
atombios_crtc_dpms(DPMS_OFF).

I think the fix is to fix vblank_disable_and_save() to only call
drm_update_vblank_count() if vblank irqs get actually disabled, not on
no-op calls. I will try that now.

It does that on purpose. Otherwise the vblank counter would appear to
have stalled while the interrupt was off.


Ok, that's what the comments there say, although i don't see atm. why
that perceived stall would be a big problem. I checked all callers of
vblank_disable_and_save(). They are all careful to not call that
function if vblanks are already disabled. The only exception is
drm_vblank_off(). If drm_vblank_off/on is supposed to protect kms
drivers which have resetting hw counters or other problematic behaviour
during modesets etc. then this will break. E.g., calling the vblank
timestamping stuff is also not safe/well-defined during modesets when
the timestamping constants are not (yet) updated to reflect the new mode
timing of the modeset in progress.

The idea is to maintain the appearance that the counter ticks all the
time as long as the crtc is active. While that may not be really
required in case if no one is currently interested in the vblank
counter, I think it's a nice thing to have just to make the behaviour
of the counter consistent.

As far as calling drm_vblank_off() after the hw counter got reset, well,
that not correct. It should be called before the reset.

What radeon does is calling drm_vblank_off at beginning of DPMS_OFF. The
first call to DMPS_OFF will call drm_vblank_off() and really disable
vblank-irqs if they were running, updating the counts/ts a last time. But
then the dpms off will reset the hw counter to zero. When one reenables the
display, a second call to DPMS_OFF will now call drm_vblank_off again when
it apparently shouldn't.

I just tested this patch, which fixes the counter jumps on radeon-kms with
my or Daniel's drm_vblank_off patches to radeon:

This might be due to the legacy helpers, which just love to redundantly
disable stuff that's off already. The problem I see with no-oping these
out is that for atomic drivers (which really should get this right) this
might paper over bugs: E.g. when you forget to call _off() when disabling
the crtc, then calling _on() twice in a row is indeed a serious bug.
Similar when you forget to call _on() and have multiple _off() calls in a
row.

So not sure what to do here.
-Daniel


Yes, the legacy helpers cause two calls to dpms off if one disables a display. First during display disable as intended. Then when one reenables the display during modesetting as part of crtc_funcs->prepare() - at least on radeon.

Maybe the minimum thing that would help is to just check for vblank->inmodeset in drm_vblank_off(). If that would be the case we'd know it is a redundant call and could no-op it and do a WARN_ON(vblank->inmodeset)?

drm_vblank_on() i don't know how to treat, but that one calls drm_reset_vblank_timestamp() which should be less problematic if called redundantly.

Now the patch i want to try next to fix the drm_vblank_pre/post_modeset regression in Linux 4.4/4.5 is to add a ...

if ((diff > 1) && vblank->inmodeset) diff = 1;

... to the bottom of drm_update_vblank_count(). That should hopefully restore the pre/post_modeset behavior as close to the original behavior as possible. As a side effect it would also prevent the counter jump caused by redundant calls to drm_vblank_off().

-mario


diff --git a/drivers/gpu/drm/drm_irq.c b/drivers/gpu/drm/drm_irq.c
index 607f493..d739d93 100644
--- a/drivers/gpu/drm/drm_irq.c
+++ b/drivers/gpu/drm/drm_irq.c
@@ -1313,7 +1313,10 @@ void drm_vblank_off(struct drm_device *dev, unsigned
int pipe)
spin_lock_irqsave(&dev->event_lock, irqflags);

spin_lock(&dev->vbl_lock);
- vblank_disable_and_save(dev, pipe);
+ DRM_DEBUG_VBL("crtc %d, vblank enabled %d\n", pipe,
vblank->enabled);
+
+ if (vblank->enabled)
+ vblank_disable_and_save(dev, pipe);
wake_up(&vblank->queue);

/*
@@ -1415,6 +1418,8 @@ void drm_vblank_on(struct drm_device *dev, unsigned
int pipe)
return;

spin_lock_irqsave(&dev->vbl_lock, irqflags);
+ DRM_DEBUG_VBL("crtc %d, vblank enabled %d\n", pipe,
vblank->enabled);
+
/* Drop our private "prevent drm_vblank_get" refcount */
if (vblank->inmodeset) {
atomic_dec(&vblank->refcount);



Another, maybe better, approach might be to no-op redundant calls to
drm_vblank_off() iff vblank->inmodeset and no-op redundant calls to
drm_vblank_on() iff !vblank->inmodeset.

-mario




-mario



Otherwise kms drivers would have to be careful to never call
drm_vblank_off multiple times before calling drm_vblank_on, but the help
text to drm_vblank_on() claims that unbalanced calls to these functions
are perfectly fine.

-mario