Re: linux-4.4 bisected: kwin5 stuck on kde5 loading screen with radeon

From: Mario Kleiner
Date: Fri Jan 15 2016 - 23:25:00 EST




On 01/15/2016 01:26 PM, Ville Syrjälä wrote:
On Fri, Jan 15, 2016 at 11:34:08AM +0100, Vlastimil Babka wrote:
Hi,

since kernel 4.4 I'm unable to login to kde5 desktop (on openSUSE
Tumbleweed). There's a screen with progressbar showing the startup,
which normally fades away after reaching 100%. But with kernel 4.4, the
progress gets stuck somewhere between 1/2 and 3/4 (not always the same).
Top shows that kwin is using few % of CPU's but mostly sleeps in poll().
When I kill it from another console, I see that everything has actually
started up, just the progressbar screen was obscuring it. The windows
obviously don't have decorations etc. Starting kwin manually again shows
me again the progressbar screen at the same position.


Depressing. I was stress-testing those patches with Linux 4.4 for days on 2 AMD gpu's (HD-4000 RV 730 and HD-5770) under KDE 5 Plasma 5.4.2 (KUbuntu 15.10, XOrg 1.17) and just retested Linux 4.4 on nouveau/radeon/intel also with XOrg 1.18 and XOrg master) with Linux 4.4 a few days ago and never encountered such a hang or other vblank related problem on KDE-5 or GNOME-3.

I'm currently running...

while xinit /usr/bin/ksplashqml --test -- :1 ; do echo yay; done

... in an endless loop on Linux 4.4 SMP PREEMPT on HD-5770 and so far i can't trigger a hang after hundreds of runs.

Does this also hang for you?

Hmm. Sounds like it could then be waiting for a vblank in the distant
future. There's that 1<<23 limit in the code though, but even with that
we end up with a max wait of ~38 hours assuming a 60Hz refresh rate.


xtrace suggests that ksplashqml seems to use classic OpenGL + glXSwapBuffers under DRI2. So no clever swap scheduling based on vblank counter values.

Stuff to try might include enabling drm.debug=0x2f, though that'll
generate a lot of stuff. Another option would be to use the drm vblank
tracepoints to try and catch what seq number it's waiting for and
where we're at currently. Or I suppose you could just hack
up drm_wait_vblank() to print an error message or something if the
requested seq number is in the future by, say, more than a few seconds,
and if that's the case then we could try to figure out why that happens.


I have suspected that kwin is waiting for some event, but nevertheless
tried bisecting the kernel between 4.3 and 4.4, which lead to:

# first bad commit: [4dfd64862ff852df7b1198d667dda778715ee88f] drm: Use
vblank timestamps to guesstimate how many vblanks were missed

I can confirm that 4.4 works if I revert the following commits:
63154ff230fc9255cc507af6277cd181943c50a1 "drm/amdgpu: Fixup hw vblank
counter/ts for new drm_update_vblank_count() (v3)"

d1145ad1e41b6c33758a856163198cb53bb96a50 "drm/radeon: Fixup hw vblank
counter/ts for new drm_update_vblank_count() (v2)"

The sha1s don't seem to match what I have, so not sure which kernel tree
you have, but looking at the radeon commit at least one thing
immediately caught my attention;

+ /* Bump counter if we are at >= leading edge of vblank,
+ * but before vsync where vpos would turn negative and
+ * the hw counter really increments.
+ */
+ if (vpos >= 0)
+ count++;

It's rather hard to see what it's really doing since the custom flags to
the get_scanout_position now cause it return non-standard things. But if
I'm reading things correctly it should really say something like:

if (vpos >= 0 && vpos < (vsync_start - vblank_start))
count++;

Hmm. Actually even that might not be correct since it could be using the
"fake" vblank start here, so might be it'd need to be something like:

if (vpos >= 0 && vpos < (vsync_start - vblank_start + lb_vblank_lead_lines)
count++;


The current code should be correct. vpos here returns the distance of hw vertical scanout position to the start of vblank. According to Alex and Harry Wentland of AMD's display team, and my testing of my two cards the hw vertical scanout position resets to zero at start line of vsync, therefore the "vpos" in that code becomes negative at start of vsync. At the same time the hw frame counter increments by one, making that "count++" to bump the returned count by +1 no longer neccessary.

If the reset of hw vertical scanout pos to zero and the increment of hw frame counter wouldn't happen at exactly the same time at start of vsync i could see how that could cause two successive queries of driver->get_vblank_counter() could report a count of N+1 and then N if the timing of both calls would be just perfectly right. That would cause the DRM code to falsely detect counter wraparound and jump the vblank counter forward by 2^24.

My tested gpu's had DCE-3 or DCE-4 display engines, Caicos has DCE-5, so could this be some hw quirk for DCE-5?


Also might be worth a shot to just ignore the hw frame counter. Eg.:

index e266ffc520d2..db728580549a 100644
--- a/drivers/gpu/drm/radeon/radeon_drv.c
+++ b/drivers/gpu/drm/radeon/radeon_drv.c
@@ -492,7 +492,6 @@ static struct drm_driver kms_driver = {
.lastclose = radeon_driver_lastclose_kms,
.set_busid = drm_pci_set_busid,
.unload = radeon_driver_unload_kms,
- .get_vblank_counter = radeon_get_vblank_counter_kms,
.enable_vblank = radeon_enable_vblank_kms,
.disable_vblank = radeon_disable_vblank_kms,
.get_vblank_timestamp = radeon_get_vblank_timestamp_kms,
diff --git a/drivers/gpu/drm/radeon/radeon_irq_kms.c b/drivers/gpu/drm/radeon/radeon_irq_kms.c
index 979f3bf65f2c..3c5fcab74152 100644
--- a/drivers/gpu/drm/radeon/radeon_irq_kms.c
+++ b/drivers/gpu/drm/radeon/radeon_irq_kms.c
@@ -152,11 +152,6 @@ int radeon_driver_irq_postinstall_kms(struct drm_device *dev)
{
struct radeon_device *rdev = dev->dev_private;

- if (ASIC_IS_AVIVO(rdev))
- dev->max_vblank_count = 0x00ffffff;
- else
- dev->max_vblank_count = 0x001fffff;
-
return 0;
}

assuming I'm reading the code correctly.


31ace027c9f1f8e0a2b09bbf961e4db7b1f6cf19 "drm: Don't zero vblank
timestamps from the irq handler"

ac0567a4b132fa66e3edf3f913938af9daf7f916 "drm: Add DRM_DEBUG_VBL()"

4dfd64862ff852df7b1198d667dda778715ee88f "drm: Use vblank timestamps to
guesstimate how many vblanks were missed"

All clean reverts, just needs some fixup on top to use abs() instead of
abs64() due to 79211c8ed19c055ca105502c8733800d442a0ae6.

Unfortunately I don't know if this is a kernel problem or kwin problem.
I tried to CC maintainers of both, advices what to try or what info to
provide welcome. The card is "CAICOS" with 1GB memory.


I think a drm.debug=0x21 setting and grep'ping the syslog for "vblank" should probably give useful info around the time of the hang.

Maybe also check XOrg.0.log for (WW) warnings related to flip.

thanks,
-mario


Thanks,
Vlastimil