Re: vanilla kernels hang randomly under Fedora 10 on system with Radeon card

From: Bartlomiej Zolnierkiewicz
Date: Sun Dec 07 2008 - 14:23:19 EST


On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > On Thursday 04 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > > On Wednesday 03 December 2008, Bartlomiej Zolnierkiewicz wrote:
> > > > On Tuesday 02 December 2008, Dave Airlie wrote:
> > > > > On Tue, Dec 2, 2008 at 8:42 AM, Bartlomiej Zolnierkiewicz
> > > > > <bzolnier@xxxxxxxxx> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > After Fedora 9 -> Fedora 10 upgrade vanilla kernels which previously
> > > > > > worked fine (next-20081128 and next-20081121) started to hang randomly
> > > > > > on my Pentium M / 855PM / RV350 laptop. Since (surprisingly) stock
> > > > > > Fedora kernel (2.6.27.5-117.fc10.i686) was not affected I got the idea
> > > > > > that either userspace changes uncovered some kernel regression or some
> > > > > > Fedora specific patch must be fixing the issue. Unfortunately vanilla
> > > > > > 2.6.27 also freezed so after the usual pain caused by hitting bunch of
> > > > > > unrelated problems [1] it turned out that drm-modesetting-radeon.patch
> > > > > > is the magic patch and CONFIG_DRM_RADEON_KMS is the magic change. With
> > > > > > the patch and enabling the option next-20081128 works stable again...
> > > > > >
> > > > > > Since the following error gets logged by kernel:
> > > > > >
> > > > > > [drm:drm_buffer_object_validate] *ERROR* Failed moving buffer. cef578c0 1444 4000027 10000a0
> > > > > > [drm:drm_buffer_object_validate] *ERROR* Out of aperture space or DRM memory quota.
> > > > > >
> > > > > > and it also seems that system is more responsive now (it was kind of
> > > > > > sluggish previously) my draft theory is that F9 -> F10 triggered some
> > > > > > AGP memory management bug and CONFIG_DRM_RADEON_KMS happens to fix it
> > > > > > but I'll leave figuring this up to the more knowledgeable people... ;)
> > > > >
> > > > > Well KMS is a purely Fedora thing, and enabling it completely avoids
> > > > > the old driver codepaths so
> > > > > while it might fix it, its more by accident than design.
> > > > >
> > > > > I'm trying to track down the rv3xx hangs with hpa at the moment as he
> > > > > sees them also, something in
> > > > > the 2.6.26->2.6.27 timeframe. I'm hoping running the 2.6.26 drm on
> > > > > the 2.6.27 will help narrow it down.
> > > > >
> > > > > Bisecting 2.6.26->2.6.27 might also help.
> > > >
> > > > It could be a different issue. I tried 2.6.26, 2.6.25 and 2.6.24
> > > > and they all hang (they all worked fine with Fedora 9)...
> > > >
> > > > I will try some older kernels but I start thinking that the xorg's ati
> > > > driver update is the main cause (xorg-x11-drv-ati-6.8.0-19.fc9.i386.rpm
> > > > -> xorg-x11-drv-ati-6.9.0-54.fc10.i386.rpm).
> > >
> > > I just went straight to trying downgrading the driver and the older driver
> > > indeed works fine. Then I tried to narrow down the problem and the lucky
> > > winner this time is the cute (== undocumented and unsigned-off) patch
> > > called radeon-6.9.0-remove-limit-heuristics.patch. The newer driver with
> > > only this patch reverted fixes hangs for vanilla kernels and drm errors
> > > for Fedora kernel. Also performance problems that I've noticed in the
> > > meantime (slower playback of 720p videos, sluggish window scrolling in
> > > kmail) are completely gone. That being said I'm not entirely sure whether
> >
> > I was too quick here -- performance problems are still present with
> > _Fedora_ kernel.
> >
> > Reassuming: what I currently need to do to get my gfx working properly
> > with F10 is reverting radeon-6.9.0-remove-limit-heuristics.patch from
> > xorg-x11-drv-ati and using vanilla kernel instead of Fedora's one.
>
> Heh, and it just hang on me after sending the above mail (it took like
> 1h or so for hang to occur) => the patch is just a very good trigger for
> the "real" bug. I'll now be running vanilla 6.9.0 to see how it goes...

It went well, "vanilla" in this case was xorg-x11-drv-ati-6.9.0-54.fc10
content _without_ radeon-modeset.patch and _with_ patch containing commit
da021c36bbdf3bca31ee50ebe01cdb9495c09b36 ("radeon_drm.h: remove kernel
defines") from xf86-video-ati git tree (needed to make things compile).

I tried to bisect it futher using radeon-gem-cs branch (using edge commit
deduced from radeon-modeset.patch) and managed to narrow it down further to
somewhere between

commit 44fb767aa95e5f0725386106b89d0782fd53b768
("radeon: fixup modesetting code after rebasing to master")

and

commit 12e71eaf7999520d23d50cfbcfc0299b2bdf7a9d
("port to using drm header files")

which left 66 commits which are completely unbisectable because of build
problems and bugfixes. I tried continuing with exporting commits from git
to patches, importing patches to quilt and shuffling them around to make
things bisectable again... Unfortunately this turned out to be more time
consuming than expected and I run out of time for this exercise...

Dave, do you have some ideas how can this be debugged further?
(i.e. rebuilding radeon-gem-cs tree would greatly help)

Or maybe it is not worth it until trying some updates/fixes first?

Thanks,
Bart
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/