Re: nouveau shuts the machine down with v3.9-rc1 (temperature (72 C) hit the 'shutdown' threshold).

From: Konrad Rzeszutek Wilk
Date: Mon Mar 04 2013 - 16:41:38 EST


On Mon, Mar 04, 2013 at 08:21:48PM +0100, Martin Peres wrote:
> Hi Konrad,
>
> On 04/03/2013 19:40, Konrad Rzeszutek Wilk wrote:
> > After git merge ab7826595e9ec51a51f622c5fc91e2f59440481a
> > (Merge tag 'mfd-3.9-1' of git://git.kernel.org/pub/scm/linux/kernel/git/sameo/mfd-2.6)
> > the nouveau driver ends up shutting off the machine when booting.
> >
> >
> > I haven't done a git bisection yet and was wondering if there are some
> > juicy commits I ought to look at?
>
> Sure, no need to bisect, it is a new (apparently-broken-for-you) feature.
>
> The code is in /drivers/gpu/drm/nouveau/core/subdev/therm/
>
>
> >
> > Here is the serial console:
>
>
> > [ 6.940628] nouveau [ PTHERM][0000:00:0d.0] Thermal management: disabled
> > [ 6.957474] nouveau [ PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
> > [ 6.966594] nouveau 6.975100] nouveau [ PTHERM][0000:00:0d.0] Thermal management: automatic
> > [ 6.982059] nouveau [ PTHERM][0000:00:0d.0] temperature (88 C) hit the 'downclock' threshold
> > [ 6.990680] nouveau [ PTHERM][0000:00:0d.0] temperature (88 C) hit the 'critical' threshold
> > [ 6.999194] nouveau [ PTHERM][0000:00:0d.0] temperature (90 C) hit the 'shutdown' threshold
>
> See, this is strange. If I believe the "programmed thresholds" line,
> the fanboost threshold is at 90°C, downclock is at 95°C, critical
> temperature is at 145°C and shutdown is at 135°C.
> So, from the BIOS side, things seem to be in fairly good shape
> (critical should be lower than shutdown, but that's OK).
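
A minimal sketch (not nouveau code) of the reading described above: it pulls the four entries out of a "programmed thresholds" line and reports which neighbouring thresholds are out of order. The function names are hypothetical, the order fanboost/downclock/critical/shutdown is taken from the explanation above, and treating the number in parentheses as a hysteresis value is an assumption.

```python
import re

# Order of the four entries, as described in the message above.
NAMES = ("fanboost", "downclock", "critical", "shutdown")


def parse_thresholds(line):
    """Return {name: (temp_c, paren_value)} from a 'programmed thresholds' log line.

    The value in parentheses is assumed (not confirmed here) to be a hysteresis.
    """
    body = re.search(r"programmed\s+thresholds\s*\[(.*?)\]", line)
    if body is None:
        raise ValueError("no 'programmed thresholds' entry in line")
    pairs = re.findall(r"(\d+)\((\d+)\)", body.group(1))
    if len(pairs) != len(NAMES):
        raise ValueError("expected %d thresholds, got %d" % (len(NAMES), len(pairs)))
    return {name: (int(t), int(h)) for name, (t, h) in zip(NAMES, pairs)}


def ordering_problems(thresholds):
    """List neighbouring threshold pairs whose temperatures do not increase."""
    problems = []
    for lower, upper in zip(NAMES, NAMES[1:]):
        if thresholds[lower][0] >= thresholds[upper][0]:
            problems.append((lower, upper))
    return problems


if __name__ == "__main__":
    log = ("[ 6.957474] nouveau [ PTHERM][0000:00:0d.0] programmed "
           "thresholds [ 90(2), 95(3), 145(2), 135(5) ]")
    thr = parse_thresholds(log)
    print(thr)
    print(ordering_problems(thr))
```

On the line from this report, the only pair flagged is critical (145) versus shutdown (135), which matches the remark above that critical should be lower than shutdown.
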
>
> My theory is that your temperature sensor reading is very variable, which
> would set off the shutdown alarm. So, either the sensor needs more
> settling time or the output is genuinely very variable.

You should see it when I boot it under Xen:

[ 8.427789] nouveau [ PTHERM][0000:00:0d.0] programmed thresholds [ 90(2), 95(3), 145(2), 135(5) ]
[ 8.427855] nouveau [ PTHERM][0000:00:0d.0] temperature (222 C) hit the 'fanboost' threshold
[ 8.427919] nouveau [ PTHERM][0000:00:0d.0] Thermal management: automatic
[ 8.427973] nouveau [ PTHERM][0000:00:0d.0] temperature (222 C) hit the 'downclock' threshold
[ 8.428036] nouveau [ PTHERM][0000:00:0d.0] temperature (222 C) hit the 'critical' threshold
[ 8.428099] nouveau [ PTHERM][0000:00:0d.0] temperature (222 C) hit the 'shutdown' threshold

>
> In the first case, we could fix that by increasing the settling time
> (at the expense of a longer boot period). We could also force a 10s
> wait at boot time before reading the temperature.
> In the latter case, the only solution is to average the
> temperature over several samples. I would need statistics on the
> variability in order to calculate a proper low-pass filter that
> wouldn't be too slow or too RAM/wakeup-intensive.
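
To make the averaging idea concrete, here is a rough sketch of one possible low-pass filter, an exponential moving average. This is an illustration only, not what nouveau does or will do: the function name, the alpha value and the sample readings are all made up for the example.

```python
def low_pass(samples, alpha=0.25):
    """Exponential moving average over temperature samples.

    A smaller alpha smooths more but reacts more slowly. Only one value of
    state is kept between samples, so the memory cost is constant.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    state = None
    for sample in samples:
        # Seed the filter with the first reading, then blend in each new one.
        state = sample if state is None else state + alpha * (sample - state)
        filtered.append(state)
    return filtered


if __name__ == "__main__":
    # Hypothetical readings in degrees C with a single one-sample spike.
    readings = [70, 72, 90, 71, 73]
    print(low_pass(readings))
```

With these made-up readings the filtered value peaks at about 75 C, so the lone 90 C sample would not reach a 90 C threshold. The filter is seeded with the first reading, so a bad first sample passes straight through; that case is covered by the settling time, not by the averaging.
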
>
> I really hope the problem is the settling time!
>
>
> Here is what you can do to test the theory:
>
> Change the mdelay at line 41 of
> /drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c (http://cgit.freedesktop.org/nouveau/linux-2.6/tree/drivers/gpu/drm/nouveau/core/subdev/therm/nv40.c#n41)
> from 10 to 1000.
> Please also add an mdelay of 1000 between lines 44 and 45.

Let me do that tomorrow and report my findings.
>
> If it works with this patch, then try decreasing the delay to 20ms.
>
> In any case, I'll send some thermal patches tonight to be more
> resistant to long settling times.

Please CC me if you would also like me to test them with the
mdelay patch.

>
> Thanks for reporting!

Of course.
>
> Martin (mupuf)
>
>