Re: PROBLEM: Fatal Machine Check >= 3.13.5-101.fc19.x86_64

From: Borislav Petkov
Date: Fri Apr 18 2014 - 08:40:37 EST


On Fri, Apr 18, 2014 at 01:45:42PM +0200, Matthias Graf wrote:
> I applied your patch to Linus' current master (3.15.0-rc1+) and indeed
> it does solve the issue for me!
>
> Thanks for your help.
>
> I would appreciate it if you keep me posted on updates.

Ok, goodie, so this one really causes problems.

Btw, please try not to top-post:

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?

Thanks.

Btw, you could check whether your BIOS has an update; who knows, it
could be addressing this issue too. Simply look at the first couple of
lines of dmidecode output to check your current BIOS version and then
check your board vendor's site (Gigabyte, I think) for updates.
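
FWIW, a minimal sketch of pulling just the BIOS lines out of the
dmidecode output. The helper name bios_info is made up for illustration,
and it assumes the usual "BIOS Information" block layout that dmidecode
prints:

```shell
# Hypothetical helper: read dmidecode output on stdin and print only the
# vendor, version and release date lines of the "BIOS Information" block.
bios_info() {
    awk '
        /^BIOS Information/ { in_bios = 1; next }
        # A blank line ends the block; stop reading there.
        in_bios && /^[ \t]*$/ { exit }
        in_bios && /^[ \t]*(Vendor|Version|Release Date):/ {
            sub(/^[ \t]+/, "")
            print
        }
    '
}
```

Usage would be something like "sudo dmidecode | bios_info"; dmidecode
normally needs root to read the DMI tables.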

Alex, Christian, guys, it looks like this commit below is causing some
sort of a core stall/livelock on Matthias' machine:

Hardware event. This is not a software error.
CPU 0 BANK 0
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: BUS Level-0 Local-CPU-originated-request Generic Memory-access Request-did-not-timeout Error
BQ_DCU_READ_TYPE BQ_ERR_HARD_TYPE BQ_ERR_HARD_TYPE
timeout BINIT (ROB timeout). No micro-instruction retired for some time
STATUS b200004000000800 MCGSTATUS 5
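
For reference, a rough sketch of how the raw STATUS and MCGSTATUS values
above map to the decoded flags. The function names are made up for
illustration; the bit positions are the architectural ones from the
Intel SDM (MCi_STATUS bits 63..57 plus the low 16 bits for the MCA error
code, MCG_STATUS bits 0..2). The model-specific bits are deliberately
not decoded here:

```shell
# Hypothetical helper: decode the architectural flag bits of a 64-bit
# MCi_STATUS value given as 16 hex digits (optional 0x prefix).
decode_mci_status() {
    raw=${1#0x}
    [ ${#raw} -eq 16 ] || { echo "expected 16 hex digits" >&2; return 1; }
    # Split into 32-bit halves so plain POSIX sh arithmetic is enough.
    hi=$((0x${raw%????????}))
    lo=$((0x${raw#????????}))
    flags=""
    bit=31                      # bit 63 of STATUS is bit 31 of the upper half
    for name in VAL OVER UC EN MISCV ADDRV PCC; do
        if [ $(( ($hi >> $bit) & 1 )) -eq 1 ]; then
            flags="${flags:+$flags }$name"
        fi
        bit=$(($bit - 1))
    done
    printf 'flags: %s\n' "$flags"
    printf 'mca error code: 0x%04x\n' $(($lo & 0xffff))
}

# Hypothetical helper: decode MCG_STATUS (hex) bits 0..2.
decode_mcg_status() {
    val=$((0x${1#0x}))
    flags=""
    bit=0
    for name in RIPV EIPV MCIP; do
        if [ $(( ($val >> $bit) & 1 )) -eq 1 ]; then
            flags="${flags:+$flags }$name"
        fi
        bit=$(($bit + 1))
    done
    printf 'flags: %s\n' "$flags"
}

decode_mci_status b200004000000800   # flags: VAL UC EN PCC, code 0x0800
decode_mcg_status 5                  # flags: RIPV MCIP
```

UC, EN and PCC correspond to the "Uncorrected error", "Error enabled"
and "Processor context corrupt" lines; 0x0800 is the bus/interconnect
error code that gets printed as the "MCA: BUS Level-0 ..." line.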

Reverting it fixes the issue so it must be something DPM-related on
RV770 in conjunction with his platform. I'm leaving in the rest for
reference; if you need more info, the thread starts here:

http://lkml.kernel.org/r/532C727F.1080803@xxxxxxxxxx

Thanks.

> >> Fine-grained bisection result:
> >>
> >> ab70b1dde73ff4525c3cd51090c233482c50f217 is the first bad commit
> >> commit ab70b1dde73ff4525c3cd51090c233482c50f217
> >> Author: Alex Deucher <alexander.deucher@xxxxxxx>
> >> Date: Fri Nov 1 15:16:02 2013 -0400
> >>
> >> drm/radeon: enable DPM by default on r7xx asics
> >>
> >> Seems to be stable on them.
> >>
> >> Signed-off-by: Alex Deucher <alexander.deucher@xxxxxxx>
> >>
> >> :040000 040000 f3262029b868df4d882f64b4deba6b9230e307ea
> >> 1f1dfca42763703a56e3cc82bb103608a24be94e M drivers
> >>
> >>
> >> Result is reasonable: I have a RV770 chip.
> >
> > Yes it is.
> >
> >> (Additional) Bug Report for Reference:
> >> https://bugzilla.redhat.com/show_bug.cgi?id=1085785
> >>
> >> Thanks for the instructions Borislav! At first, I was not completely
> >> sure what you expected me to do (this is my first kernel bug report :)).
> >
> > And you're doing good so far! :-)
> >
> >> If there is anything more I can help you with, let me know.
> >
> > Ok, now we want to confirm that this patch is *actually* the culprit by
> > reverting it. Simply pull Linus' master branch to have the latest tree,
> > and then do:
> >
> > $ git checkout -b radeon-revert master
> >
> > so that you land on a throwaway branch where we can play. Then normally you
> > would do
> >
> > $ git revert ab70b1dde73ff4525c3cd51090c233482c50f217
> >
> > but that causes conflicts so I did it for you, see below. Simply apply
> > this patch on top *without* doing the revert with git. Then build, boot
> > and test. We want to see whether it still generates those ROB timeout
> > machine checks. If all looks ok, then we're pretty sure we need to talk
> > about DPM with your GPU on your platform with Alex. :-)
> >
> > Feel free to ask any questions should something be not clear.
> >
> > Thanks.
> >
> > ---
> > From 0790e872f6d3c986d9ed36b850fd9d799dc422f9 Mon Sep 17 00:00:00 2001
> > From: Borislav Petkov <bp@xxxxxxx>
> > Date: Fri, 18 Apr 2014 11:43:12 +0200
> > Subject: [PATCH] Revert "drm/radeon: enable DPM by default on r7xx asics"
> >
> > This reverts commit ab70b1dde73ff4525c3cd51090c233482c50f217.
> >
> > Conflicts:
> > drivers/gpu/drm/radeon/radeon_pm.c
> > ---
> > drivers/gpu/drm/radeon/radeon_pm.c | 8 ++++----
> > 1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/radeon/radeon_pm.c b/drivers/gpu/drm/radeon/radeon_pm.c
> > index ee738a524639..af693c4746da 100644
> > --- a/drivers/gpu/drm/radeon/radeon_pm.c
> > +++ b/drivers/gpu/drm/radeon/radeon_pm.c
> > @@ -1257,6 +1257,10 @@ int radeon_pm_init(struct radeon_device *rdev)
> >  	case CHIP_RV670:
> >  	case CHIP_RS780:
> >  	case CHIP_RS880:
> > +	case CHIP_RV770:
> > +	case CHIP_RV730:
> > +	case CHIP_RV710:
> > +	case CHIP_RV740:
> >  	case CHIP_BARTS:
> >  	case CHIP_TURKS:
> >  	case CHIP_CAICOS:
> > @@ -1273,10 +1277,6 @@ int radeon_pm_init(struct radeon_device *rdev)
> >  		else
> >  			rdev->pm.pm_method = PM_METHOD_PROFILE;
> >  		break;
> > -	case CHIP_RV770:
> > -	case CHIP_RV730:
> > -	case CHIP_RV710:
> > -	case CHIP_RV740:
> >  	case CHIP_CEDAR:
> >  	case CHIP_REDWOOD:
> >  	case CHIP_JUNIPER:
> >
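
In case someone else wants to repeat the test, the steps quoted above
can be sketched roughly like this. The helper name and the patch file
name are made up, the base branch is assumed to be "master", and the
patch should be saved from the original mail rather than from a quoted
copy:

```shell
# Hypothetical helper: create a throwaway branch off a base branch and
# apply a patch file to the working tree. Nothing is modified if the
# patch does not apply cleanly.
apply_on_throwaway_branch() {
    branch=$1
    patch=$2
    base=${3:-master}

    git checkout -q -b "$branch" "$base" || return 1
    # --check is a dry run: it reports problems without touching files.
    git apply --check "$patch" || return 1
    git apply "$patch"
}
```

For example: "apply_on_throwaway_branch radeon-revert revert-dpm.patch",
then build, boot and watch whether the machine checks come back.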


--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.