Re: random insta-reboots on AMD Phenom II

From: Markus Trippelsdorf
Date: Sat Sep 30 2017 - 11:21:15 EST


On 2017.09.30 at 10:20 -0400, Brian Gerst wrote:
> On Sat, Sep 30, 2017 at 8:47 AM, Markus Trippelsdorf
> <markus@xxxxxxxxxxxxxxx> wrote:
> > On 2017.09.30 at 13:53 +0200, Borislav Petkov wrote:
> >> On Sat, Sep 30, 2017 at 01:29:03PM +0200, Adam Borowski wrote:
> >> > On Sat, Sep 30, 2017 at 01:11:37PM +0200, Borislav Petkov wrote:
> >> > > On Sat, Sep 30, 2017 at 04:05:16AM +0200, Adam Borowski wrote:
> >> > > > Any hints how to debug this?
> >> > >
> >> > > Do
> >> > > rdmsr -a 0xc0010015
> >> > > as root and paste it here.
> >> >
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> > 1000010
> >> >
> >> > on both 4.13.4 and 4.14-rc2+.
> >>
> >> Boot into -rc2+ and do as root:
> >>
> >> # wrmsr -a 0xc0010015 0x1000018
> >>
> >> If the issue gets fixed then Mr. Luto better revert the new lazy TLB
> >> flushing fun'n'games for 4.14 before it is too late and that kernel
> >> releases b0rked.
> >
> > The issue does get fixed by setting TlbCacheDis to 1. I have been
> > running it for the last few weeks without any problems.
> > Performance is not affected at all. So it might be easier to just set
> > the bit for older AMD processors as a boot quirk.
> > Changing the TLB code so late might not be a good idea...
>
> Looking at the AMD K10 revision guide
> (http://support.amd.com/TechDocs/41322_10h_Rev_Gd.pdf), errata #298
> that this fixes should only apply to revisions DR-BA and DR-B2, which
> include the original Phenom, but not Phenom II. The Phenom II X6 is
> revision PH-E0, which does not have this errata.

It has nothing to do with erratum #298. The new lazy TLB code causes
MCEs because the page tables may now contain garbage.
See the long "Current mainline git (24e700e291d52bd2) hangs when
building e.g. perf" LKML thread.
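For reference, the value in the wrmsr above is just the reported HWCR
value (MSR 0xc0010015) with bit 3, TlbCacheDis, set: 0x1000010 | (1 << 3)
= 0x1000018. A minimal sketch of that arithmetic follows. The helper
names are hypothetical (not part of msr-tools), and the script only
decodes the hex string that rdmsr prints; it never reads or writes an MSR.

```shell
#!/bin/bash
# Hypothetical helpers: operate on an HWCR value as printed by
# `rdmsr -a 0xc0010015` (hex digits, no 0x prefix, e.g. "1000010").

# Return success if bit 3 (TlbCacheDis) is set in the given value.
tlbcachedis_set() {
    local val=$(( 0x$1 ))
    [ $(( (val >> 3) & 1 )) -eq 1 ]
}

# Print the value with bit 3 (TlbCacheDis) set, in wrmsr-ready 0x form.
with_tlbcachedis() {
    printf '0x%x\n' $(( 0x$1 | (1 << 3) ))
}
```

E.g. "with_tlbcachedis 1000010" prints 0x1000018, the value passed to
wrmsr above, and "tlbcachedis_set 1000010" fails because the bit is
clear on all six cores in Adam's output.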
--
Markus