Re: mgag200 fails kdump kernel booting

From: Baoquan He
Date: Wed Feb 05 2020 - 02:31:57 EST


Hi Dave, Lyude,

On 07/02/19 at 06:51am, David Airlie wrote:
> On Wed, Jun 26, 2019 at 6:29 PM Baoquan He <bhe@xxxxxxxxxx> wrote:
> >
> > On 06/26/19 at 04:15pm, Baoquan He wrote:
> > > Hi Dave,
> > >
> > > We hit a kdump kernel boot failure on a Lenovo system. The kdump kernel
> > > failed to boot and instead reset to firmware, rebooting the system.
> > > Nothing was printed out.
> > >
> > > The machine is a big server with 6T of memory and many CPUs. Its graphics
> > > driver module is mgag200.
> > >
> > > With 'earlyprintk=ttyS0' added to the kernel command line, only one line
> > > was printed to the console during kdump kernel boot:
> > > KASLR disabled: 'nokaslr' on cmdline.
> > >
> > > Then it reset to firmware and rebooted the system.
> > >
> > > Further debugging showed that the failure happens in
> > > arch/x86/boot/compressed/misc.c, during the kernel decompression stage,
> > > and is triggered by VGA printing. In __putstr() of
> > > arch/x86/boot/compressed/misc.c, the code checks whether earlyprintk= is
> > > specified and, if so, prints to that target. Whether or not earlyprintk=
> > > is given, it also prints to VGA, and printing to VGA caused the reset to
> > > firmware. That's why we see nothing when earlyprintk= is not specified,
> > > but see only the one line about 'KASLR disabled'.
> >
> > Here I mean:
> > That's why we see nothing when earlyprintk= is not specified, but see only
> > the one 'KASLR disabled' line when earlyprintk=ttyS0 is added.
>
> Just to clarify: if the original kernel is booted with mgag200 turned
> off, kexec works, but if the original kernel loads mgag200, the
> kexec kernel resets hard when VGA is used to write stuff out.
>
> This *might* be fixable in the controlled kexec case, by having an
> mgag200 shutdown path that tries to put the gpu back into a state
> where VGA doesn't die, but for the uncontrolled kexec it'll still be a
> problem, since once the gpu is up and running and VGA is disabled, it
> doesn't expect to see any more VGA transactions.

We have now received two more bug reports on different systems, and after
debugging we found that they are the same issue as this one. Adding
'nomodeset' works around it.

With help from our QA, we tested more systems that use mgag200. Not all of
them have this issue; some of them can jump to the kdump kernel fine after
a panic.

Any suggestions on how to proceed? I can run experiments. Or, if you would
like to have a look when convenient, I can get one system for you to
check. Or can we just use 'nomodeset' as a workaround and put this
issue on hold for the time being?

I would appreciate any suggestions or ideas.
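
For anyone following along, the print path described in the original
report can be modelled in user space roughly as below. This is only a
sketch of the behaviour as reported, not the code from
arch/x86/boot/compressed/misc.c: model_putstr(), struct sink,
serial_enabled and the two buffers are made-up names, and the buffers
merely stand in for ttyS0 and VGA text memory.

```c
#include <stddef.h>
#include <string.h>

#define SINK_SIZE 256

/* Hypothetical stand-in for an output device: it just records bytes. */
struct sink {
	char buf[SINK_SIZE];
	size_t len;
};

struct sink serial_sink;	/* models ttyS0 */
struct sink vga_sink;		/* models VGA text memory */
int serial_enabled;		/* models "earlyprintk=ttyS0 was given" */

/* Append one character, always keeping the buffer NUL-terminated. */
static void sink_putc(struct sink *s, char c)
{
	if (s->len + 1 < SINK_SIZE) {
		s->buf[s->len++] = c;
		s->buf[s->len] = '\0';
	}
}

/* Model of the reported behaviour: serial is conditional, VGA is not. */
void model_putstr(const char *msg)
{
	const char *p;

	if (serial_enabled) {
		for (p = msg; *p; p++)
			sink_putc(&serial_sink, *p);
	}

	/*
	 * Assumption taken from the report: this write happens regardless
	 * of earlyprintk=, and on the affected machines it is the step
	 * that ends in the reset to firmware.
	 */
	for (p = msg; *p; p++)
		sink_putc(&vga_sink, *p);
}
```

In this model the serial write comes first, which matches what was
observed: with earlyprintk=ttyS0 the 'KASLR disabled' line reaches the
serial console before the VGA write is attempted, and without it
nothing is visible at all.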

>
> Dave.
> >
> > >
> > > To confirm it's caused by VGA printing, I blacklisted mgag200 by
> > > writing it into /etc/modprobe.d/blacklist.conf. The kdump kernel then
> > > booted successfully. Adding 'nomodeset' also makes it work. So it's
> > > for sure that the mgag200 driver or related code has something wrong
> > > when the boot code tries to re-init it.
> > >
> > > This is the only case we have ever seen, so we tend to pursue a fix on
> > > the mgag200 driver side. Any ideas or suggestions? We have two machines
> > > that can reproduce it stably.
> > >
> > > Thanks
> > > Baoquan