Re: 3.14-rc7 crashes in drm ([PATCH] a crash in mga_driver_irq_uninstall)
From: Daniel Vetter
Date: Mon Mar 24 2014 - 16:26:21 EST
On Mon, Mar 24, 2014 at 01:17:12PM -0400, Mikulas Patocka wrote:
>
>
> On Mon, 24 Mar 2014, Daniel Vetter wrote:
>
> > On Mon, Mar 24, 2014 at 07:45:47AM +1000, Dave Airlie wrote:
> > > On Mon, Mar 24, 2014 at 7:27 AM, Andreas Mohr <andi@xxxxxxxx> wrote:
> > > > On Sun, Mar 23, 2014 at 09:39:16AM -0700, Linus Torvalds wrote:
> > > >> On Sun, Mar 23, 2014 at 5:15 AM, Andreas Mohr <andi@xxxxxxxx> wrote:
> > > >> >
> > > >> > which did end up flawless on 3.12.0-rc2+, too
> > > >> > (but failed to improve the issue on 3.14.0-rc7+).
> > > >> >
> > > >> > So, for all intents and purposes, drm infrastructure seems unavoidably
> > > >> > (neither dri disable nor libdrm upgrade helps) affected.
> > > >> > Does anyone know which change caused that issue?
> > > >> > (I'm asking because bisect here would be relatively painful).
> > > >>
> > > >> So 3.12-rc2 works. Does 3.13 work? Is this a regression in the current
> > > >> 3.14 rc only, or did it happen already in the previous release?
> > > >
> > > > Hmm, given that Mikulas in
> > > > https://lkml.org/lkml/2014/2/26/537
> > > > offered a diff of linux-3.13.5 files, it truly seems (shock! ack! noo!)
> > > > that that indeed may have been a regression at <= 3.13 proper even
> > > > (which may pose interesting questions about the level of testing coverage
> > > > we still enjoy [not!?] in this hardware area).
>
> That patch drops a mutex, so it is not correct. There is mutex resursion -
> we need to uninstall the irq in drm_master_destroy, because here we are
> committed to destroy the device. But the routine that uninstalls the irq
> takes struct_mutex, which is already held in drm_master_destroy.
>
> I suppose that the person who maintains drm reworks the patch so that it's
> correct:
>
> - could we use a different mutex to protect the irq in drm_irq.c? Or
> possibly no mutex at all and use cmpxchg to manipulate the variable
> dev->irq_enabled? - this seems like the best solution. But I am not sure
> if the code in drm_irq.c somehow depends on the facts that other parts of
> the drm subsystem take struct_mutex.
>
> - could we pass a new argument to drm_irq_uninstall that tells it not to
> take the mutex? drm_master_destroy would set this argument to 1.
> drm_master_destroy is mostly called with struct_mutex held, but there may
> be places in vmwgfx_drv.c where drm_master_put (which calls
> drm_master_destroy) may be called without struct_mutex held.
>
> Is it true that drm_master_destroy can be called without struct_mutex
> held? I don't know.
>
>
> I think drm maintainer should sort out the above issues and modify the
> patch accordingly.
>
> > > > Oh well, seems I'll have to prepare/build 3.13 now...
> > >
> > > It's > 15 year old hardware, so yes I believe we have close to 0
> > > testing coverage on it outside of distros,
> > >
> > > I'm not even sure I have one anymore, I might be able to test an MGA in one box.
> >
> > I haven't done a full read of all the related code, but this smells like a
> > similar bug I've hit all over the place in the i810 driver (another one of
> > those undead drm drivers of yonders). Ingredients:
> >
> > 1) Xorg creates a drm mapping of the register space.
> > 2) Xorg tells the hw-specific drm which drm mapping has the hw registers,
> > and the driver uses that. Iirc this has been done as some form of OS
> > abstraction. Also note that these mappings aren't refcounted, so the first
> > guy to call drm_rmmap wins.
> >
> > -> All hell breaks loose if Xorg dies and takes all it's mappings with it
> > (in master_destroy, since the Xorg /dev fd is the master) and leaves the
> > driver hanging in the air if there's an interrupt still pending (or
> > anything else fwiw).
>
> For me that crash happened when xorg exited with a fatal error too.
Is this fatal error itself a regression or have you seen that on older
kernels, too?
Like I've said the entire teardown sequence for legacy drm drivers is
terminally busted, so the only hope we have is to reapply this missing
duct-tape which made your X crash. But if that itself isn't a regression
there's no way to fix the current drm/mga driver without a complete
rewrite as a new-style kernel modesetting driver.
-Daniel
--
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/