Re: [PATCH 0/1] [RFC] DRM locking issues during early open

From: Jesse Barnes
Date: Thu Apr 19 2012 - 12:55:18 EST


On Thu, 19 Apr 2012 17:52:39 +0100
Dave Airlie <airlied@xxxxxxxxx> wrote:

> On Thu, Apr 19, 2012 at 5:47 PM, Dave Airlie <airlied@xxxxxxxxx> wrote:
> > On Thu, Apr 19, 2012 at 5:41 PM, Andy Whitcroft <apw@xxxxxxxxxxxxx> wrote:
> >> On Thu, Apr 19, 2012 at 05:30:03PM +0100, Dave Airlie wrote:
> >>> On Thu, Apr 19, 2012 at 5:22 PM, Andy Whitcroft <apw@xxxxxxxxxxxxx> wrote:
> >>> > We have been carrying a (rather poor) patch for an issue we identified in
> >>> > the DRM driver.  This issue is triggered when a DRM device is initialising
> >>> > and userspace attempts to open it, typically in response to the sysfs
> >>> > device added event.  Basically we allocate the minor numbers making
> >>> > the device available, and then call the drm load callback.  Until this
> >>> > completes the device is really not ready and these early opens typically
> >>> > lead to oopses.
> >>> >
> >>> > We have been using the following patch to avoid this by marking the minors
> >>> > as in error until the load method has completed.  This avoids the early
> >>> > open by simply erroring out the opens with EAGAIN.  Obviously we should
> >>> > be delaying the open until the load method completes.
> >>> >
> >>> > I include the existing patch for completeness (it is not really ready for
> >>> > merging) to illustrate the issue.  I think it is logical that the wait
> >>> > should simply be delayed until the load has completed.  I am proposing
> >>> > to include a wait queue associated with the idr cache for the drm minors
> >>> > which we can use to allow open callers to wait_event_interruptible() on.
> >>> > I'll be putting together a prototype shortly and will follow up with it.
> >>> >
> >>> > Thoughts?
> >>>
> >>> Couldn't we just delay registering things until the driver is ready to
> >>> accept an open?
> >>>
> >>> Granted the midlayer of drm doesn't make that easy,
> >>
> >> It seems that we need the dri minor allocated before we hit the load
> >> function as things are done right now.
> >>
> >>> thanks for sending this out, it keeps falling off my radar, I don't
> >>> think I've ever seen this reported on RHEL/Fedora, which makes me
> >>> wonder what we are doing that makes us lucky.
> >>
> >> We never hit it until we started doing things earlier and quicker.  I first
> >> found it in the prettification of boot so we were keen to get plymouth
> >> running as soon as possible.  That led to random panics and me finding
> >> this bug.  The window is tiny as far as I know and it tends to be specific
> >> machines and specific package combinations which trigger it reliably.
> >>
> >> I suspect that a proper fix would allow delaying the registration as you
> >> suggest but in the interim a wait would at least avoid the issues we are
> >> seeing.  I will see how awful it looks.
> >
> > Just to confirm it's the drm_sysfs_device_add that causes the race we care about.
> >
> > it needs to happen after the driver is happy. Since it calls
> > device_register and that is what triggers udev magic to load the
> > userspace.
> >
> > If you have a userspace app banging on a static device node that might
> > need another set of fun fixes.
>
> Okay the sysfs add and the idr_replace are the things we need to delay.

Since you can still get at things with a static node, it seems like
locking is the real issue here? Is there no mutex we can take across
init to block any openers until we're done?
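
To illustrate what I mean, here is a rough userspace sketch using
pthreads.  It is not DRM code: struct fake_dev, fake_load(),
fake_open() and fake_dev_init() are all made-up stand-ins, and the
real open and load paths would need their own audit for lock ordering.
The only point is that an opener arriving while init holds the lock
sleeps until load is done, instead of seeing a half-built device.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for a device; NOT the real struct drm_device. */
struct fake_dev {
	pthread_mutex_t init_lock;	/* held across the whole init/load step */
	bool loaded;			/* true once load has completed */
	pthread_t opener;		/* simulated early userspace opener */
	int open_result;		/* what that opener got back */
};

/* Stand-in for the driver load callback: just takes a little while. */
static int fake_load(struct fake_dev *dev)
{
	struct timespec ts = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };

	(void)dev;
	nanosleep(&ts, NULL);
	return 0;
}

/* Open path: blocks on init_lock, so it cannot run in the middle of load. */
int fake_open(struct fake_dev *dev)
{
	int ret;

	pthread_mutex_lock(&dev->init_lock);
	ret = dev->loaded ? 0 : -ENODEV;
	pthread_mutex_unlock(&dev->init_lock);
	return ret;
}

static void *early_opener(void *arg)
{
	struct fake_dev *dev = arg;

	dev->open_result = fake_open(dev);
	return NULL;
}

/* Init path: take the lock before the node is reachable, drop it after load. */
int fake_dev_init(struct fake_dev *dev)
{
	int ret;

	pthread_mutex_lock(&dev->init_lock);

	/* The node is "visible" from here on; simulate an immediate open. */
	ret = pthread_create(&dev->opener, NULL, early_opener, dev);
	assert(ret == 0);

	ret = fake_load(dev);
	dev->loaded = (ret == 0);

	pthread_mutex_unlock(&dev->init_lock);
	return ret;
}

/* Returns the result seen by an opener that raced with init. */
int demo_open_during_init(void)
{
	struct fake_dev dev = { .loaded = false, .open_result = -1 };

	pthread_mutex_init(&dev.init_lock, NULL);
	if (fake_dev_init(&dev) != 0)
		return -EIO;
	pthread_join(dev.opener, NULL);
	pthread_mutex_destroy(&dev.init_lock);
	return dev.open_result;
}
```

In this sketch the early opener is created while the lock is already
held, so it can only get past fake_open() after load has finished.
The same lock would also cover a static device node, since it does
not depend on when the sysfs event goes out.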

--
Jesse Barnes, Intel Open Source Technology Center