Re: [BUG] drm: zynqmp_dp: Lockup in zynqmp_dp_bridge_detect when device is unbound

From: Laurent Pinchart
Date: Mon May 06 2024 - 12:51:13 EST


On Mon, May 06, 2024 at 10:57:17AM -0400, Sean Anderson wrote:
> On 5/6/24 03:35, Laurent Pinchart wrote:
> > On Mon, May 06, 2024 at 09:29:36AM +0200, Maxime Ripard wrote:
> >> Hi Laurent, Sean,
> >>
> >> On Sat, May 04, 2024 at 03:21:18PM GMT, Laurent Pinchart wrote:
> >> > On Fri, May 03, 2024 at 05:54:32PM -0400, Sean Anderson wrote:
> >> > > I have discovered a bug in the displayport driver on drm-misc-next. To
> >> > > trigger it, run
> >> > >
> >> > > echo fd4a0000.display > /sys/bus/platform/drivers/zynqmp-dpsub/unbind
> >> > >
> >> > > The system will become unresponsive and (after a bit) splat with a hard
> >> > > LOCKUP. One core will be unresponsive at the first zynqmp_dp_read in
> >> > > zynqmp_dp_bridge_detect.
> >> > >
> >> > > I believe the issue is due the registers being unmapped and the block
> >> > > put into reset in zynqmp_dp_remove instead of zynqmp_dpsub_release.
> >> >
> >> > That is on purpose. Drivers are not allowed to access the device at all
> >> > after .remove() returns.
> >>
> >> It's not "on purpose" no. Drivers indeed are not allowed to access the
> >> device after remove, but the kernel shouldn't crash. This is exactly
> >> why we have drm_dev_enter / drm_dev_exit.
> >
> > I didn't mean the crash was on purpose :-) It's the registers being
> > unmapped that is, as nothing should touch those registers after
> > .remove() returns.
>
> OK, so then we need to have some kind of flag in the driver or in the drm
> subsystem so we know not to access those registers.

To avoid race conditions, the .remove() function should mark the device
as removed, wait for all ongoing access from userspace to be complete,
and then proceed to unmapping registers and doing other cleanups.
Userspace may still have open file descriptors to the device at that
point. Any new userspace access should be disallowed (by checking the
removed flag), with the only userspace-initiated operations that still
need to run being the release-related operations (unmapping memory,
closing file descriptors, ...).

--
Regards,

Laurent Pinchart