Re: [PATCH] drm/gpuvm: take refcount on DRM device
From: Thomas Hellström
Date: Mon Apr 20 2026 - 05:29:09 EST
On Fri, 2026-04-17 at 21:33 +0200, Danilo Krummrich wrote:
> On Fri Apr 17, 2026 at 4:41 PM CEST, Thomas Hellström wrote:
> > This is problematic since typically you also need a module reference
> > when taking a drm device reference.
> >
> > The reason for this is that the devres reference on the drm device
> > expects to be the last one, since it might be called from the module
> > exit function of the driver.
>
> No, this is not how it works; if this would be true then drmm_* would be
> pretty pointless in the first place, as one could just use devm_* for
> everything.
>
> Citing the commit introducing drmm_* APIs:
>
> "The biggest wrong pattern is that developers use devm_, which ties the
> release action to the underlying struct device, whereas all the
> userspace visible stuff attached to a drm_device can long outlive that
> one (e.g. after a hotunplug while userspace has open files and mmap'ed
> buffers)."
Yeah, I was a bit unclear and partly incorrect. This only happens *if*
there are no other holders of the driver module reference (but see
below regarding potential other holders):
driver_module_unload() -> ... -> pci_dev_remove -> devm_release ->
drm_dev_put -> <module is unloaded>.
So if there are additional drm device references at this point, they'd
be left dangling once the module is gone.
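
A rough userspace model of that sequence, purely for illustration. The
model_* names are made up and are not kernel APIs; the refcount stands in
for the drm_device kref and a flag stands in for "driver text still
loaded":

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_drm_dev {
	int  refcount;      /* stands in for the drm_device refcount */
	bool released;      /* set once the release callback has run */
	bool module_loaded; /* is the driver code still around? */
};

static void model_dev_init(struct model_drm_dev *dev)
{
	dev->refcount = 1;  /* the reference owned by devres */
	dev->released = false;
	dev->module_loaded = true;
}

static void model_dev_get(struct model_drm_dev *dev)
{
	dev->refcount++;
}

static void model_dev_put(struct model_drm_dev *dev)
{
	assert(dev->refcount > 0);
	if (--dev->refcount == 0) {
		/* The release callback is driver code, so it needs the module. */
		assert(dev->module_loaded);
		dev->released = true;
	}
}

/*
 * Models module unload: devres drops its reference, then the driver code
 * goes away. Returns true if the device is left dangling, i.e. someone
 * still holds a reference after the module is gone.
 */
static bool model_module_unload(struct model_drm_dev *dev)
{
	model_dev_put(dev);
	dev->module_loaded = false;
	return !dev->released;
}
```

With only the devres reference, model_module_unload() releases the device
and returns false. With one extra model_dev_get() it returns true, and a
later model_dev_put() would trip the module_loaded assertion, which is the
dangling case described above.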
>
> > Now if there is an additional reference held at that point the driver
> > module can be unloaded with a dangling reference to the drm device.
> >
> > On the other hand, if you in addition take a module reference then that
> > blocks the driver module from being unloaded while held, just like a
> > drm file reference. This leads to complicated module release schemes
> > like the one in drm_pagemap where the module refcount is released from
> > a work item that is waited on in the drm_pagemap exit function.
> >
> > I'm working to lift the module refcount requirement, but meanwhile I'd
> > recommend that in the file close callback, we'd make sure all
> > drm_gpuvms have called their drm_gpuvm_free() function, because then we
> > are sure that the drm_device is still alive and the module still
> > pinned.
>
> If GPUVM has a pointer to the DRM device, it implies shared ownership and
> hence GPUVM should account for this shared ownership and take a reference
> count.
>
> The fact that GPUVM must not outlive module unload when it has driver
> callbacks attached is an orthogonal requirement.
>
> The module lifetime / callback issue is a separate problem that exists
> regardless of whether you hold a device refcount. Not taking the refcount
> doesn't make the module problem go away, it just adds a second,
> independent bug.
>
> If struct drm_device itself, e.g. due to drm_dev_release() requires a
> module refcount, then this is on struct drm_device to ensure this
> constraint (or remove the requirement).
>
> IOW, if I get to choose between a DRM component that has a pointer to a
> DRM device stalls module unload and a DRM component that has a pointer to
> a DRM device oopses the kernel when used wrongly, I prefer the former.
I agree with your reasoning here, but the fact is that most (if not
all) holders of a drm device reference (files, pagemaps, dma-bufs)
currently also hold a module reference to protect against this, and
drm_gpuvm would be an outlier.
To fix this properly (lifting that requirement), one could introduce a
drm device count in the module and have the module exit function wait
for it to become zero, *and* for the code that did the last decrement
to finish executing.
https://patchwork.freedesktop.org/patch/712146/?series=163298&rev=1
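
As a userspace illustration of the "count devices, wait in module exit"
idea only. This is not the code from the linked patch, and the model_*
names are hypothetical. A mutex and condition variable stand in for
whatever primitive the kernel side would use:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names are kernel APIs. */
struct model_module {
	pthread_mutex_t lock;
	pthread_cond_t  idle;      /* signalled when dev_count drops to zero */
	int             dev_count; /* live drm devices backed by this module */
};

static void model_module_init(struct model_module *m)
{
	pthread_mutex_init(&m->lock, NULL);
	pthread_cond_init(&m->idle, NULL);
	m->dev_count = 0;
}

/* Called when a drm device backed by this module is created. */
static void model_module_dev_inc(struct model_module *m)
{
	pthread_mutex_lock(&m->lock);
	m->dev_count++;
	pthread_mutex_unlock(&m->lock);
}

/* Called from the final release of a drm device. */
static void model_module_dev_dec(struct model_module *m)
{
	pthread_mutex_lock(&m->lock);
	assert(m->dev_count > 0);
	if (--m->dev_count == 0)
		pthread_cond_broadcast(&m->idle);
	pthread_mutex_unlock(&m->lock);
}

/* Non-blocking check, handy for tests. */
static bool model_module_is_idle(struct model_module *m)
{
	bool idle;

	pthread_mutex_lock(&m->lock);
	idle = (m->dev_count == 0);
	pthread_mutex_unlock(&m->lock);
	return idle;
}

/* Module exit: block until every device backed by the module is gone. */
static void model_module_exit_wait(struct model_module *m)
{
	pthread_mutex_lock(&m->lock);
	while (m->dev_count > 0)
		pthread_cond_wait(&m->idle, &m->lock);
	pthread_mutex_unlock(&m->lock);
}
```

Note that this model only covers the first half of the requirement. After
model_module_dev_dec() drops the lock, the caller still has to return, and
in the kernel that return path may itself be module code. That is the
"last decrement finished executing" part, which a plain counter plus wait
does not solve on its own.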
Or one could have the drm device hold a reference count on the
driver module, but that would block unloading without a previous
unbind, which is not typical driver behaviour and would likely be seen
as a regression.
/Thomas
>
> - Danilo