Re: [RFC PATCH 0/3] gpu: nova-core: add basic timer subdevice implementation

From: Jason Gunthorpe
Date: Mon Mar 03 2025 - 16:50:19 EST


On Mon, Mar 03, 2025 at 08:36:34PM +0100, Danilo Krummrich wrote:
> > > And yes, for *device resources* it is unsound if we do not ensure that the
> > > device resource is actually dropped at device unbind.
> >
> > Why not do a runtime validation then?
> >
> > It would be easy to have an atomic counting how many devres objects
> > are still alive.
>
> (1) It would not be easy at all, if not impossible.
>
> A Devres object doesn't know whether it's embedded in an Arc<Devres>, nor does
> it know whether it is embedded in subsequent Arc containers, e.g.
> Arc<Arc<Devres>>.

You aren't tracking that. If Rust has a problem, implement it in C:

devm_rsgc_start(struct device *)
devm_rsgc_pci_iomap(struct device *,..)
devm_rsgc_pci_iounmap(struct device *,..)
devm_rsgc_fence(struct device *)

Surely that isn't a problem for bindings?

> > Properly written drives never hit it. Buggy drivers will throw a
> > warning and otherwise function safely.
>
> Ignoring (1), I think that's exactly the opposite of what we want to achieve.
>
> This would mean that the Rust abstraction does *not avoid* but *only detect* the
> problem.

It is preventing memory leaks. We would like to prevent memory leaks
in the kernel. Currently you can't even detect them, so this seems
like an improvement.

I am using the practical kernel definition for "memory leak" of memory
outliving remove() rather than the more philisophical rust version
that memory is someday, eventually, freed.

In an environment like the kernel the distinction is important and
desirable.

> The practical problem: Buggy drivers could (as you propose) stall the
> corresponding task forever, never releasing the device resource.

Should't be never. Rust never leaks memory? It will eventually be
unstuck when Rust frees the memory in the concurrent context that is
holding it.

> Not releasing the device resource may stall subsequent drivers
> trying to probe the device, or, if the physical memory region has
> been reassigned to another device, prevent another device from
> probing. This is *not* what I would call "function safely".

It didn't create a security problem.

The driver could also write

while (true)
sleep(1)

In the remove function and achieve all of the above bad things, should
you try to statically prevent that too? I've written stuff like the
above in drivers before, it is an easy way to make thing safe when
working with FDs, so don't think nobody would do that.

So I don't see alot of value in arguing you can close of 1 way to
achieve something bad and still leave open all sorts of others. At
least is more philisophical than I prefer to get.

It is better to focus on usability, performance, and bug class
detection, IMHO.

Jason