Re: [PATCH v1 RESEND 4/4] drm/tyr: add GPU reset handling
From: Daniel Almeida
Date: Thu Apr 09 2026 - 09:45:15 EST
Hi Onur,
> On 9 Apr 2026, at 08:41, Onur Özkan <work@xxxxxxxxxxxxx> wrote:
>
> On Fri, 03 Apr 2026 12:01:09 -0300
> Daniel Almeida <daniel.almeida@xxxxxxxxxxxxx> wrote:
>
>>
>>
>>> On 19 Mar 2026, at 08:08, Boris Brezillon <boris.brezillon@xxxxxxxxxxxxx> wrote:
>>>
>>> On Fri, 13 Mar 2026 12:16:44 +0300
>>> Onur Özkan <work@xxxxxxxxxxxxx> wrote:
>>>
>>>> +impl Controller {
>>>> +    /// Creates a [`Controller`] instance.
>>>> +    fn new(pdev: ARef<platform::Device>, iomem: Arc<Devres<IoMem>>) -> Result<Arc<Self>> {
>>>> +        let wq = workqueue::OrderedQueue::new(c"tyr-reset-wq", 0)?;
>>>> +
>>>> +        Arc::pin_init(
>>>> +            try_pin_init!(Self {
>>>> +                pdev,
>>>> +                iomem,
>>>> +                pending: Atomic::new(false),
>>>> +                wq,
>>>> +                work <- kernel::new_work!("tyr::reset"),
>>>> +            }),
>>>> +            GFP_KERNEL,
>>>> +        )
>>>> +    }
>>>> +
>>>> +    /// Processes one scheduled reset request.
>>>> +    ///
>>>> +    /// Panthor reference:
>>>> +    /// - drivers/gpu/drm/panthor/panthor_device.c::panthor_device_reset_work()
>>>> +    fn reset_work(self: &Arc<Self>) {
>>>> +        dev_info!(self.pdev.as_ref(), "GPU reset work is started.\n");
>>>> +
>>>> +        // SAFETY: `Controller` is part of driver-private data and only exists
>>>> +        // while the platform device is bound.
>>>> +        let pdev = unsafe { self.pdev.as_ref().as_bound() };
>>>> +        if let Err(e) = run_reset(pdev, &self.iomem) {
>>>> +            dev_err!(self.pdev.as_ref(), "GPU reset failed: {:?}\n", e);
>>>> +        } else {
>>>> +            dev_info!(self.pdev.as_ref(), "GPU reset work is done.\n");
>>>> +        }
>>>
>>> Unfortunately, the reset operation is not as simple as instructing the
>>> GPU to reset, it's a complex synchronization process where you need to
>>> try to gracefully put various components on hold before you reset, and
>>> then resume those after the reset is effective. Of course, with what we
>>> currently have in-tree, there's not much to suspend/resume, but I think
>>> I'd prefer to design the thing so we can progressively add more
>>> components without changing the reset logic too much.
>>>
>>> I would probably start with a Resettable trait that has the
>>> {pre,post}_reset() methods that exist in Panthor.
>>>
>>> The other thing we need is a way for those components to know when a
>>> reset is about to happen so they can postpone some actions they were
>>> planning in order to not further delay the reset, or end up with
>>> actions that fail because the HW is already unusable. Not too sure how
>>> we want to handle that though. Panthor is currently sprinkled with
>>> panthor_device_reset_is_pending() calls in key places, but that's still
>>> very manual, maybe we can automate that a bit more in Tyr, dunno.
>>>
>>
>>
>> We could have an enum where one of the variants is Resetting, and the other one
>> gives access to whatever state is not accessible while resets are in progress.
>>
>> Something like
>>
>> pub enum TyrData {
>>     Active(ActiveTyrData),
>>     ResetInProgress(ActiveTyrData),
>> }
>>
>> fn access() -> Option<&ActiveTyrData> {
>>     … // if the “ResetInProgress” variant is active, return None
>> }
>>
>
> That's an interesting approach, but if it's all about `fn access` function, we
> can already do that with a simple atomic state e.g.,:
>
> // The state flag in reset::Controller
> state: Atomic<ResetState>,
>
> fn access(&self) -> Option<&Arc<Devres<IoMem>>> {
>     match self.state.load(Relaxed) {
>         ResetState::Idle => Some(&self.iomem),
>         _ => None,
>     }
> }
>
> What do you think? Would this be sufficient?
>
> Btw, a sample code snippet from the caller side would be very helpful for
> designing this further. That part is kind of blurry for me.
>
> Thanks,
> Onur
>
>>
>>>> +
>>>> +        self.pending.store(false, Release);
>>>> +    }
>>>> +}
I think that there are two things we're trying to implement correctly:
1) Deny access to a subset of the state while a reset is in progress
2) Wait for anyone accessing 1) to finish before starting a reset
IIUC, using Atomic<T> can solve 1) by bailing if the "reset in progress"
flag/variant is set, but I don't think it implements 2): one would have to
add extra logic to block until the state is no longer being actively used.
Now, there are probably easier ways to solve this, but I propose that we do the
extra legwork to make this explicit and enforceable by the type system.
How about introducing a r/w semaphore abstraction? It seems to correctly encode
the logic we want:
a) multiple users can access the state if no reset is pending ("read" side)
b) the reset code can block until the state is no longer being accessed (the "write" side)
In Tyr, this would roughly map to something like:
struct TyrData {
    reset_gate: RwSem<ActiveHwState>,
    // other, always accessible members
}

impl TyrData {
    fn try_access(&self) -> Option<ReadGuard<'_, ActiveHwState>> { ... }
}
Where ActiveHwState contains the fw/mmu/sched blocks (these are not upstream
yet, Deborah has a series that will introduce the fw block that should land
soon) and perhaps more.
Now, the reset logic would be roughly:
fn reset_work(tdev: Arc<TyrDevice>) {
    // Block until nobody else is accessing the hw, and prevent others
    // from initiating new accesses.
    let _guard = tdev.reset_gate.write();

    // pre_reset() all Resettable implementors
    // ... perform the actual reset ...
    // post_reset() all Resettable implementors
}
Now, for every block that might touch a resource that would be unavailable
during a reset, we enforce a try_access() via the type system, and ensure that
the reset cannot start while the guard is alive. In particular, ioctls would
look like:
fn ioctl_foo(...) {
    let hw = tdev.reset_gate.try_access()?;
    // Resets are blocked while the guard is alive, and there is no
    // other way to access that state.
}
The code will not compile otherwise, so long as we keep the state in
ActiveHwState, i.e. protected by the rwsem.
This looks like an improvement over Panthor, since Panthor relies on manually
canceling work that accesses hw state via cancel_work_sync(), and on gating new
work submissions on the "reset_in_progress" flag, i.e.:
/**
* sched_queue_work() - Queue a scheduler work.
* @sched: Scheduler object.
* @wname: Work name.
*
* Conditionally queues a scheduler work if no reset is pending/in-progress.
*/
#define sched_queue_work(sched, wname) \
	do { \
		if (!atomic_read(&(sched)->reset.in_progress) && \
		    !panthor_device_reset_is_pending((sched)->ptdev)) \
			queue_work((sched)->wq, &(sched)->wname ## _work); \
	} while (0)
Thoughts?
— Daniel