Re: [PATCH v3] perf/arm-dmc620: Fix dmc620_pmu_irqs_lock/cpu_hotplug_lock circular lock dependency

From: Waiman Long
Date: Fri Aug 04 2023 - 12:52:33 EST


On 8/4/23 12:28, Will Deacon wrote:
struct dmc620_pmu {
@@ -423,9 +424,14 @@ static struct dmc620_pmu_irq *__dmc620_pmu_get_irq(int irq_num)
struct dmc620_pmu_irq *irq;
int ret;
- list_for_each_entry(irq, &dmc620_pmu_irqs, irqs_node)
- if (irq->irq_num == irq_num && refcount_inc_not_zero(&irq->refcount))
+ list_for_each_entry(irq, &dmc620_pmu_irqs, irqs_node) {
+ if (irq->irq_num != irq_num)
+ continue;
+ if (!irq->valid)
+ return ERR_PTR(-EAGAIN); /* Try again later */
It looks like this can bubble up to the probe() routine. Does the driver
core handle -EAGAIN coming back from a probe routine?
Right, I should add code to handle this error condition. I think it can be
handled in dmc620_pmu_get_irq(). The important thing is to release the
mutex, wait a few ms and try again. What do you think?
I don't really follow, but waiting a few ms and trying again sounds like
a really nasty hack for something which doesn't appear to be constrained
by broken hardware. In other words, we got ourselves into this mess, so
we should be able to resolve it properly.

From my point of view, the proper way to solve the problem is to reverse the locking order. Since you don't to add a EXPORT statement to the core kernel code, we will have to find a way around it by not holding the dmc620_pmu_irqs_lock when cpuhp_state_add_instance_nocalls() is called. Another alternative that I can think of is to add one more mutex that we will hold just for the entirety of  __dmc620_pmu_get_irq() and take dmc620_pmu_irqs_lock only when the linked list is being modified. That will eliminate the need to introduce arbitrary wait as other caller of __dmc620_pmu_get_irq() will wait in the new mutex. Will this work for you?

Cheers,
Longman