Re: [RFC PATCH] cpu hotplug: rework cpu_hotplug locking (was [LOCKDEP] cpufreq: possible circular locking dependency detected)

From: Sergey Senozhatsky
Date: Sat Jun 29 2013 - 03:36:46 EST


On (06/28/13 19:43), Srivatsa S. Bhat wrote:
> On 06/28/2013 01:14 PM, Sergey Senozhatsky wrote:
> > Hello,
> > Yes, this helps, of course, but at the same time it brings back the previous
> > problem -- preventing cpu_hotplug in some places.
> >
> >
> > I have a somewhat different (perhaps naive) RFC patch and would like to
> > hear comments.
> >
> >
> >
> > The idea is to break the existing lock dependency chain by not holding the
> > cpu_hotplug lock mutex across the calls. In order to detect active refcount
> > readers or an active writer, the refcount may now have the following
> > values:
> >
> > -1: active writer -- only one writer may be active, readers are blocked
> > 0: no readers/writer
> > >0: active readers -- many readers may be active, writer is blocked
> >
> > "blocked" reader or writer goes to wait_queue. as soon as writer finishes
> > (refcount becomes 0), it wakeups all existing processes in a wait_queue.
> > reader perform wakeup call only when it sees that pending writer is present
> > (active_writer is not NULL).
> >
> > The cpu_hotplug lock is now only required to protect the refcount compare,
> > inc and dec operations, so it can be changed to a spinlock.
> >
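
(To make the idea more concrete -- a simplified sketch of the scheme I had
in mind, not the actual patch; writer-vs-writer serialization, #ifdefs and
error handling are omitted, and the real functions obviously do more:)

#include <linux/spinlock.h>
#include <linux/wait.h>
#include <linux/sched.h>

/*
 * refcount semantics:
 *   -1  -- one active writer (hotplug operation in progress)
 *    0  -- free
 *   >0  -- number of active readers
 */
static DEFINE_SPINLOCK(hotplug_lock);		/* protects hotplug_refcount */
static DECLARE_WAIT_QUEUE_HEAD(hotplug_wq);
static int hotplug_refcount;
static struct task_struct *active_writer;	/* non-NULL: writer pending/active */

void get_online_cpus(void)
{
	for (;;) {
		spin_lock(&hotplug_lock);
		if (!active_writer) {
			hotplug_refcount++;	/* reader enters */
			spin_unlock(&hotplug_lock);
			return;
		}
		spin_unlock(&hotplug_lock);
		/* writer pending or active -- sleep until it is done */
		wait_event(hotplug_wq, !active_writer);
	}
}

void put_online_cpus(void)
{
	spin_lock(&hotplug_lock);
	hotplug_refcount--;
	/* wake up only when a pending writer waits for readers to drain */
	if (!hotplug_refcount && active_writer)
		wake_up(&hotplug_wq);
	spin_unlock(&hotplug_lock);
}

static void cpu_hotplug_begin(void)
{
	spin_lock(&hotplug_lock);
	active_writer = current;		/* blocks new readers */
	while (hotplug_refcount > 0) {
		/* nothing is held across the sleep -- that is the point */
		spin_unlock(&hotplug_lock);
		wait_event(hotplug_wq, hotplug_refcount <= 0);
		spin_lock(&hotplug_lock);
	}
	hotplug_refcount = -1;			/* mark the active writer */
	spin_unlock(&hotplug_lock);
}

static void cpu_hotplug_done(void)
{
	spin_lock(&hotplug_lock);
	hotplug_refcount = 0;
	active_writer = NULL;
	spin_unlock(&hotplug_lock);
	wake_up(&hotplug_wq);			/* wake everyone queued up meanwhile */
}
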
>
> Hmm, now that I actually looked at your patch, I see that it is completely
> wrong! I'm sure you intended to fix the *bug*, but instead you ended
> up merely silencing the *warning* (and also left lockdep blind), leaving
> the actual bug as it is!
>

Thank you for your time and review.


> So let me summarize what the actual bug is and what is it that actually
> needs fixing:
>
> Basically you have 2 things -
> 1. A work item (cs_dbs_timer in this case) that can re-arm itself
> using gov_queue_work(). And gov_queue_work() uses get/put_online_cpus().
>
> 2. In the cpu_down() path, you want to cancel the worker item and destroy
> and cleanup its resources (the timer_mutex).
>
> So the problem is that you can deadlock like this:
>
> CPU 3                                CPU 4
>
> cpu_down()
> -> acquire hotplug.lock
>
>                                      cs_dbs_timer()
>                                      -> get_online_cpus()
>                                      //wait on hotplug.lock
>
> try to cancel cs_dbs_timer()
> synchronously.
>
> That leads to a deadlock, because cs_dbs_timer() is waiting to
> get the hotplug lock which CPU 3 is holding, whereas CPU 3 is
> waiting for cs_dbs_timer() to finish. So they can end up waiting
> for each other, forever. (Yeah, the lockdep splat might have been
> a bit cryptic to decode, but that's what it boils down to).
>
> So to fix the *bug*, you need to avoid waiting synchronously while
> holding the hotplug lock. Possibly by using cancel_delayed_work_sync()
> under CPU_POST_DEAD or something like that. That would remove the deadlock
> possibility.

will take a look. Thank you!
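
If I understood the suggestion correctly, the fix would look roughly like
this (an untested sketch; gov_cancel_work_for_cpu() is a name I made up for
"cancel_delayed_work_sync() on the governor's per-cpu work item"):

#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/workqueue.h>

static void gov_cancel_work_for_cpu(unsigned int cpu)
{
	/* e.g. cancel_delayed_work_sync() on the dbs_info work for 'cpu' */
}

static int gov_hotplug_callback(struct notifier_block *nb,
				unsigned long action, void *hcpu)
{
	unsigned int cpu = (unsigned long)hcpu;

	switch (action & ~CPU_TASKS_FROZEN) {
	case CPU_POST_DEAD:
		/*
		 * _cpu_down() drops the hotplug lock (cpu_hotplug_done())
		 * before sending CPU_POST_DEAD, so waiting synchronously
		 * for cs_dbs_timer() here cannot deadlock against
		 * get_online_cpus().
		 */
		gov_cancel_work_for_cpu(cpu);
		break;
	}
	return NOTIFY_OK;
}

static struct notifier_block gov_hotplug_nb = {
	.notifier_call = gov_hotplug_callback,
};

/* registered once at governor init: register_hotcpu_notifier(&gov_hotplug_nb); */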

-ss

> Your patch, on the other hand, doesn't remove the deadlock possibility:
> just because you don't hold the lock throughout the hotplug operation
> doesn't mean that the task calling get_online_cpus() can sneak in and
> finish its work in-between a hotplug operation (because the refcount
> won't allow it to). Also, it should *not* be allowed to sneak in like
> that, since that constitutes *racing* with CPU hotplug, which it was
> meant to avoid!
>
> Also, as a side effect of not holding the lock throughout the hotplug
> operation, lockdep goes blind, and doesn't complain, even though the
> actual bug is still there! Effectively, this is nothing but papering
> over the bug and silencing the warning, which we should never do.
>
> So, please, fix the _cpufreq_ code to resolve the deadlock.
>
> Regards,
> Srivatsa S. Bhat
>