Re: mm: deadlock between get_online_cpus/pcpu_alloc

From: Thomas Gleixner
Date: Thu Feb 09 2017 - 12:41:26 EST


On Thu, 9 Feb 2017, Christoph Lameter wrote:
> On Thu, 9 Feb 2017, Thomas Gleixner wrote:
>
> > You are just not getting it, really.
> >
> > The problem is that this for_each_online_cpu() is racy against a concurrent
> > hot unplug and therefor can queue stuff for a not longer online cpu. That's
> > what the mm folks tried to avoid by preventing a CPU hotplug operation
> > before entering that loop.
>
> With a stop machine action it is NOT racy because the machine goes into a
> special kernel state that guarantees that key operating system structures
> are not touched. See mm/page_alloc.c's use of that characteristic to build
> zonelists. Thus it cannot be executing for_each_online_cpu and related
> tasks (unless one does not disable preempt .... but that is a given if a
> spinlock has been taken)..

drain_all_pages() is called from preemptible context. So what are you
talking about again?

> > > Lets get rid of get_online_cpus() etc.
> >
> > And that solves what?
>
> It gets rid of future issues with serialization in paths were we need to
> lock and still do for_each_online_cpu().

There are code pathes which might sleep inside the loop so
get_online_cpus() is the only way to serialize against hotplug.

Just because the only tool you know is stop machine it does not make
everything an atomic context where it can be applied.

> > Can you please start to understand the scope of the whole hotplug machinery
> > including the requirements for get_online_cpus() before you waste
> > everybodys time with your uninformed and halfbaken proposals?
>
> Its an obvious solution to the issues that have arisen multiple times with
> get_online_cpus() within the slab allocators. The hotplug machinery should
> make things as easy as possible for other people and having these
> get_online_cpus() everywhere does complicate things.

It's no obvious solution to everything. It's context dependend and people
have to think hard how to solve their problem within the context they are
dealing with.

Your 'get rid of get_online_cpus()' mantra does make all of this magically
simple. Relying on the fact, that the CPU online bit is cleared via
stomp_machine(), which is a horrible operation in various aspects, only
applies to a very small subset of problems. You can repeat your mantra
another thousand times and that will not make the way larger set of
problems magically disappear.

Thanks,

tglx