Re: [bisected] pre-3.16 regression on open() scalability

From: Paul E. McKenney
Date: Thu Jun 19 2014 - 00:19:26 EST


On Wed, Jun 18, 2014 at 08:38:16PM -0700, Andi Kleen wrote:
> On Wed, Jun 18, 2014 at 07:13:37PM -0700, Paul E. McKenney wrote:
> > On Wed, Jun 18, 2014 at 06:42:00PM -0700, Andi Kleen wrote:
> > >
> > > I still think it's totally the wrong direction to pollute so
> > > many fast paths with this obscure debugging check workaround
> > > unconditionally.
> >
> > OOM prevention should count for something, I would hope.
>
> OOM in what scenario? This is getting bizarre.

On the bizarre part, at least we agree on something. ;-)

CONFIG_NO_HZ_FULL booted with at least one nohz_full CPU. Said CPU
gets into the kernel and stays there, not necessarily generating RCU
callbacks. The other CPUs are very likely generating RCU callbacks.
Because the nohz_full CPU is in the kernel, and because there are no
scheduling-clock interrupts on that CPU, grace periods do not complete.
Eventually, the callbacks from the other CPUs (and perhaps also some
from the nohz_full CPU, for that matter) OOM the machine.

Now this scenario constitutes an abuse of CONFIG_NO_HZ_FULL, because it
is intended for CPUs that execute either in userspace (in which case
those CPUs are in extended quiescent states so that RCU can happily
ignore them) or for real-time workloads with low CPU utilization (in
which case RCU sees them go idle, which is also a quiescent state).
But that won't stop people from abusing their kernels and complaining
when things break.

This same thing can also happen without CONFIG_NO_HZ_FULL, though
the system has to work a bit harder. In this case, the CPU looping
in the kernel has scheduling-clock interrupts, but if all it does
is cond_resched(), RCU is never informed of any quiescent states.
The whole point of this patch is to make those cond_resched() calls,
which are quiescent states, visible to RCU.
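
For illustration, the starvation pattern described above looks roughly like
the following kernel-style sketch (do_some_work() is a hypothetical
placeholder, not code from the patch):

```c
/*
 * Sketch of the problematic loop: it is "polite" in that it offers
 * to reschedule, but historically cond_resched() did not report a
 * quiescent state to RCU, so a long-running loop like this could
 * stall grace periods and let callbacks pile up.
 */
for (;;) {
	if (do_some_work())	/* hypothetical; nonzero when done */
		break;
	cond_resched();		/* may schedule, but RCU is not told */
}
```

The point of the patch under discussion is to make that cond_resched()
call count as a quiescent state, so loops of this form no longer stall
grace periods.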

> If something keeps looping forever in the kernel creating
> RCU callbacks without any real quiescent states it's simply broken.

I could get behind that. But by that definition, there is a lot of
breakage in the current kernel, especially as we move to larger CPU
counts.

Thanx, Paul

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/