Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Mathieu Desnoyers
Date: Mon Mar 28 2016 - 11:07:48 EST


----- On Mar 28, 2016, at 9:29 AM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:

> On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
>> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
>>
>> > > Does that system have MONITOR/MWAIT errata?
>> >
>> > On the off-chance that this question was also directed at me,
>>
>> Hehe, it wasn't, however, since we're here..
>>
>> > here is
>> > what I am running on. I am running in a qemu/KVM virtual machine, in
>> > case that matters.
>>
>> Have you actually tried on real proper hardware? Does it still reproduce
>> there?
>
> Ross has, but I have not, given that I have a shared system on the one
> hand and a single-socket (four core, eight hardware thread) laptop on
> the other that has even longer reproduction times. The repeat-by is
> as follows:
>
> o Build a kernel with the following Kconfigs:
>
> CONFIG_SMP=y
> CONFIG_NR_CPUS=16
> CONFIG_PREEMPT_NONE=n
> CONFIG_PREEMPT_VOLUNTARY=n
> CONFIG_PREEMPT=y
> # This should result in CONFIG_PREEMPT_RCU=y
> CONFIG_HZ_PERIODIC=y
> CONFIG_NO_HZ_IDLE=n
> CONFIG_NO_HZ_FULL=n
> CONFIG_RCU_TRACE=y
> CONFIG_HOTPLUG_CPU=y
> CONFIG_RCU_FANOUT=2
> CONFIG_RCU_FANOUT_LEAF=2
> CONFIG_RCU_NOCB_CPU=n
> CONFIG_DEBUG_LOCK_ALLOC=n
> CONFIG_RCU_BOOST=y
> CONFIG_RCU_KTHREAD_PRIO=2
> CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
> CONFIG_RCU_EXPERT=y
> CONFIG_RCU_TORTURE_TEST=y
> CONFIG_PRINTK_TIME=y
> CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
> CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
> CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
>
> If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
> and modprobe/insmod the module manually.
>
> o Find a two-socket x86 system or larger, with at least 16 CPUs.
>
> o Boot the kernel with the following kernel boot parameters:
>
> rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
>
> The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
> When manually setting up the module, you get the holdoff for
> free, courtesy of human timescales.
>
> In the absence of instrumentation, I get failures usually within a
> couple of hours, though sometimes much longer. With instrumentation,
> the sky appears to be the limit. :-/
>
> Ross is running on bare metal with no CPU hotplug, so perhaps his setup
> is of more immediate interest. He is seeing the same symptoms that I am,
> namely a task being repeatedly awakened without actually coming out of
> TASK_INTERRUPTIBLE state, let alone running. As you pointed out earlier,
> he cannot be seeing the same bug that my crude patch suppresses, but
> given that I still see a few failures with that crude patch, it is quite
> possible that there is still a common bug.

With respect to bare metal vs. KVM guests, I've reported an issue where
the TSC is falsely detected as an unreliable clock source in a KVM
guest. The basic setup is to overcommit CPU use across the entire host,
which leads to preemption of the guest's vCPUs. The guest's clocksource
watchdog then falsely concludes that the TSC is unreliable, because the
guest gets preempted for a long time (e.g. 0.5 second) between its
reads of the HPET and the TSC.

Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
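To illustrate the failure mode, here is a minimal userspace sketch of
the delta comparison the clocksource watchdog performs. This is a
simplification for illustration only, not the actual
kernel/time/clocksource.c code: the read_hpet()/read_tsc() helpers and
the ~62.5 ms threshold constant are assumptions of the model.

/*
 * Simplified model of a clocksource-watchdog false positive:
 * the guest is preempted between its HPET read and its TSC read,
 * so the two deltas disagree even though both clocks are in sync.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Roughly the order of magnitude of the kernel's watchdog threshold. */
#define WATCHDOG_THRESHOLD_NS	62500000LL	/* ~62.5 ms */

/* Simulated "now" in nanoseconds; both clocks tick at the same rate. */
static int64_t host_time_ns;

static int64_t read_hpet(void) { return host_time_ns; }
static int64_t read_tsc(void)  { return host_time_ns; }

int main(void)
{
	int64_t wd_last = read_hpet();
	int64_t cs_last = read_tsc();

	/* One watchdog period (0.5 s) passes while the guest runs. */
	host_time_ns += 500000000LL;

	/* Watchdog fires: read the watchdog clock (HPET) first... */
	int64_t wd_now = read_hpet();

	/* ...the vCPU is preempted by the host for 0.5 s right here... */
	host_time_ns += 500000000LL;

	/* ...then read the clocksource under test (TSC). */
	int64_t cs_now = read_tsc();

	int64_t wd_nsec = wd_now - wd_last;
	int64_t cs_nsec = cs_now - cs_last;

	printf("wd_nsec=%lld cs_nsec=%lld skew=%lld\n",
	       (long long)wd_nsec, (long long)cs_nsec,
	       (long long)(cs_nsec - wd_nsec));

	if (llabs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD_NS)
		printf("-> watchdog would mark the TSC unstable, "
		       "even though both clocks agree\n");
	return 0;
}

Running this prints a 0.5 s apparent skew, far above the threshold,
so the (perfectly fine) TSC would be declared unstable.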

I'm wondering if what Paul is observing in the KVM setup might be
caused by such long preemptions of the guest by the host. One way to
stress-test this is to run parallel kernel builds on the host (or in
another guest) while the test guest is running, thus over-committing
the CPU.

Thoughts ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com