Re: rcu_preempt self-detected stall on CPU from 4.5-rc3, since 3.17

From: Mathieu Desnoyers
Date: Mon Mar 28 2016 - 12:12:33 EST


----- On Mar 28, 2016, at 11:56 AM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx wrote:

> On Mon, Mar 28, 2016 at 03:07:36PM +0000, Mathieu Desnoyers wrote:
>> ----- On Mar 28, 2016, at 9:29 AM, Paul E. McKenney paulmck@xxxxxxxxxxxxxxxxxx
>> wrote:
>>
>> > On Mon, Mar 28, 2016 at 08:28:51AM +0200, Peter Zijlstra wrote:
>> >> On Sun, Mar 27, 2016 at 02:09:14PM -0700, Paul E. McKenney wrote:
>> >>
>> >> > > Does that system have MONITOR/MWAIT errata?
>> >> >
>> >> > On the off-chance that this question was also directed at me,
>> >>
>> >> Hehe, it wasn't, however, since we're here..
>> >>
>> >> > here is
>> >> > what I am running on. I am running in a qemu/KVM virtual machine, in
>> >> > case that matters.
>> >>
>> >> Have you actually tried on real proper hardware? Does it still reproduce
>> >> there?
>> >
>> > Ross has, but I have not, given that I have a shared system on the one
>> > hand and a single-socket (four core, eight hardware thread) laptop on
>> > the other that has even longer reproduction times. The repeat-by is
>> > as follows:
>> >
>> > o Build a kernel with the following Kconfigs:
>> >
>> > CONFIG_SMP=y
>> > CONFIG_NR_CPUS=16
>> > CONFIG_PREEMPT_NONE=n
>> > CONFIG_PREEMPT_VOLUNTARY=n
>> > CONFIG_PREEMPT=y
>> > # This should result in CONFIG_PREEMPT_RCU=y
>> > CONFIG_HZ_PERIODIC=y
>> > CONFIG_NO_HZ_IDLE=n
>> > CONFIG_NO_HZ_FULL=n
>> > CONFIG_RCU_TRACE=y
>> > CONFIG_HOTPLUG_CPU=y
>> > CONFIG_RCU_FANOUT=2
>> > CONFIG_RCU_FANOUT_LEAF=2
>> > CONFIG_RCU_NOCB_CPU=n
>> > CONFIG_DEBUG_LOCK_ALLOC=n
>> > CONFIG_RCU_BOOST=y
>> > CONFIG_RCU_KTHREAD_PRIO=2
>> > CONFIG_DEBUG_OBJECTS_RCU_HEAD=n
>> > CONFIG_RCU_EXPERT=y
>> > CONFIG_RCU_TORTURE_TEST=y
>> > CONFIG_PRINTK_TIME=y
>> > CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
>> > CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
>> > CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
>> >
>> > If desired, you can instead build with CONFIG_RCU_TORTURE_TEST=m
>> > and modprobe/insmod the module manually.
>> >
>> > o Find a two-socket x86 system or larger, with at least 16 CPUs.
>> >
>> > o Boot the kernel with the following kernel boot parameters:
>> >
>> > rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30
>> >
>> > The onoff_holdoff is only needed for CONFIG_RCU_TORTURE_TEST=y.
>> > When manually setting up the module, you get the holdoff for
>> > free, courtesy of human timescales.
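(For reference, a rough sketch of one way to apply the Kconfig options
above and then boot with those parameters, assuming a kernel source tree
and the in-tree scripts/config helper; exact steps may vary locally:)

  # From the top of the kernel source tree, starting from a known config:
  make defconfig
  ./scripts/config \
      -e SMP -e PREEMPT -d PREEMPT_NONE -d PREEMPT_VOLUNTARY \
      -e HZ_PERIODIC -d NO_HZ_IDLE -d NO_HZ_FULL \
      -e RCU_EXPERT -e RCU_TRACE -e HOTPLUG_CPU -e RCU_BOOST \
      -d RCU_NOCB_CPU -d DEBUG_LOCK_ALLOC -d DEBUG_OBJECTS_RCU_HEAD \
      -e RCU_TORTURE_TEST -e RCU_TORTURE_TEST_SLOW_CLEANUP \
      -e RCU_TORTURE_TEST_SLOW_INIT -e RCU_TORTURE_TEST_SLOW_PREINIT \
      -e PRINTK_TIME \
      --set-val NR_CPUS 16 --set-val RCU_FANOUT 2 \
      --set-val RCU_FANOUT_LEAF 2 --set-val RCU_KTHREAD_PRIO 2
  make olddefconfig && make -j"$(nproc)"
  # Then boot the resulting kernel with:
  #   rcutorture.onoff_interval=1 rcutorture.onoff_holdoff=30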
>> >
>> > In the absence of instrumentation, I get failures usually within a
>> > couple of hours, though sometimes much longer. With instrumentation,
>> > the sky appears to be the limit. :-/
>> >
>> > Ross is running on bare metal with no CPU hotplug, so perhaps his setup
>> > is of more immediate interest. He is seeing the same symptoms that I am,
>> > namely a task being repeatedly awakened without actually coming out of
>> > TASK_INTERRUPTIBLE state, let alone running. As you pointed out earlier,
>> > he cannot be seeing the same bug that my crude patch suppresses, but
>> > given that I still see a few failures with that crude patch, it is quite
>> > possible that there is still a common bug.
>>
>> With respect to bare metal vs KVM guest, I've reported an issue with
>> inaccurate detection of the TSC as an unreliable time source on a
>> KVM guest. The basic setup is to overcommit the CPU use across the
>> entire host, thus leading to preemption of the guest. The guest TSC
>> watchdog then falsely assumes that the TSC is unreliable, because the
>> guest gets preempted for a long time (e.g. 0.5 second) between reading
>> the HPET and the TSC.
>>
>> Ref. http://lkml.iu.edu/hypermail/linux/kernel/1509.1/00379.html
>>
>> I'm wondering if what Paul is observing in the KVM setup might be
>> caused by long preemption by the host. One way to stress test this
>> is to run parallel kernel builds on the host (or in another guest)
>> while the guest is running, thus over-committing the CPU use.
>>
>> Thoughts ?
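(For concreteness, one rough way to generate that overcommit, assuming a
kernel tree is available on the host or in another guest; the path below
is only a placeholder:)

  # Keep every host CPU oversubscribed so that the guest's vCPUs are
  # regularly preempted.
  while :; do
      make -C /path/to/linux clean
      make -C /path/to/linux -j"$(( 2 * $(nproc) ))"
  done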
>
> If I run NO_HZ_FULL, I do get warnings about unstable time sources.
>
> And certainly guest VCPUs can be preempted. However, if they were
> preempted for the lengths of time I am seeing, I should also see
> softlockup warnings on the host, which I do not see.

Why would you see a softlockup warning on the host?

I expect the priority at which the KVM vCPUs run is much lower than
the priority of the RCU worker threads on the host. Therefore, you
might very well see long preemption delays for the KVM vCPUs while
the RCU worker threads run fine on the host kernel, because they have
a higher priority.

Am I missing something?
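
(A quick way to sanity-check that on the host is to compare the
scheduling class and priority of the qemu vCPU threads with those of the
host's RCU kthreads; a rough sketch, assuming procps ps and the usual
kthread names, which are worth verifying locally:)

  # Scheduling class (cls) and realtime priority (rtprio) of the qemu
  # vCPU threads on the host...
  ps -eLo pid,tid,comm,cls,rtprio | grep -i qemu
  # ...versus the host's RCU kthreads (rcu_sched, rcu_preempt, plus
  # rcuc/N and rcub/N if the host kernel has RCU_BOOST).
  ps -eLo pid,tid,comm,cls,rtprio | grep -E 'rcu(c|b)/|rcu_preempt|rcu_sched'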

Thanks,

Mathieu

>
> That said, perhaps I should cobble together something to force short
> repeated preemptions at the host level. Maybe that would get the
> reproduction rate sufficiently high to enable less-dainty debugging.
>
> Thanx, Paul
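
(One low-effort way to force such short repeated preemptions from the
host side might be CFS bandwidth throttling of the qemu process; a rough
sketch, assuming cgroup v1 with the cpu controller mounted in the usual
place, run as root, and with QEMU_PID standing in for the guest's qemu
process ID:)

  # Allow the group at most 5ms of CPU time per 10ms period, so the
  # guest's vCPU threads are regularly throttled/preempted on the host.
  mkdir /sys/fs/cgroup/cpu/vmthrottle
  echo 10000 > /sys/fs/cgroup/cpu/vmthrottle/cpu.cfs_period_us
  echo 5000 > /sys/fs/cgroup/cpu/vmthrottle/cpu.cfs_quota_us
  # QEMU_PID is a placeholder for the guest's qemu process ID.
  echo "$QEMU_PID" > /sys/fs/cgroup/cpu/vmthrottle/cgroup.procs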

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com