Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

From: Paul E. McKenney
Date: Thu Jan 24 2019 - 15:55:02 EST


On Thu, Jan 24, 2019 at 03:00:37PM +0800, Su Yue wrote:
> On 1/23/19 11:22 AM, Paul E. McKenney wrote:
> >On Tue, Jan 22, 2019 at 04:42:19PM +0800, Su Yue wrote:
> >>Thanks for your quick reply! Paul
> >>
> >>On 1/22/19 12:01 PM, Paul E. McKenney wrote:
> >>>On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
> >>>>Hi, guys
> >>>> While running rcutorture tests with "onoff_interval", some tests
> >>>>failed and results show like:
> >>>>
> >>>>=====================================================================
> >>>>[ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
> >>>>nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
> >>>>test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1 fq\
> >>>>s_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
> >>>>test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
> >>>>stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0 n_ba\
> >>>>rrier_cbs=0 onoff_interval=3 onoff_holdoff=0
> >>>>====================================================================
> >>>>
> >>>>I am wondering that meaning of "RCU_HOTPLUG". Is it expected because
> >>>>cpu hotplug is enabled in the test? Or just represents another type of
> >>>>failure?
> >>>
> >>>This says that at least one CPU hotplug operation failed, that is,
> >>>the CPU didn't actually come online or go offline as requested. If you
> >>>are introducing CPU hotplug to an architecture, this usually indicates
> >>>that you have bugs in your CPU-hotplug code. Or it nmight be that
> >>
> >>It should hit the case since there is no RCU CPU stall warnings.
> >>
> >>>RCU grace periods failed to progress -- though this would normally
> >>>also result in RCU CPU stall warnings.
> >>>
> >>>There should be lines containing "ver:" in your console output. What
> >>>does one of the later one of these say?
> >>>
> >>
> >>The line says:
> >>======================================================================
> >>[ 318.850175] busted_srcud-torture: rtc: (null) ver:
> >>27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke:
> >>0 rtbre: 0 rtbf: 0 rtb: 0 \
> >>nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542
> >>(HZ=1000) barrier: 0/0:0
> >
> >Yes, you have many more offline attempts than successes, which is
> >why RCU_HOTPLUG was printed.
> >
> >>=====================================================================
> >>
> >>And here are useful errors:
> >>=====================================================================
> >>kern :info : [ 135.379693] KVM setup async PF for cpu 1
> >>kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
> >>kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff
> >
> >Just so your know, busted_srcud can sometimes fail by design. Hence
> >the "busted" in the name. But failure didn't happen this time.
> >
>
> Yes..The corner case I mentioned actually happened in every "onoff"
> tests whatever the torture_type is.
>
> >>task: onlined 1
> >>kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff
> >>task: offlining 1
> >>kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
> >>kern :info : [ 135.427940] smpboot: CPU 1 is now offline
> >>kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff
> >>task: offlined 1
> >>kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff
> >>task: offlining 0
> >>kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff
> >>task: offline 0 failed: errno -16
> >>kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff
> >>task: offlining 0
> >>kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff
> >>task: offline 0 failed: errno -16
> >>
> >>
> >>=====================================================================
> >>There are only two CPUs on the VM. Torture try to offline the last one
> >>but -EBUSY occured.
> >>
> >>I spent time to understand kernel/torture.c.
> >>There is torture_onoff():
> >>
> >>225 while (!torture_must_stop()) {
> >>226 cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
> >>227 if (!torture_offline(cpu,
> >>228 &n_offline_attempts,
> >>&n_offline_successes,
> >>229 &sum_offline, &min_offline,
> >>&max_offline))
> >>230 torture_online(cpu,
> >>231 &n_online_attempts,
> >>&n_online_successes,
> >>232 &sum_online, &min_online,
> >>&max_online);
> >>233 schedule_timeout_interruptible(onoff_interval);
> >>234 }
> >>235
> >>
> >>torture_offline() and torture_offline() don't pre judge if the current
> >>cpu is only one usable.
> >
> >That does appear to be the case, and that would be a problem with
> >the CONFIG_BOOTPARAM_HOTPLUG_CPU0 listed below.
> >
> >Good catch!
> >
> >>Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
> >>there are only one oneline and hotplugable cpux, then
> >>n_offline_successes != n_offline_attempts which caused "End of test:
> >>RCU_HOTPLUG".
> >>
> >>Does I misunderstand something above? Feel free to correct me.
> >
> >Does the following patch help?
> >
>
> Yes, no more "errnor: -16" in dmesg and "End of test: SUCCESS" is in
> the end.
>
> Thanks for your patch.
> If the patch is to be sent in format, you can add:
>
> Tested-By: Su Yue <suy.fnst@xxxxxxxxxxxxxx>

Very good, applied! And especially thank you for finding this and
bringing it to my attention!!!

Thanx, Paul

> ---
> Su
> > Thanx, Paul
> >
> >------------------------------------------------------------------------
> >
> >diff --git a/kernel/torture.c b/kernel/torture.c
> >index a03ff722352b..2b6700ca2a43 100644
> >--- a/kernel/torture.c
> >+++ b/kernel/torture.c
> >@@ -101,6 +101,8 @@ bool torture_offline(int cpu, long *n_offl_attempts, long *n_offl_successes,
> > if (!cpu_online(cpu) || !cpu_is_hotpluggable(cpu))
> > return false;
> >+ if (num_online_cpus() <= 1)
> >+ return false; /* Can't offline the last CPU. */
> > if (verbose > 1)
> > pr_alert("%s" TORTURE_FLAG
> >
> >
> >
>
>