Re: rcutorture: meaning of "End of test: RCU_HOTPLUG"

From: Su Yue
Date: Tue Jan 22 2019 - 03:33:45 EST


Thanks for your quick reply, Paul!

On 1/22/19 12:01 PM, Paul E. McKenney wrote:
On Tue, Jan 22, 2019 at 11:40:53AM +0800, Su Yue wrote:
Hi, guys
While running rcutorture tests with "onoff_interval" set, some tests
failed with results like:

=====================================================================
[ 316.354501] srcud-torture:--- End of test: RCU_HOTPLUG:
nreaders=1 nfakewriters=4 stat_interval=60 verbose=2
test_no_idle_hz=1 shuffle_interval=3 stutter=5 irqreader=1
fqs_duration=0 fqs_holdoff=0 fqs_stutter=3 test_boost=1/0
test_boost_interval=7 test_boost_duration=4 shutdown_secs=0
stall_cpu=0 stall_cpu_holdoff=10 stall_cpu_irqsoff=0
n_barrier_cbs=0 onoff_interval=3 onoff_holdoff=0
====================================================================

I am wondering about the meaning of "RCU_HOTPLUG". Is it expected because
CPU hotplug is enabled in the test? Or does it represent another type of
failure?

This says that at least one CPU hotplug operation failed, that is,
the CPU didn't actually come online or go offline as requested. If you
are introducing CPU hotplug to an architecture, this usually indicates
that you have bugs in your CPU-hotplug code. Or it might be that

It should be the former case, since there are no RCU CPU stall warnings.

RCU grace periods failed to progress -- though this would normally
also result in RCU CPU stall warnings.

There should be lines containing "ver:" in your console output. What
does one of the later ones say?


The line says:
======================================================================
[ 318.850175] busted_srcud-torture: rtc: (null) ver: 27040 tfle: 0 rta: 27040 rtaf: 0 rtf: 27027 rtmbe: 0 rtbe: 0 rtbke: 0 rtbre: 0 rtbf: 0 rtb: 0 \
nt: 9497 onoff: 2639/2639:2640/5310 40,373:10,355 162868:67542 (HZ=1000) barrier: 0/0:0

=====================================================================
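
For reading the "onoff:" field in that line: as far as I can tell from the stats code in kernel/torture.c, the first four numbers are online successes/attempts followed by offline successes/attempts. That field order is my assumption, not something confirmed in this thread. Below is a minimal userspace sketch that pulls those counters out of a stats line; struct onoff_counts, parse_onoff() and onoff_failures() are names made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the first four "onoff:" counters. */
struct onoff_counts {
	long online_ok, online_tries;
	long offline_ok, offline_tries;
};

/*
 * Find "onoff:" in a stats line and parse the four counters.
 * Assumed order: online successes/attempts, then offline
 * successes/attempts.
 * Returns 0 on success, -1 if the field is missing or malformed.
 */
static int parse_onoff(const char *line, struct onoff_counts *c)
{
	const char *p = strstr(line, "onoff:");

	if (!p)
		return -1;
	if (sscanf(p, "onoff: %ld/%ld:%ld/%ld",
		   &c->online_ok, &c->online_tries,
		   &c->offline_ok, &c->offline_tries) != 4)
		return -1;
	return 0;
}

/* Number of failed hotplug operations of either kind. */
static long onoff_failures(const struct onoff_counts *c)
{
	return (c->online_tries - c->online_ok) +
	       (c->offline_tries - c->offline_ok);
}
```

Under that assumed field order, "2639/2639:2640/5310" means every online attempt succeeded, while 5310 - 2640 = 2670 offline attempts failed.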

And here are the relevant errors:
=====================================================================
kern :info : [ 135.379693] KVM setup async PF for cpu 1
kern :info : [ 135.381412] kvm-stealtime: cpu 1, msr 23fd16180
kern :alert : [ 135.386897] busted_srcud-torture:torture_onoff task: onlined 1
kern :alert : [ 135.408241] busted_srcud-torture:torture_onoff task: offlining 1
kern :info : [ 135.423310] Unregister pv shared memory for cpu 1
kern :info : [ 135.427940] smpboot: CPU 1 is now offline
kern :alert : [ 135.430106] busted_srcud-torture:torture_onoff task: offlined 1
kern :alert : [ 135.436404] busted_srcud-torture:torture_onoff task: offlining 0
kern :alert : [ 135.446173] busted_srcud-torture:torture_onoff task: offline 0 failed: errno -16
kern :alert : [ 135.453076] busted_srcud-torture:torture_onoff task: offlining 0
kern :alert : [ 135.457461] busted_srcud-torture:torture_onoff task: offline 0 failed: errno -16


=====================================================================
There are only two CPUs on the VM. The torture test tries to offline the
last online one, but gets -EBUSY.

I spent some time reading kernel/torture.c.
Here is the loop in torture_onoff():

    while (!torture_must_stop()) {
        cpu = (torture_random(&rand) >> 4) % (maxcpu + 1);
        if (!torture_offline(cpu,
                             &n_offline_attempts, &n_offline_successes,
                             &sum_offline, &min_offline, &max_offline))
            torture_online(cpu,
                           &n_online_attempts, &n_online_successes,
                           &sum_online, &min_online, &max_online);
        schedule_timeout_interruptible(onoff_interval);
    }

torture_offline() and torture_online() don't check beforehand whether
the chosen CPU is the only one still online.
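
To make the missing check concrete, here is a minimal userspace sketch of an offline helper that refuses to touch the last online CPU and does not count that refusal as an attempt. It is a toy model only, not kernel code and not a proposed patch: cpu_is_online[], count_online() and guarded_offline() are hypothetical names, and NCPUS is set to 2 just to mirror the VM above.

```c
#include <stdbool.h>

#define NCPUS 2	/* mirrors the two-CPU VM in this report */

/* Toy model of per-CPU online state; not kernel data structures. */
static bool cpu_is_online[NCPUS] = { true, true };

/* Count how many CPUs the toy model considers online. */
static int count_online(void)
{
	int i, n = 0;

	for (i = 0; i < NCPUS; i++)
		if (cpu_is_online[i])
			n++;
	return n;
}

/*
 * Hypothetical guarded offline.  Returns false, without bumping the
 * attempt counter, when the CPU is already offline or when it is the
 * last one online.  Otherwise it records one attempt and one success.
 */
static bool guarded_offline(int cpu, long *attempts, long *successes)
{
	if (!cpu_is_online[cpu] || count_online() <= 1)
		return false;
	(*attempts)++;
	cpu_is_online[cpu] = false;
	(*successes)++;
	return true;
}
```

With a guard like this, offlining CPU 1 and then asking for CPU 0 leaves attempts and successes both at 1, so the two counters stay equal.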

Our test machines are configured with CONFIG_BOOTPARAM_HOTPLUG_CPU0. If
there is only one online, hotpluggable CPU left, the offline attempt
fails, so n_offline_successes != n_offline_attempts, which causes "End of
test: RCU_HOTPLUG".
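
My reading of how the counters turn into that verdict, as a simplified userspace sketch: any mismatch between attempts and successes, for either online or offline, is treated as a hotplug failure. struct hotplug_counts, hotplug_failed() and end_of_test_tag() are hypothetical names, and the other end-of-test checks that rcutorture performs are deliberately left out.

```c
#include <stdbool.h>

/* Hypothetical mirror of the four hotplug counters discussed above. */
struct hotplug_counts {
	long online_attempts, online_successes;
	long offline_attempts, offline_successes;
};

/* True if any online or offline attempt did not succeed. */
static bool hotplug_failed(const struct hotplug_counts *c)
{
	return c->online_successes != c->online_attempts ||
	       c->offline_successes != c->offline_attempts;
}

/*
 * Simplified end-of-test tag: only the hotplug check is modeled here,
 * so every run without a counter mismatch is reported as "SUCCESS".
 */
static const char *end_of_test_tag(const struct hotplug_counts *c)
{
	return hotplug_failed(c) ? "RCU_HOTPLUG" : "SUCCESS";
}
```

A single failed attempt to offline the last CPU is enough to make the counters differ, which would explain the tag even though RCU itself behaved correctly.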

Did I misunderstand something above? Feel free to correct me.


Thanks,
Su

Thanx, Paul