Re: [sched] 3cf68f7aae: WARNING:at_kernel/rcu/update.c:#rcutorture_sched_setaffinity

From: Waiman Long
Date: Fri Oct 07 2022 - 14:10:21 EST


On 9/25/22 22:51, kernel test robot wrote:
Hi Waiman,

While this report is for v9 patch, we found that the issue also happens
on v10 patch, so we send this report FYI. Thanks.

Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3cf68f7aaecb7f1482f3183a5925641632658667 ("[PATCH v9 2/7] sched: Use user_cpus_ptr for saving user provided cpumask in sched_setaffinity()")
url: https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/sched-Persistent-user-requested-affinity/20220917-023345
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 6879c2d3b96039ff1668b4328a4d0dd3ea952cff
patch link: https://lore.kernel.org/lkml/20220916183217.1172225-3-longman@xxxxxxxxxx

in testcase: rcutorture
version:
with following parameters:

runtime: 300s
test: cpuhotplug
torture_type: trivial

test-description: rcutorture is rcutorture kernel module load/unload test.
test-url: https://www.kernel.org/doc/Documentation/RCU/torture.txt
dl_task_check_affinity

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[ 95.904059][ T543] ------------[ cut here ]------------
[ 95.904506][ T543] rcutorture_sched_setaffinity: sched_setaffinity() returned -22
[ 95.905105][ T543] WARNING: CPU: 0 PID: 543 at kernel/rcu/update.c:495 rcutorture_sched_setaffinity+0x45/0x50
[ 95.905843][ T543] Modules linked in: rcutorture torture
[ 95.906279][ T543] CPU: 0 PID: 543 Comm: rcu_torture_fak Tainted: G W 6.0.0-rc5-00091-g3cf68f7aaecb #1
[ 95.907111][ T543] EIP: rcutorture_sched_setaffinity+0x45/0x50
[ 95.907640][ T543] Code: 94 d0 82 00 75 ee 89 44 24 08 b9 88 5e f9 81 b2 01 89 4c 24 04 c7 04 24 00 d9 5f 82 88 15 8a 94 d0 82 89 45 fc e8 db 4d dc 00 <0f> 0b 8b 45 fc 89 ec 5d c3 66 90 55 31 c0 80 3d dc 65 65 83 00 89
[ 95.909113][ T543] EAX: 0000003e EBX: 00000000 ECX: e756ff14 EDX: 00001794
[ 95.909666][ T543] ESI: ee381d00 EDI: 00000002 EBP: ed37ff44 ESP: ed37ff34
[ 95.910203][ T543] DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0068 EFLAGS: 00010282
[ 95.910816][ T543] CR0: 80050033 CR2: 77d40bb0 CR3: 06a9a000 CR4: 000406b0
[ 95.911357][ T543] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
[ 95.911895][ T543] DR6: fffe0ff0 DR7: 00000400
[ 95.912255][ T543] Call Trace:
[ 95.912529][ T543] synchronize_rcu_trivial+0x57/0xf0 [rcutorture]
[ 95.913034][ T543] rcu_torture_fakewriter+0x108/0x280 [rcutorture]
[ 95.913532][ T543] kthread+0xcd/0xf0
[ 95.913839][ T543] ? rcu_read_unlock+0x60/0x60 [rcutorture]
[ 95.914311][ T543] ? kthread_complete_and_exit+0x20/0x20
[ 95.914734][ T543] ret_from_fork+0x19/0x24
[ 95.915065][ T543] irq event stamp: 66091
[ 95.915386][ T543] hardirqs last enabled at (66099): [<810d5f79>] __up_console_sem+0x69/0x80
[ 95.916046][ T543] hardirqs last disabled at (66106): [<810d5f5d>] __up_console_sem+0x4d/0x80
[ 95.916711][ T543] softirqs last enabled at (66072): [<81f8e014>] __do_softirq+0x284/0x4a3
[ 95.917377][ T543] softirqs last disabled at (66063): [<8100b544>] do_softirq_own_stack+0x24/0x30
[ 95.918043][ T543] ---[ end trace 0000000000000000 ]---


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <yujie.liu@xxxxxxxxx>
| Link: https://lore.kernel.org/r/202209260935.35aecc5a-yujie.liu@xxxxxxxxx


To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc5-00091-g3cf68f7aaecb .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


The warning happens in the following code:

    490 long rcutorture_sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
    491 {
    492         int ret;
    493
    494         ret = sched_setaffinity(pid, in_mask);
    495         WARN_ONCE(ret, "%s: sched_setaffinity() returned %d\n", __func__, ret);
    496         return ret;
    497 }

    792 static void synchronize_rcu_trivial(void)
    793 {
    794         int cpu;
    795
    796         for_each_online_cpu(cpu) {
    797 rcutorture_sched_setaffinity(current->pid, cpumask_of(cpu));
    798                 WARN_ON_ONCE(raw_smp_processor_id() != cpu);
    799         }
    800 }

The original sched_setaffinity() code hide the failure due to racing in cpuset's cpus_allowed change. So it won't fail even if the new cpu being set is different from the mask given by the caller. The patch changes the code that will return a failure code instead. In this particular failure, the calling synchronize_rcu_trivial() should have triggered its WARN_ON_ONCE() check. Since this warning has been triggered before, it won't show up a second time. As a result, you have two warnings instead of one.

-Longman