WARNING at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330

From: Oleksandr Natalenko
Date: Sun Nov 15 2020 - 17:33:36 EST


Hi.

I've been running v5.10-rc3-rt7 for some time, and I came across this splat in dmesg:

```
[118769.951010] ------------[ cut here ]------------
[118769.951013] WARNING: CPU: 19 PID: 146 at kernel/sched/core.c:2013 migration_cpu_stop+0x2e3/0x330
[118769.951018] Modules linked in: uinput uas usb_storage blocklayoutdriver xt_mark ip6table_nat ip6table_filter ip6_tables rfcomm fuse rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace sunrpc nfs_ssc fscache iptable_nat xt_MASQUERADE nf_nat iptable_filter xt_comment nft_ct nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 cmac algif_hash algif_skcipher nf_tables af_alg snd_hda_codec_realtek nct6775 bnep tun nfnetlink hwmon_vid snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg iwlmvm soundwire_intel soundwire_generic_allocation soundwire_cadence nls_iso8859_1 nls_cp437 vfat edac_mce_amd snd_hda_codec fat eeepc_wmi snd_hda_core mac80211 uvcvideo kvm_amd asus_wmi libarc4 soundwire_bus btusb videobuf2_vmalloc btrtl videobuf2_memops battery btbcm sparse_keymap wmi_bmof snd_usb_audio mxm_wmi videobuf2_v4l2 btintel snd_usbmidi_lib videobuf2_common snd_hwdep iwlwifi snd_soc_core snd_rawmidi bluetooth kvm videodev snd_seq_device snd_compress joydev
[118769.951047] ecdh_generic ac97_bus irqbypass ecc snd_pcm_dmaengine input_leds mousedev mc crc16 r8169 rapl cfg80211 realtek sp5100_tco snd_pcm mdio_devres of_mdio k10temp i2c_piix4 snd_timer rfkill fixed_phy ipmi_devintf igb snd ipmi_msghandler libphy dca soundcore evdev tpm_crb mac_hid tpm_tis tpm_tis_core pinctrl_amd wmi acpi_cpufreq tcp_bbr vhost_vsock vmw_vsock_virtio_transport_common vhost vhost_iotlb vsock msr crypto_user ip_tables x_tables xfs dm_thin_pool dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_crypt cbc encrypted_keys trusted tpm hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid dm_mod raid10 md_mod crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd xhci_pci xhci_pci_renesas ccp cryptd ehci_pci glue_helper xhci_hcd ehci_hcd rng_core amdgpu gpu_sched ttm i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm agpgart
[118769.951079] CPU: 19 PID: 146 Comm: migration/19 Not tainted 5.10.0-pf0 #1
[118769.951080] Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 2311 10/16/2020
[118769.951081] Stopper: migration_cpu_stop+0x0/0x330 <- affine_move_task+0x42f/0x620
[118769.951083] RIP: 0010:migration_cpu_stop+0x2e3/0x330
[118769.951084] Code: ff ff 31 db 45 85 ed 0f 89 65 ff ff ff 8b b5 d0 0a 00 00 4c 89 ff e8 cc 43 ff ff 0f b6 d8 66 85 db 75 d8 0f 0b e9 f2 fd ff ff <0f> 0b e9 eb fd ff ff 44 89 ee 4c 89 ff e8 ab 43 ff ff 84 c0 0f 84
[118769.951085] RSP: 0018:ffffb58c806c7e50 EFLAGS: 00010046
[118769.951086] RAX: ffffa136e7a9c300 RBX: 0000000000000000 RCX: 0000000000000000
[118769.951086] RDX: 000000000000000d RSI: 0000000000000013 RDI: ffffa136e7a9bf80
[118769.951087] RBP: ffffa13d0eee99c0 R08: 000000000000002f R09: ffffa13d0ef29af0
[118769.951087] R10: 00000000000000ec R11: 000000000000016a R12: ffffb58c9118fdd0
[118769.951088] R13: 00000000ffffffff R14: ffffa136e7a9c820 R15: ffffa136e7a9bf80
[118769.951089] FS: 0000000000000000(0000) GS:ffffa13d0eec0000(0000) knlGS:0000000000000000
[118769.951089] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[118769.951090] CR2: 00007f687474a000 CR3: 000000021319c000 CR4: 0000000000350ee0
[118769.951090] Call Trace:
[118769.951094] ? set_cpus_allowed_ptr+0x10/0x10
[118769.951095] cpu_stopper_thread+0x89/0x130
[118769.951097] ? smpboot_register_percpu_thread+0xe0/0xe0
[118769.951099] smpboot_thread_fn+0x1d8/0x2c0
[118769.951100] kthread+0x190/0x1b0
[118769.951101] ? __kthread_init_worker+0x50/0x50
[118769.951102] ret_from_fork+0x22/0x30
[118769.951105] CPU: 19 PID: 146 Comm: migration/19 Not tainted 5.10.0-pf0 #1
[118769.951106] Hardware name: System manufacturer System Product Name/Pro WS X570-ACE, BIOS 2311 10/16/2020
[118769.951106] Stopper: migration_cpu_stop+0x0/0x330 <- affine_move_task+0x42f/0x620
[118769.951107] Call Trace:
[118769.951108] dump_stack+0x6d/0x88
[118769.951111] __warn.cold+0x24/0x3d
[118769.951113] ? migration_cpu_stop+0x2e3/0x330
[118769.951114] report_bug+0xd1/0x100
[118769.951116] handle_bug+0x3a/0xa0
[118769.951118] exc_invalid_op+0x15/0xd0
[118769.951119] asm_exc_invalid_op+0x12/0x20
[118769.951121] RIP: 0010:migration_cpu_stop+0x2e3/0x330
[118769.951122] Code: ff ff 31 db 45 85 ed 0f 89 65 ff ff ff 8b b5 d0 0a 00 00 4c 89 ff e8 cc 43 ff ff 0f b6 d8 66 85 db 75 d8 0f 0b e9 f2 fd ff ff <0f> 0b e9 eb fd ff ff 44 89 ee 4c 89 ff e8 ab 43 ff ff 84 c0 0f 84
[118769.951122] RSP: 0018:ffffb58c806c7e50 EFLAGS: 00010046
[118769.951123] RAX: ffffa136e7a9c300 RBX: 0000000000000000 RCX: 0000000000000000
[118769.951123] RDX: 000000000000000d RSI: 0000000000000013 RDI: ffffa136e7a9bf80
[118769.951124] RBP: ffffa13d0eee99c0 R08: 000000000000002f R09: ffffa13d0ef29af0
[118769.951124] R10: 00000000000000ec R11: 000000000000016a R12: ffffb58c9118fdd0
[118769.951124] R13: 00000000ffffffff R14: ffffa136e7a9c820 R15: ffffa136e7a9bf80
[118769.951126] ? set_cpus_allowed_ptr+0x10/0x10
[118769.951127] cpu_stopper_thread+0x89/0x130
[118769.951128] ? smpboot_register_percpu_thread+0xe0/0xe0
[118769.951129] smpboot_thread_fn+0x1d8/0x2c0
[118769.951130] kthread+0x190/0x1b0
[118769.951130] ? __kthread_init_worker+0x50/0x50
[118769.951131] ret_from_fork+0x22/0x30
[118769.951133] ---[ end trace 0000000000000002 ]---
```

The warning at kernel/sched/core.c:2013 corresponds to the following check in migration_cpu_stop():

```
2007		/*
2008		 * When this was migrate_enable() but we no longer have an
2009		 * @pending, a concurrent SCA 'fixed' things and we should be
2010		 * valid again. Nothing to do.
2011		 */
2012		if (!pending) {
2013			WARN_ON_ONCE(!is_cpu_allowed(p, cpu_of(rq)));
2014			goto out;
2015		}
```
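
To spell out what that check asserts, here is a minimal userspace sketch of the condition. It is only a toy model: `toy_task`, `toy_cpu_allowed()` and `toy_would_warn()` are hypothetical names, and the affinity test is reduced to a single 64-bit mask. As far as I understand, the real `is_cpu_allowed()` also takes CPU online/active state and the kind of task into account, so this is not the kernel's logic, just the shape of it.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a task: only the allowed-CPU mask is modelled. */
struct toy_task {
	uint64_t cpus_mask;	/* bit N set => task may run on CPU N */
};

/* Simplified analogue of is_cpu_allowed(): a bare mask test (assumption). */
static bool toy_cpu_allowed(const struct toy_task *p, unsigned int cpu)
{
	return cpu < 64 && (p->cpus_mask & (UINT64_C(1) << cpu)) != 0;
}

/*
 * Models lines 2012-2015 above: with no pending request left, the task
 * is expected to already sit on a CPU it is allowed to run on. Returns
 * true when the modelled WARN_ON_ONCE() condition would be true.
 */
static bool toy_would_warn(const struct toy_task *p, unsigned int cpu,
			   bool pending)
{
	if (!pending)
		return !toy_cpu_allowed(p, cpu);
	return false;	/* the pending path is not modelled here */
}
```

In this model, a task whose mask does not include CPU 19 while the stopper runs on CPU 19 with no pending request makes `toy_would_warn()` return true, which is the situation the splat reports. The mask value itself is made up for illustration; I don't know what the real affinity was at the time.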

I'm not sure what triggered this, and the system still seems usable afterwards. I don't know how to reproduce it at the moment, so this is just a heads-up in case you know what could have gone wrong.

Thanks.

--
Oleksandr Natalenko (post-factum)