WARNING at net/mac80211/sta_info.c:1057 (__sta_info_destroy_part2())

From: Linus Torvalds
Date: Wed Sep 11 2019 - 06:06:08 EST


So I'm at LCA, reading email, using my laptop more than I normally do,
and with different networking than I normally do.

And I just had a 802.11 WARN_ON() trigger, followed by essentially a
dead machine due to some lock held (maybe rtnl_lock).

It's possible that the lock held thing happened before, and is the
_reason_ for the delay, I don't know. I had to reboot the machine, but
I gathered as much information as made sense and was obvious before I
did so. That's appended.

Linus

---
Selected dmesg from the WARN_ON_ONCE() and just before, followed by
info from "sysrq-w":

Previous resume looks normal:

PM: suspend exit
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 2/3)
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 3/3)
wlp2s0: authentication with 54:ec:2f:05:70:2c timed out
wlp2s0: authenticate with 54:ec:2f:05:70:28
wlp2s0: send auth to 54:ec:2f:05:70:28 (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:28 (try 1/3)
wlp2s0: RX AssocResp from 54:ec:2f:05:70:28 (capab=0x1431 status=0 aid=1)
wlp2s0: associated
wlp2s0: Limiting TX power to 20 (20 - 0) dBm as advertised by
54:ec:2f:05:70:28
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready
wlp2s0: disconnect from AP 54:ec:2f:05:70:28 for new auth to
54:ec:2f:05:70:2c
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: RX ReassocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
wlp2s0: associated
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c

Another _almost_ successful suspend/resume:

PM: suspend entry (s2idle)
Filesystems sync: 0.028 seconds
Freezing user space processes ... (elapsed 0.126 seconds) done.
OOM killer disabled.
Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
printk: Suspending console(s) (use no_console_suspend to debug)
wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
OOM killer enabled.
Restarting tasks ... done.
PM: suspend exit
ath10k_pci 0000:02:00.0: UART prints enabled
ath10k_pci 0000:02:00.0: unsupported HTC service id: 1536
wlp2s0: authenticate with 54:ec:2f:05:70:2c
wlp2s0: send auth to 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: authenticated
wlp2s0: associate with 54:ec:2f:05:70:2c (try 1/3)
wlp2s0: RX AssocResp from 54:ec:2f:05:70:2c (capab=0x1011 status=0 aid=2)
wlp2s0: associated
wlp2s0: Limiting TX power to 23 (23 - 0) dBm as advertised by
54:ec:2f:05:70:2c
ath: EEPROM regdomain: 0x826c
ath: EEPROM indicates we should expect a country code
ath: doing EEPROM country->regdmn map search
ath: country maps to regdmn code: 0x37
ath: Country alpha2 being used: PT
ath: Regpair used: 0x37
ath: regdomain 0x826c dynamically updated by country element
IPv6: ADDRCONF(NETDEV_CHANGE): wlp2s0: link becomes ready

I say _almost_, because I don't see the "No TX power limit" for the
country lookup (yes, Portugal) this time?

But wait!

... then 10+ minutes later:

ath10k_pci 0000:02:00.0: wmi command 16387 timeout, restarting hardware
ath10k_pci 0000:02:00.0: failed to set 5g txpower 23: -11
ath10k_pci 0000:02:00.0: failed to setup tx power 23: -11
ath10k_pci 0000:02:00.0: failed to recalc tx power: -11
ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108

That certainly looks like something did try to set a power limit, but
eventually failed.

Immediately after that:

wlp2s0: deauthenticating from 54:ec:2f:05:70:2c by local choice
(Reason: 3=DEAUTH_LEAVING)
ath10k_pci 0000:02:00.0: failed to read hi_board_data address: -16
ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
ath10k_pci 0000:02:00.0: failed to receive initialized event from
target: 00000000
ath10k_pci 0000:02:00.0: failed to wait for target init: -110
ieee80211 phy0: Hardware restart was requested
ath10k_pci 0000:02:00.0: failed to set inactivity time for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108
ath10k_pci 0000:02:00.0: failed to set PS Mode 0 for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to setup powersave: -108
ath10k_pci 0000:02:00.0: failed to setup ps on vdev 0: -108
ath10k_pci 0000:02:00.0: failed to flush transmit queue (skip 1
ar-state 2): 5000
ath10k_pci 0000:02:00.0: failed to delete peer 54:ec:2f:05:70:2c
for vdev 0: -108

and this then results in:

WARNING: CPU: 4 PID: 1246 at net/mac80211/sta_info.c:1057
__sta_info_destroy_part2+0x147/0x150 [mac80211]
Modules linked in: ccm fuse rfcomm xt_CHECKSUM xt_MASQUERADE tun
bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT
ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat
ip6table_nat ip6table_mangle ip6table_raw ip6table_security
iptable_nat nf_nat iptable_mangle iptable_raw iptable_security
nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ip_set nfnetlink
ebtable_filter ebtables ip6table_filter ip6_tables cmac bnep vfat fat
hid_multitouch iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi
snd_soc_skl snd_soc_hdac_hda snd_hda_ext_core snd_hda_codec_realtek
snd_soc_core snd_hda_codec_generic snd_soc_acpi_intel_match
snd_soc_acpi snd_soc_skl_ipc snd_soc_sst_ipc snd_soc_sst_dsp
dell_laptop snd_hda_intel ledtrig_audio dell_wmi dell_smbios
x86_pkg_temp_thermal snd_hda_codec ath10k_pci intel_powerclamp
wmi_bmof dell_wmi_descriptor intel_wmi_thunderbolt snd_hwdep coretemp
ath10k_core snd_hda_core kvm_intel snd_seq ath snd_seq_device kvm
btusb btrtl
intel_rapl_msr btbcm btintel mac80211 irqbypass snd_pcm bluetooth
intel_cstate dcdbas intel_uncore intel_rapl_perf joydev snd_timer snd
ecdh_generic i2c_i801 cfg80211 ecc idma64 soundcore rtsx_pci_ms
processor_thermal_device mei_me memstick rfkill libarc4 intel_lpss_pci
ucsi_acpi intel_soc_dts_iosf mei typec_ucsi intel_lpss
intel_pch_thermal intel_rapl_common typec wmi int3403_thermal
int340x_thermal_zone intel_hid sparse_keymap int3400_thermal acpi_pad
acpi_thermal_rel dm_crypt i915 crct10dif_pclmul crc32_pclmul
crc32c_intel rtsx_pci_sdmmc mmc_core i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt nvme fb_sys_fops ghash_clmulni_intel
drm serio_raw nvme_core rtsx_pci i2c_hid pinctrl_cannonlake video
pinctrl_intel
CPU: 4 PID: 1246 Comm: NetworkManager Not tainted
5.3.0-rc8-00004-g56037cadf604-dirty #5
Hardware name: Dell Inc. XPS 13 9380/0KTW76, BIOS 1.5.0 06/03/2019
RIP: 0010:__sta_info_destroy_part2+0x147/0x150 [mac80211]
Code: bd 0c 01 00 00 00 0f 84 4e ff ff ff 45 31 c0 b9 01 00 00 00
48 89 ea 48 89 de 4c 89 e7 e8 e1 aa ff ff 85 c0 0f 84 30 ff ff ff <0f>
0b e9 29 ff ff ff 66 90 0f 1f 44 00 00 41 54 55 48 89 fd e8 40
RSP: 0018:ffff98fc00daba90 EFLAGS: 00010282
RAX: 00000000ffffff94 RBX: ffff8e1355ad2880 RCX: 0000000000000000
RDX: ffff8e1355901ec0 RSI: 0000000000000000 RDI: ffff8e1357902e38
RBP: ffff8e135bb08000 R08: 0000000000000000 R09: 0000000000000574
R10: 000000000001cc5c R11: 0000000000000000 R12: ffff8e13579007a0
R13: ffff8e1355ad2880 R14: 0000000000000001 R15: ffff8e1357900d58
FS: 00007fbd99372bc0(0000) GS:ffff8e135e500000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00001cdbd7408000 CR3: 00000004708a2001 CR4: 00000000003606e0
Call Trace:
__sta_info_flush+0x125/0x170 [mac80211]
ieee80211_set_disassoc+0xc2/0x590 [mac80211]
ieee80211_mgd_deauth.cold+0x4a/0x1b8 [mac80211]
cfg80211_mlme_deauth+0xb3/0x1d0 [cfg80211]
? startup_64+0x3/0x30
cfg80211_mlme_down+0x66/0x90 [cfg80211]
cfg80211_disconnect+0x129/0x1e0 [cfg80211]
cfg80211_leave+0x27/0x40 [cfg80211]
cfg80211_netdev_notifier_call+0x1a7/0x4e0 [cfg80211]
? ieee80211_iter_chan_contexts_atomic+0x3e/0x50 [mac80211]
? ath10k_monitor_recalc+0x38f/0x5d0 [ath10k_core]
? __schedule+0x143/0x660
? ath10k_config_ps+0x4e/0x70 [ath10k_core]
? inetdev_event+0x46/0x560
notifier_call_chain+0x4c/0x70
__dev_close_many+0x57/0x100
dev_close_many+0x8d/0x140
dev_close.part.0+0x44/0x70
cfg80211_shutdown_all_interfaces+0x71/0xd0 [cfg80211]
cfg80211_rfkill_set_block+0x22/0x30 [cfg80211]
rfkill_set_block+0x92/0x140 [rfkill]
rfkill_fop_write+0x11f/0x1c0 [rfkill]
vfs_write+0xb6/0x1a0
ksys_write+0xa7/0xe0
do_syscall_64+0x59/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7fbd9a36c92f
Code: 89 54 24 18 48 89 74 24 10 89 7c 24 08 e8 19 fd ff ff 48 8b
54 24 18 48 8b 74 24 10 41 89 c0 8b 7c 24 08 b8 01 00 00 00 0f 05 <48>
3d 00 f0 ff ff 77 2d 44 89 c7 48 89 44 24 08 e8 4c fd ff ff 48
RSP: 002b:00007fff9a1ee220 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000055f18ef5eaad RCX: 00007fbd9a36c92f
RDX: 0000000000000008 RSI: 00007fff9a1ee258 RDI: 0000000000000019
RBP: 0000000000000019 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000293 R12: 000055f190502090
R13: 0000000000000000 R14: 0000000000000000 R15: 000055f1904ffbb0
---[ end trace 4d7c87877e20badd ]---
ath10k_pci 0000:02:00.0: failed to recalculate rts/cts prot for vdev 0: -108
ath10k_pci 0000:02:00.0: failed to set cts protection for vdev 0: -108

and it looks like it leaves some lock held.

Because after this, nothing network-related stops working, and the
machine eventually dies because workqueues don't make any progress (due
to rtnl_lock, I suspect):

Workqueue: events_freezable ieee80211_restart_work [mac80211]
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
ieee80211_restart_work+0x54/0xe0 [mac80211]
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40

Workqueue: events disconnect_work [cfg80211]
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
disconnect_work+0x12/0xd0 [cfg80211]
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40
kworker/3:0 D 0 6313 2 0x80004000


Workqueue: events linkwatch_event
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
linkwatch_event+0xa/0x30
process_one_work+0x1cf/0x370
worker_thread+0x4a/0x3c0
kthread+0xfb/0x130
ret_from_fork+0x35/0x40

ping D 0 6619 6568 0x80000002
Call Trace:
schedule+0x39/0xa0
schedule_preempt_disabled+0xa/0x10
__mutex_lock.isra.0+0x263/0x4b0
ip6mr_sk_done+0x2e/0xd0
rawv6_close+0x19/0x30
inet_release+0x34/0x60
__sock_release+0x3d/0xa0
sock_close+0x11/0x20
__fput+0xc1/0x250
task_work_run+0x81/0xa0
do_exit+0x2e5/0xaa0
do_group_exit+0x3a/0x90
__x64_sys_exit_group+0x14/0x20
do_syscall_64+0x59/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9

gdbus D 0 6629 1949 0x80004000
Call Trace:
schedule+0x39/0xa0
do_exit+0x1da/0xaa0
do_group_exit+0x3a/0x90
get_signal+0x14f/0x830
do_signal+0x36/0x620
exit_to_usermode_loop+0x6f/0xf0
do_syscall_64+0x17d/0x1c0
entry_SYSCALL_64_after_hwframe+0x44/0xa9