Re: 6.18.13 iwlwifi deadlock allocating cma while work-item is active.

From: Ben Greear

Date: Tue Mar 10 2026 - 15:21:51 EST


On 3/10/26 11:06, Tejun Heo wrote:
> Hello,
>
> Thanks for the detailed dump. One thing that doesn't look right is the
> number of pending work items on pool 22 (CPU 5). The pool reports 2 idle
> workers, yet there are 7+ work items sitting in the pending list across
> multiple workqueues. If the pool were making forward progress, those items
> would have been picked up by the idle workers. So, the pool itself seems to
> be stuck for some reason, and the cfg80211 mutex stall may be a consequence
> rather than the cause.
>
> Let's try using drgn on the crash dump. I'm attaching a prompt that you can
> feed to Claude (or any LLM with tool access to drgn). It contains workqueue
> internals documentation, drgn code snippets, and a systematic investigation
> procedure. The idea is:
>
> 1. Generate the crash dump when the deadlock is happening:
>
>     echo c > /proc/sysrq-trigger
>
> 2. After the crash kernel boots, create the dump file:
>
>     makedumpfile -c -d 31 /proc/vmcore /tmp/vmcore.dmp
>
> 3. Feed the attached prompt to Claude with drgn access to the dump. It
>    should produce a Markdown report with its findings that you can post
>    back here.
>
> This is a bit experimental, so let's see whether it works. Either way, the
> report should at least give us concrete data points to work with.
>
> Thanks.

Thanks for that. It will probably be a few days before I get back to debugging
that lockup, as I'm trying to get something ready for our internal release (using
the kthread work-around).

While working on another bug, I found evidence (but not yet proof) that the code below
can be called multiple times for the same object. I suspect this may be the cause of
the list corruption I'm tracking (my debugging logs and work-arounds are in the function below).

But could this work-item (re)initialization also explain the work-queue system going
weird? Switching to kthreads, which 'fixes' the problem for me,
really shouldn't make a difference to the code below, so it is probably
not related?
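
To make the question concrete, here is a minimal user-space sketch of what
re-initializing a queued item does to the queue it sits on. It assumes the
init helper resets the item's list entry unconditionally (as far as I can
tell wiphy_work_init() does this, but treat that as an assumption to check
against the tree). struct fake_work, fake_work_init() and walk_len() are
made-up names for illustration, not kernel API.

```c
#include <assert.h>	/* only needed for the usage check shown below */
#include <stdbool.h>
#include <stddef.h>

#define WALK_LIMIT 100	/* give up walking after this many entries */

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Hypothetical work item: a list entry plus a callback. */
struct fake_work {
	struct list_head entry;
	void (*func)(struct fake_work *work);
};

static void fake_work_fn(struct fake_work *work)
{
	(void)work;
}

/* Assumed behaviour of the init helper: reset the entry unconditionally. */
static void fake_work_init(struct fake_work *work,
			   void (*func)(struct fake_work *work))
{
	init_list_head(&work->entry);
	work->func = func;
}

/* Count entries walking forward from head, stopping at WALK_LIMIT. */
static int count_forward(const struct list_head *head)
{
	const struct list_head *p;
	int n = 0;

	for (p = head->next; p != head && n < WALK_LIMIT; p = p->next)
		n++;
	return n;
}

/*
 * Queue two items, optionally re-init the first while it is still
 * queued, then report how many entries a forward walk sees.
 */
int walk_len(bool reinit_while_queued)
{
	struct list_head queue;
	struct fake_work a, b;

	init_list_head(&queue);
	fake_work_init(&a, fake_work_fn);
	fake_work_init(&b, fake_work_fn);
	list_add_tail_node(&a.entry, &queue);
	list_add_tail_node(&b.entry, &queue);

	if (reinit_while_queued)
		fake_work_init(&a, fake_work_fn);	/* a now points at itself */

	return count_forward(&queue);
}
```

In this sketch, assert(walk_len(false) == 2) and
assert(walk_len(true) == WALK_LIMIT) both hold: after the re-init the queue
head still points at the first item, but that item points only at itself, so
a forward walk never gets back to the head. Whether the real work-queue code
can get into this state is exactly what I don't know yet.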


void ieee80211_link_init(struct ieee80211_sub_if_data *sdata,
			 int link_id,
			 struct ieee80211_link_data *link,
			 struct ieee80211_bss_conf *link_conf)
{
	struct ieee80211_local *local = sdata->local;
	bool deflink = link_id < 0;

	lockdep_assert_wiphy(local->hw.wiphy);

	if (link_id < 0)
		link_id = 0;

	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN) {
		struct ieee80211_sub_if_data *ap_bss;
		struct ieee80211_bss_conf *ap_bss_conf;

		ap_bss = container_of(sdata->bss,
				      struct ieee80211_sub_if_data, u.ap);
		ap_bss_conf = sdata_dereference(ap_bss->vif.link_conf[link_id],
						ap_bss);
		memcpy(link_conf, ap_bss_conf, sizeof(*link_conf));
	}

	link->sdata = sdata;
	link->link_id = link_id;
	link->conf = link_conf;
	link_conf->link_id = link_id;
	link_conf->vif = &sdata->vif;
	link->ap_power_level = IEEE80211_UNSET_POWER_LEVEL;
	link->user_power_level = sdata->local->user_power_level;
	link_conf->txpower = INT_MIN;

	wiphy_work_init(&link->csa.finalize_work,
			ieee80211_csa_finalize_work);
	wiphy_work_init(&link->color_change_finalize_work,
			ieee80211_color_change_finalize_work);
	wiphy_delayed_work_init(&link->color_collision_detect_work,
				ieee80211_color_collision_detection_work);
	/* I see some sort of list corruption where links don't get removed from chanctx
	 * lists. I think if we are in a list while here, that could cause it. deflink
	 * appears to have chance of doing that. So, remove from list first if
	 * it is indeed in one.
	 */
	if (WARN_ON_ONCE((link->assigned_chanctx_list.next != LIST_POISON1)
			 && (link->assigned_chanctx_list.next != link->assigned_chanctx_list.prev)
			 && (link->assigned_chanctx_list.next))) {
		sdata_err(sdata, "link-init: %d called while already in an assigned-chan-ctx list, clearing.\n",
			  link_id);
		list_del(&link->assigned_chanctx_list);
	}
	if (WARN_ON_ONCE((link->reserved_chanctx_list.next != LIST_POISON1)
			 && (link->reserved_chanctx_list.next != link->reserved_chanctx_list.prev)
			 && (link->reserved_chanctx_list.next))) {
		sdata_err(sdata, "link-init: %d called while already in a reserved-chan-ctx list, clearing.\n",
			  link_id);
		list_del(&link->reserved_chanctx_list);
	}

	INIT_LIST_HEAD(&link->assigned_chanctx_list);
	INIT_LIST_HEAD(&link->reserved_chanctx_list);
	wiphy_delayed_work_init(&link->dfs_cac_timer_work,
				ieee80211_dfs_cac_timer_work);

	if (!deflink) {
		switch (sdata->vif.type) {
		case NL80211_IFTYPE_AP:
		case NL80211_IFTYPE_AP_VLAN:
			ether_addr_copy(link_conf->addr,
					sdata->wdev.links[link_id].addr);
			link_conf->bssid = link_conf->addr;
			WARN_ON(!(sdata->wdev.valid_links & BIT(link_id)));
			break;
		case NL80211_IFTYPE_STATION:
			/* station sets the bssid in ieee80211_mgd_setup_link */
			break;
		default:
			WARN_ON(1);
		}

		ieee80211_link_debugfs_add(link);
	}

	rcu_assign_pointer(sdata->vif.link_conf[link_id], link_conf);
	rcu_assign_pointer(sdata->link[link_id], link);
}
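
One caveat on the "already in a list" checks in the function above: the
next != prev test looks like it misses the case where the link is the only
entry on a chanctx list, since both pointers then refer to the list head.
Below is a small user-space sketch of that case, next to a possible
alternative test. FAKE_LIST_POISON1, heuristic_in_list() and
node_is_linked() are illustrative names only (the real LIST_POISON1 value is
arch-dependent), and the alternative assumes a node is always zeroed,
poisoned by list_del(), or self-pointing when it is not on a list.

```c
#include <assert.h>	/* only needed for the usage check shown below */
#include <stdbool.h>
#include <stddef.h>

/* User-space stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Stand-in value for illustration; not the kernel's definition. */
#define FAKE_LIST_POISON1 ((struct list_head *)0x100)

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static void list_add_tail_node(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}

/* Same three conditions as the debugging check in the function above. */
static bool heuristic_in_list(const struct list_head *n)
{
	return n->next != FAKE_LIST_POISON1 &&
	       n->next != n->prev &&
	       n->next != NULL;
}

/*
 * Hypothetical alternative: treat the node as linked when next is set,
 * not poisoned, and not pointing back at the node itself.
 */
static bool node_is_linked(const struct list_head *n)
{
	return n->next != NULL &&
	       n->next != FAKE_LIST_POISON1 &&
	       n->next != n;
}

/* Put one zeroed node on an otherwise empty list and run one of the tests. */
bool sole_entry_detected(bool use_heuristic)
{
	struct list_head head;
	struct list_head node = { NULL, NULL };

	init_list_head(&head);
	list_add_tail_node(&node, &head);	/* node.next == node.prev == &head */

	return use_heuristic ? heuristic_in_list(&node) : node_is_linked(&node);
}
```

Here assert(!sole_entry_detected(true)) and assert(sole_entry_detected(false))
both hold: the next != prev test reports the sole entry as not linked, while
comparing next against the node itself catches it. With two or more entries
on the list the two tests agree.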


Thanks,
Ben

--
Ben Greear <greearb@xxxxxxxxxxxxxxx>
Candela Technologies Inc http://www.candelatech.com