[PATCH AUTOSEL 6.19-6.18] genirq/cpuhotplug: Notify about affinity changes breaking the affinity mask

From: Sasha Levin

Date: Wed Feb 11 2026 - 20:12:03 EST


From: Imran Khan <imran.f.khan@xxxxxxxxxx>

[ Upstream commit dd9f6d30c64001ca4dde973ac04d8d155e856743 ]

During CPU offlining the interrupts affined to that CPU are moved to other
online CPUs, which might break the original affinity mask if the outgoing
CPU was the last online CPU in that mask. This change is not propagated to
irq_desc::affinity_notify(), which leaves users of the affinity notifier
mechanism with stale information.

Avoid this by scheduling affinity change notification work for interrupts
that were affined to the CPU being offlined, if the new target CPU is not
part of the original affinity mask.

Since irq_set_affinity_locked() uses the same logic to schedule affinity
change notification work, split out this logic into a dedicated function
and use that at both places.

[ tglx: Removed the EXPORT(), removed the !SMP stub, moved the prototype,
added a lockdep assert instead of a comment, fixed up coding style
and name space. Polished and clarified the change log ]

Signed-off-by: Imran Khan <imran.f.khan@xxxxxxxxxx>
Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxx>
Link: https://patch.msgid.link/20260113143727.1041265-1-imran.f.khan@xxxxxxxxxx
Signed-off-by: Sasha Levin <sashal@xxxxxxxxxx>
---

LLM Generated explanations, may be completely bogus:


## Detailed Analysis

### 1. COMMIT MESSAGE ANALYSIS

The commit subject and body describe a clear, concrete bug: during CPU
offlining, interrupts are migrated away from the offlined CPU. When the
new target CPU is **not** part of the original affinity mask (i.e., the
affinity was "broken"), the `irq_desc::affinity_notify` mechanism is NOT
triggered. This leaves users of the affinity notifier (network drivers,
RFS/RPS, IB HFI1, etc.) with **stale CPU affinity information**.

Key details:
- Author: Imran Khan (Oracle), a recognized contributor
- Signed-off-by and edited by Thomas Gleixner (IRQ subsystem
maintainer), who personally polished the changelog and code
- The `[tglx: ...]` notes show significant maintainer review and
modification

### 2. CODE CHANGE ANALYSIS

The patch makes three changes across three files:

**a) `kernel/irq/manage.c` - Refactoring + bug fix enablement**

The inline notification logic in `irq_set_affinity_locked()`:

```c
/* kernel/irq/manage.c, irq_set_affinity_locked(), before the patch */
int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
			    bool force)
{
	/* ... elided ... */
	if (desc->affinity_notify) {
		kref_get(&desc->affinity_notify->kref);
		if (!schedule_work(&desc->affinity_notify->work)) {
			/* Work was already scheduled, drop our extra ref */
			kref_put(&desc->affinity_notify->kref,
				 desc->affinity_notify->release);
		}
	}
	/* ... */
```

is extracted into a new function `irq_affinity_schedule_notify_work()`,
and the original site is replaced by a call to this new function. This
is a pure refactoring at this call site -- no behavioral change.

The new function:

```c
void irq_affinity_schedule_notify_work(struct irq_desc *desc)
{
	lockdep_assert_held(&desc->lock);

	kref_get(&desc->affinity_notify->kref);
	if (!schedule_work(&desc->affinity_notify->work)) {
		/* Work was already scheduled, drop our extra ref */
		kref_put(&desc->affinity_notify->kref,
			 desc->affinity_notify->release);
	}
}
```

This is identical logic to what was inline, plus a
`lockdep_assert_held()` for safety.

**b) `kernel/irq/cpuhotplug.c` - The actual bug fix**

In `irq_migrate_all_off_this_cpu()`, the patch adds:

```c
scoped_guard(raw_spinlock, &desc->lock) {
	affinity_broken = migrate_one_irq(desc);
	if (affinity_broken && desc->affinity_notify)
		irq_affinity_schedule_notify_work(desc);
}
```

This is the core fix: when `migrate_one_irq()` returns `true` (affinity
was broken), and the descriptor has an affinity notifier registered,
**schedule the notification work**. This is done inside the lock scope,
which is correct since `irq_affinity_schedule_notify_work` asserts
`desc->lock` is held.

**c) `kernel/irq/internals.h` - Declaration**

Adds `extern void irq_affinity_schedule_notify_work(struct irq_desc *desc);` to the internal header.

### 3. BUG MECHANISM AND IMPACT

**Root cause:** The CPU hotplug IRQ migration path
(`irq_migrate_all_off_this_cpu` -> `migrate_one_irq` ->
`irq_do_set_affinity`) bypasses `irq_set_affinity_locked()` and calls
the lower-level `irq_do_set_affinity()` directly. The notification logic
was only in `irq_set_affinity_locked()`, so CPU hotplug IRQ migrations
**never** triggered affinity notifications.

**Who is affected:**

1. **`irq_cpu_rmap` (lib/cpu_rmap.c)** - Used by major network drivers
(bnxt, sfc, mlx5, mlx4, hns3, i40e, qede, enic) for Receive Flow
Steering (RFS). When CPU hotplug migrates IRQs, the rmap becomes
stale, causing incorrect RFS steering decisions. This has existed
since `cpuhotplug.c` was created in 2015.

2. **NAPI affinity tracking** (net/core/dev.c `netif_napi_irq_notify`) -
Newer mechanism (v6.15+) where NAPI tracks IRQ CPU affinity. Stale
data here means incorrect CPU affinity tracking after CPU offlining.

3. **InfiniBand HFI1 driver** (drivers/infiniband/hw/hfi1/affinity.c) -
Uses affinity notifiers for SDMA engine management.

4. **Various other network drivers** (ionic, i40e, funeth, bnxt,
qla2xxx, qedf) that directly use `irq_set_affinity_notifier()`.

**Real-world impact:** On systems doing CPU hotplug (common in cloud/VM
environments, power management, CPU isolation), network performance
degrades because RFS steering becomes incorrect. The CPU rmap points to
offline CPUs or wrong CPUs, causing suboptimal packet delivery.

### 4. CLASSIFICATION

This is a **real bug fix** -- not a feature, not cleanup. It fixes a
missing notification that causes stale data in multiple critical
subsystems (networking RFS, NAPI, InfiniBand).

### 5. SCOPE AND RISK ASSESSMENT

- **Lines changed:** ~30 lines (very small)
- **Files touched:** 3, all in `kernel/irq/` (contained)
- **Complexity:** Low. The new function is a direct extraction of
existing inline code. The new call site is a straightforward
conditional.
- **Risk of regression:** Very low. The notification work was already
being scheduled from `irq_set_affinity_locked()` using the identical
logic. Adding it to the CPU hotplug path uses the same proven
mechanism.
- **Subsystem:** `kernel/irq/` - core infrastructure, but the change is
very surgical

### 6. DEPENDENCY ANALYSIS

**Critical concern:** The patch as written uses
`scoped_guard(raw_spinlock, &desc->lock)` syntax, which was introduced
by commit `88a4df117ad66` ("genirq/cpuhotplug: Convert to lock guards")
merged in v6.16. Older stable trees (6.15 and below) use
`raw_spin_lock()`/`raw_spin_unlock()` directly, so **this patch would
not apply cleanly to those trees**.

For stable trees <= 6.15, a backport would need adaptation to use the
old locking style:

```c
raw_spin_lock(&desc->lock);
affinity_broken = migrate_one_irq(desc);
if (affinity_broken && desc->affinity_notify)
	irq_affinity_schedule_notify_work(desc);
raw_spin_unlock(&desc->lock);
```

This adaptation is trivial and mechanically straightforward.

### 7. STABILITY AND TESTING

- Thomas Gleixner (IRQ subsystem maintainer) personally reviewed,
edited, and applied this patch
- The logic reuses proven existing code (the kref_get + schedule_work +
kref_put pattern)
- The `lockdep_assert_held()` adds runtime verification in debug builds

### 8. CONCLUSION

This fixes a **real, longstanding bug** (since 2015) that causes stale
IRQ affinity information in the CPU hotplug path, affecting network
drivers using RFS (bnxt, sfc, mlx5, mlx4, hns3, i40e, qede, enic), NAPI
affinity tracking, and InfiniBand HFI1. The impact is concrete:
incorrect receive flow steering and CPU affinity tracking when CPUs go
offline, which is common in cloud, VM, and power-managed environments.

The fix is:
- Small and surgical (~30 lines)
- Obviously correct (reuses existing proven notification logic)
- Well-reviewed (Thomas Gleixner personally edited it)
- Low risk of regression
- Self-contained (no part of a series)
- Fixes a real user-visible issue (stale CPU affinity data)

The only concern is that it requires a trivial backport adaptation for
stable trees <= 6.15 due to the lock guard conversion. This is a
mechanical change.

**YES**

kernel/irq/cpuhotplug.c | 6 ++++--
kernel/irq/internals.h | 2 +-
kernel/irq/manage.c | 26 ++++++++++++++++++--------
3 files changed, 23 insertions(+), 11 deletions(-)

diff --git a/kernel/irq/cpuhotplug.c b/kernel/irq/cpuhotplug.c
index 755346ea98196..cd5689e383b00 100644
--- a/kernel/irq/cpuhotplug.c
+++ b/kernel/irq/cpuhotplug.c
@@ -177,9 +177,11 @@ void irq_migrate_all_off_this_cpu(void)
bool affinity_broken;

desc = irq_to_desc(irq);
- scoped_guard(raw_spinlock, &desc->lock)
+ scoped_guard(raw_spinlock, &desc->lock) {
affinity_broken = migrate_one_irq(desc);
-
+ if (affinity_broken && desc->affinity_notify)
+ irq_affinity_schedule_notify_work(desc);
+ }
if (affinity_broken) {
pr_debug_ratelimited("IRQ %u: no longer affine to CPU%u\n",
irq, smp_processor_id());
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index 0164ca48da59e..5568ed3a8b852 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -135,6 +135,7 @@ extern bool irq_can_set_affinity_usr(unsigned int irq);

extern int irq_do_set_affinity(struct irq_data *data,
const struct cpumask *dest, bool force);
+extern void irq_affinity_schedule_notify_work(struct irq_desc *desc);

#ifdef CONFIG_SMP
extern int irq_setup_affinity(struct irq_desc *desc);
@@ -142,7 +143,6 @@ extern int irq_setup_affinity(struct irq_desc *desc);
static inline int irq_setup_affinity(struct irq_desc *desc) { return 0; }
#endif

-
#define for_each_action_of_desc(desc, act) \
for (act = desc->action; act; act = act->next)

diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 349ae7979da0e..4873b0f73df96 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -347,6 +347,21 @@ static bool irq_set_affinity_deactivated(struct irq_data *data,
return true;
}

+/**
+ * irq_affinity_schedule_notify_work - Schedule work to notify about affinity change
+ * @desc: Interrupt descriptor whose affinity changed
+ */
+void irq_affinity_schedule_notify_work(struct irq_desc *desc)
+{
+ lockdep_assert_held(&desc->lock);
+
+ kref_get(&desc->affinity_notify->kref);
+ if (!schedule_work(&desc->affinity_notify->work)) {
+ /* Work was already scheduled, drop our extra ref */
+ kref_put(&desc->affinity_notify->kref, desc->affinity_notify->release);
+ }
+}
+
int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
bool force)
{
@@ -367,14 +382,9 @@ int irq_set_affinity_locked(struct irq_data *data, const struct cpumask *mask,
irq_copy_pending(desc, mask);
}

- if (desc->affinity_notify) {
- kref_get(&desc->affinity_notify->kref);
- if (!schedule_work(&desc->affinity_notify->work)) {
- /* Work was already scheduled, drop our extra ref */
- kref_put(&desc->affinity_notify->kref,
- desc->affinity_notify->release);
- }
- }
+ if (desc->affinity_notify)
+ irq_affinity_schedule_notify_work(desc);
+
irqd_set(data, IRQD_AFFINITY_SET);

return ret;
--
2.51.0