Linux-Kernel Archive: [BUG] Outgoing ESP traffic stops after several weeks (XFRM state correct, tunnels established, no ESP output)

[1.] One-line summary of the problem:

Outgoing ESP packets stop being transmitted after several weeks of uptime, even though IPsec tunnels remain established, incoming ESP packets are decrypted correctly, and XFRM states/policies remain valid.

[2.] Full description of the problem/report:

We maintain around 200 IPsec tunnels across approximately 100 remote sites using StrongSwan (IKEv2). All remote nodes connect to a central site that contains three HA clusters (each consisting of two HP servers configured with Corosync + Pacemaker).
The servers have more than 100 CPU cores and 128 GB+ RAM.

Every 3–4 weeks, one of the cluster nodes stops sending ESP packets.
Incoming encrypted ESP packets continue to arrive and are successfully decrypted. IKEv2 re-establishes the tunnels correctly, XFRM policies and states remain intact, routing tables are correct, and nothing unusual appears in dmesg.
However, **all outbound ESP drops to zero**.

Firewall counters confirm:
- ESP input: normal
- ESP output: zero during the failure state
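
A minimal sketch of how this condition could be flagged automatically, assuming two successive samples of the firewall's ESP packet counters are available. The function name, the `esp_in`/`esp_out` keys, and the sample values are hypothetical and only illustrate the check; they are not taken from our monitoring.

```python
def esp_output_stalled(prev, curr, min_in_delta=1):
    """Return True if ESP input advanced between two samples but ESP output did not.

    prev and curr are dicts with cumulative packet counters under the
    (hypothetical) keys "esp_in" and "esp_out".
    """
    in_delta = curr["esp_in"] - prev["esp_in"]
    out_delta = curr["esp_out"] - prev["esp_out"]
    return in_delta >= min_in_delta and out_delta == 0


if __name__ == "__main__":
    # Illustrative values only, not measurements from the affected nodes.
    before = {"esp_in": 1000, "esp_out": 900}
    after = {"esp_in": 1500, "esp_out": 900}
    print(esp_output_stalled(before, after))
```

Requiring input to advance avoids flagging an idle node where neither direction carries traffic.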

Restarting the affected HA node triggers failover and temporarily resolves the issue.

### Additional observation (IMPORTANT):
We capture traffic every 15 minutes on all interfaces. In the two most recent incidents, immediately before the ESP output failure occurred, tcpdump mis-reported the input/output interface.
Instead of the correct interface (ETH3), tcpdump reported usb0 (iLO); after I disabled usb0, it reported the in/out interface as unknown.
Interface counters confirm that usb0 carries almost no traffic, so the tcpdump interface attribution appears incorrect.
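
The counter comparison can be sketched as follows, assuming the standard `/proc/net/dev` layout (two header lines, then one line per interface with eight receive and eight transmit fields). The helper name `parse_net_dev` is hypothetical, and the script prints whatever interfaces exist on the machine it runs on.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: {"rx_packets": n, "tx_packets": n}}."""
    counters = {}
    for line in text.splitlines()[2:]:  # skip the two header lines
        if ":" not in line:
            continue
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a complete counter line
        counters[name.strip()] = {
            "rx_packets": int(fields[1]),  # second receive field
            "tx_packets": int(fields[9]),  # second transmit field
        }
    return counters


if __name__ == "__main__":
    try:
        with open("/proc/net/dev") as fh:
            stats = parse_net_dev(fh.read())
    except OSError:
        stats = {}  # not on Linux or file not readable
    for iface, c in sorted(stats.items()):
        print(iface, c["rx_packets"], c["tx_packets"])
```

Sampling this alongside each tcpdump capture would show whether packets attributed to usb0 are matched by any growth in that interface's own counters.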

This raises the possibility of:
- an XFRM output path regression,
- an skb device pointer corruption,
- a routing decision inconsistency,
- or a driver-layer issue affecting interface reporting and ESP output.

Upgrading from kernel **6.8.0-52 → 6.8.0-85** did not resolve the issue.

We would appreciate guidance on additional instrumentation or whether this matches any known recent regressions.

[3.] Keywords:

IPsec, XFRM, ESP, StrongSwan, routing, skb, tcpdump, network stack, HA cluster

[4.] Kernel information

[4.1.] Kernel version (/proc/version):
Linux version 6.8.0-85-generic (buildd@lcy02-amd64-024) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 16:18:59 UTC 2

[4.2.] Kernel .config:
Attached.

[5.] Most recent kernel version which did not have the bug:

Unknown.
The issue is present in both:
- 6.8.0-52
- 6.8.0-85

[6.] Output of Oops messages:

None. No crashes or warnings in dmesg.

[7.] Example program/script to reproduce:

No minimal reproducer.
Issue appears after several weeks while handling ~200 active IPsec tunnels.
Periodic tcpdump + XFRM/SA dumps available upon request.
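
As a possible addition to these periodic dumps, the sketch below diffs two snapshots of `/proc/net/xfrm_stat` to show which XFRM error counters moved. It assumes the kernel is built with `CONFIG_XFRM_STATISTICS` so that the file exists; the helper names and the sample values are hypothetical and are not data from the affected nodes.

```python
def parse_xfrm_stat(text):
    """Parse 'Name value' lines from /proc/net/xfrm_stat into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def changed_counters(prev, curr):
    """Return {name: delta} for every counter whose value differs between snapshots."""
    return {
        name: value - prev.get(name, 0)
        for name, value in curr.items()
        if value != prev.get(name, 0)
    }


if __name__ == "__main__":
    # Illustrative snapshots only; real ones would be read from /proc/net/xfrm_stat.
    earlier = parse_xfrm_stat("XfrmOutError 0\nXfrmOutNoStates 3\n")
    later = parse_xfrm_stat("XfrmOutError 0\nXfrmOutNoStates 8\n")
    print(changed_counters(earlier, later))
```

A rising `XfrmOut*` counter during the failure window would point at the XFRM output path, while flat counters would suggest the packets are lost after transformation.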

[8.] Environment

[8.1.] Software (ver_linux output):
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy

[8.2.] Processor information (/proc/cpuinfo):
Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz (104 cores)

[8.3.] Module information (/proc/modules):
...

[8.4.] Loaded driver and hardware information:
....

[8.5.] PCI information (lspci -vvv):
...

[8.6.] SCSI information (/proc/scsi/scsi):
....

[8.7.] Additional relevant information:

- ~200 IKEv2 tunnels via StrongSwan
- XFRM policies/states valid during failure
- Incoming ESP continues to decrypt
- Outgoing ESP stops completely
- tcpdump reports wrong interface (usb0 instead of ETH3) shortly before failure
- NIC is HP server onboard interface
- HA failover restores functionality temporarily

[X.] Other notes, patches, workarounds:

Restarting the affected node forces HA failover and restores traffic temporarily.
Kernel upgrade did not solve the issue.
StrongSwan logs show no IKE or CHILD_SA issues.