[1.] One-line summary of the problem:
Outgoing ESP packets stop being transmitted after several weeks of uptime, even though IPsec tunnels remain established, incoming ESP packets are decrypted correctly, and XFRM states/policies remain valid.
[2.] Full description of the problem/report:
We maintain around 200 IPsec tunnels across approximately 100 remote sites using StrongSwan (IKEv2). All remote nodes connect to a central site that contains three HA clusters (each consisting of two HP servers configured with Corosync + Pacemaker).
The servers have more than 100 CPU cores and 128 GB+ RAM.
Every 3–4 weeks, one of the cluster nodes stops sending ESP packets.
Incoming encrypted ESP packets continue to arrive and are successfully decrypted. IKEv2 re-establishes the tunnels correctly, XFRM policies and states remain intact, routing tables are correct, and nothing unusual appears in dmesg.
However, **all outbound ESP drops to zero**.
Firewall counters confirm:
- ESP input: normal
- ESP output: zero during the failure state
Restarting the affected HA node triggers failover and temporarily resolves the issue.
### Additional observation (IMPORTANT):
We capture traffic every 15 minutes on all interfaces. In the two most recent incidents, immediately before the ESP output failure occurred, tcpdump mis-reported the input/output interface.
Instead of the correct interface (ETH3), tcpdump reported usb0 (the iLO interface). After I disabled usb0, tcpdump showed the in/out interface as unknown.
Interface counters confirm that usb0 carries almost no traffic, so the tcpdump interface attribution appears incorrect.
This raises the possibility of:
- an XFRM output path regression,
- an skb device pointer corruption,
- a routing decision inconsistency,
- or a driver-layer issue affecting interface reporting and ESP output.
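To tell whether the misattribution in the captures reflects a real change in the kernel's interface table, the ifindex-to-name mapping and per-interface packet counters could be logged next to each 15-minute capture. The following is only a minimal sketch, assuming Python 3 is available on the cluster nodes; `snapshot_interfaces` and `read_counter` are hypothetical helper names, not part of any existing tool.

```python
import json
import socket
import time
from pathlib import Path


def read_counter(ifname, counter):
    """Return one counter from /sys/class/net/<ifname>/statistics, or None."""
    path = Path("/sys/class/net") / ifname / "statistics" / counter
    try:
        return int(path.read_text())
    except (OSError, ValueError):
        # Interface vanished, sysfs not mounted, or unexpected content.
        return None


def snapshot_interfaces():
    """Record the ifindex -> name mapping and packet counters at one instant."""
    snapshot = {"time": time.time(), "interfaces": {}}
    for ifindex, ifname in socket.if_nameindex():
        snapshot["interfaces"][ifindex] = {
            "name": ifname,
            "rx_packets": read_counter(ifname, "rx_packets"),
            "tx_packets": read_counter(ifname, "tx_packets"),
        }
    return snapshot


if __name__ == "__main__":
    # One JSON line per run, so successive snapshots can be appended to a log.
    print(json.dumps(snapshot_interfaces(), sort_keys=True))
```

Comparing the snapshot taken just before a failure with an earlier one would show whether the index that tcpdump resolved to usb0 ever changed, and whether tx_packets on ETH3 stops increasing at the same moment.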
Upgrading from kernel **6.8.0-52 → 6.8.0-85** did not resolve the issue.
We would appreciate guidance on additional instrumentation or whether this matches any known recent regressions.
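One piece of instrumentation we could add ourselves is periodic sampling of the XFRM error counters in `/proc/net/xfrm_stat`, to see whether any output-side counter (for example `XfrmOutError` or `XfrmOutNoStates`) starts increasing when the failure begins. The following is a minimal sketch, assuming the kernel is built with `CONFIG_XFRM_STATISTICS` so that the file exists; the function names are hypothetical.

```python
import time
from pathlib import Path

XFRM_STAT = Path("/proc/net/xfrm_stat")


def parse_xfrm_stat(text):
    """Parse 'Name <whitespace> value' lines into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def changed_counters(before, after):
    """Return only the counters whose value differs between two samples."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] != before.get(name, 0)
    }


def sample_deltas(interval_seconds=60.0):
    """Read the counters twice, interval_seconds apart, and return the deltas."""
    before = parse_xfrm_stat(XFRM_STAT.read_text())
    time.sleep(interval_seconds)
    after = parse_xfrm_stat(XFRM_STAT.read_text())
    return changed_counters(before, after)
```

Running `sample_deltas()` on a schedule and logging any non-empty result would give a timeline of XFRM drops to correlate with the tcpdump captures. If all counters stay flat during the failure, the packets are presumably being lost outside the paths these counters cover.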
[3.] Keywords:
IPsec, XFRM, ESP, StrongSwan, routing, skb, tcpdump, network stack, HA cluster
[4.] Kernel information
[4.1.] Kernel version (/proc/version):
Linux version 6.8.0-85-generic (buildd@lcy02-amd64-024) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04.2) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #85~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Fri Sep 19 16:18:59 UTC 2
[4.2.] Kernel .config:
Attached.
[5.] Most recent kernel version which did not have the bug:
Unknown.
The issue is present in both:
- 6.8.0-52
- 6.8.0-85
[6.] Output of Oops messages:
None. No crashes or warnings in dmesg.
[7.] Example program/script to reproduce:
No minimal reproducer.
Issue appears after several weeks while handling ~200 active IPsec tunnels.
Periodic tcpdump + XFRM/SA dumps available upon request.
[8.] Environment
[8.1.] Software (ver_linux output):
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.5 LTS
Release: 22.04
Codename: jammy
[8.2.] Processor information (/proc/cpuinfo):
Intel(R) Xeon(R) Gold 6230R CPU @ 2.10GHz (104 cores)
[8.3.] Module information (/proc/modules):
...
[8.4.] Loaded driver and hardware information:
....
[8.5.] PCI information (lspci -vvv):
...
[8.6.] SCSI information (/proc/scsi/scsi):
....
[8.7.] Additional relevant information:
- ~200 IKEv2 tunnels via StrongSwan
- XFRM policies/states valid during failure
- Incoming ESP continues to decrypt
- Outgoing ESP stops completely
- tcpdump reports wrong interface (usb0 instead of ETH3) shortly before failure
- NIC is HP server onboard interface
- HA failover restores functionality temporarily
[X.] Other notes, patches, workarounds:
Restarting the affected node forces HA failover and restores traffic temporarily.
Kernel upgrade did not solve the issue.
StrongSwan logs show no IKE or CHILD_SA issues.