Re: [PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts

From: Ionut Nechita (Wind River)

Date: Thu Apr 02 2026 - 13:13:42 EST


Hi Ilya,

Following up with the additional data I promised. I reproduced the
issue on a fresh cluster and have concrete evidence of the namespace
problem.

Environment (4-node cluster, 2 controllers + 2 workers):

$ kubectl get nodes
NAME          STATUS  ROLES          KERNEL-VERSION
compute-0     Ready   <none>         6.12.0-1-rt-amd64
compute-1     Ready   <none>         6.12.0-1-rt-amd64
controller-0  Ready   control-plane  6.12.0-1-amd64
controller-1  Ready   control-plane  6.12.0-1-amd64

- OS: Debian GNU/Linux 11 (bullseye)
- Container runtime: containerd 1.7.27
- Kubernetes: v1.29.2
- Rook: v1.16.6
- ceph-csi: v3.13.1
- Ceph: 18.2.5 (Reef)
- Network: Calico + Multus, IPv6-only
- Pod CIDR: dead:beef::/64 (Calico, vxlanMode: Never)
- Service CIDR: fd04::/112
- CSI_FORCE_CEPHFS_KERNEL_CLIENT: true
- CSI_ENABLE_HOST_NETWORK: true

The scenario is a Rook-Ceph rolling upgrade (Ceph 18.2.2 -> 18.2.5).
During the upgrade, Rook recreates the CSI DaemonSet pods and various
Ceph daemon pods (MON, MDS, OSD). Kubelet then needs to remount
CephFS volumes for workload pods on the node.

After the upgrade, the kernel ceph client is stuck with permanent
EADDRNOTAVAIL (-99) on all monitor connections:

libceph: connect (1)[fd04::652b]:6789 error -99
libceph: mon0 (1)[fd04::652b]:6789 connect error

The monitors are Kubernetes ClusterIP services:

rook-ceph-mon-a  ClusterIP  fd04::652b  6789/TCP,3300/TCP
rook-ceph-mon-b  ClusterIP  fd04::c0e7  6789/TCP,3300/TCP
rook-ceph-mon-c  ClusterIP  fd04::1981  6789/TCP,3300/TCP

Here is the key evidence. The kernel ceph client debugfs status shows:

$ cat /sys/kernel/debug/ceph/*/status
instance: client.374328 (3)[dead:beef::a2bf:c94c:345d:bc66]:0

The source address dead:beef::a2bf:c94c:345d:bc66 is from the Calico
pod CIDR (dead:beef::/64). This address does NOT belong to any
currently running pod on the node. I enumerated all active CNI
namespaces:

$ for ns in $(ip netns list | awk '{print $1}'); do
      ip netns exec "$ns" ip -6 addr show | grep dead:beef
  done

...bc6d kube-sriov-cni-ds
...bc70 stx-centos
...bc73 rook-ceph-mon-a
...bc74 rook-ceph-crashcollector
...bc75 rook-ceph-exporter
...bc76 rook-ceph-mgr-c
...bc78 rook-ceph-osd-0

Address ...bc66 is not present in any existing namespace. The pod
that owned it was destroyed during the upgrade, and Calico removed
its veth interfaces during CNI cleanup.

Meanwhile, the CSI plugin pod is correctly in the host namespace:

$ kubectl exec csi-cephfsplugin-gdrqr -c csi-cephfsplugin \
-- readlink /proc/1/ns/net
net:[4026531840]

$ readlink /proc/1/ns/net # on host
net:[4026531840]

And from host userspace, connecting to the same ClusterIP monitors
works fine (goes through kube-proxy iptables DNAT):

$ python3 -c "import socket; s=socket.socket(socket.AF_INET6, \
socket.SOCK_STREAM); s.connect(('fd04::652b', 6789)); print('OK')"
OK

As expected, ping6 from the host fails, since kube-proxy DNATs only
the service's TCP ports, not ICMP:

$ ping6 -c1 fd04::652b
From fdff:719a:bf60:4008::46e icmp_seq=1 Destination unreachable: No route

So the situation is:
1. The kernel ceph client captured a pod network namespace at mount
time (source address from dead:beef::/64 proves this)
2. That pod was later destroyed during the upgrade
3. Calico tore down the veth interfaces in that namespace
4. The namespace persists (ref-counted by ceph) but has no
interfaces or routes -- it is a zombie namespace
5. All kernel ceph connect() calls fail with EADDRNOTAVAIL
6. No recovery is possible without force-unmount + remount
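For anyone who wants to check for this condition on their own nodes:
comparing each kernel client's debugfs source address against the
host's configured addresses catches it. A rough sketch (the debugfs
path and "instance:" line format are as shown above; the parsing is
approximate):

```shell
#!/bin/sh
# Flag kernel ceph instances whose source address is not configured
# on the host, i.e. the client is bound in some other (possibly
# zombie) network namespace.
for st in /sys/kernel/debug/ceph/*/status; do
    [ -r "$st" ] || continue
    # "instance: client.374328 (3)[dead:beef::...]:0" -> the [...] part
    addr=$(awk -F'[][]' '/^instance:/ {print $2; exit}' "$st")
    [ -n "$addr" ] || continue
    if ip -6 addr show | grep -qF "$addr"; then
        echo "ok      $st src=$addr"
    else
        echo "ZOMBIE? $st src=$addr (not a host address)"
    fi
done
```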

As you noted, this is the "orchestration tearing down the relevant
virtual network devices prematurely" scenario. The namespace is kept
alive by the ceph reference, but it becomes non-functional.

I'm still investigating exactly how mount.ceph ends up in a pod
namespace despite the CSI plugin having hostNetwork: true. I have a
monitoring script set up to capture the namespace of mount.ceph
processes during the next upgrade attempt. I suspect it happens
during the brief window when the old CSI pod is terminated and the
new one is not yet ready, but kubelet still attempts to mount
volumes. I'll follow up with that data.
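The monitoring approach is just a poll loop comparing each mount.ceph
process's netns link against the host's (pid 1), along these lines
(sketch; the interval and log format are arbitrary):

```shell
#!/bin/sh
# Log any mount.ceph process whose network namespace differs from
# the host's (pid 1). Polling interval is arbitrary.
host_ns=$(readlink /proc/1/ns/net)
while :; do
    for pid in $(pgrep -x mount.ceph); do
        ns=$(readlink "/proc/$pid/ns/net" 2>/dev/null) || continue
        if [ "$ns" != "$host_ns" ]; then
            echo "$(date -Iseconds) mount.ceph pid=$pid in $ns (host is $host_ns)"
        fi
    done
    sleep 0.2
done
```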

I've also filed this on the Ceph tracker:
https://tracker.ceph.com/issues/74897

Thanks,
Ionut