[PATCH v1 13/13] libceph: force host network namespace for kernel CephFS mounts

From: Ionut Nechita (Wind River)

Date: Thu Mar 12 2026 - 04:21:24 EST

From: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>

In containerized environments (e.g., Rook-Ceph CSI with
forcecephkernelclient=true), the mount() syscall
for kernel CephFS may be invoked from a pod's network namespace
instead of the host namespace. This happens despite the CSI node
plugin (csi-cephfsplugin) running with hostNetwork: true, due to
race conditions during kubelet restart or pod scheduling.

ceph_messenger_init() captures current->nsproxy->net_ns at mount
time and uses it for all subsequent socket operations. When a pod's
network namespace is captured, all kernel ceph sockets (mon, mds,
osd) are created in that namespace, which typically lacks routes to
the Ceph monitors (e.g., fd04:: ClusterIP addresses). Every
connection attempt then fails with EADDRNOTAVAIL (-99) in
ip6_dst_lookup_flow(), and the only recovery is a force-unmount
followed by a remount from the correct namespace.
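
For anyone trying to reproduce this, the namespace a mount would capture
can be checked from userspace before calling mount(). The following is a
minimal sketch, not part of this patch: same_netns() and in_host_netns()
are hypothetical helper names, and the comparison against /proc/1/ns/net
assumes the caller shares the host PID namespace and has permission to
inspect PID 1.

```c
#include <sys/stat.h>

/*
 * Compare two namespace handles such as /proc/<pid>/ns/net.
 * Returns 1 if both refer to the same namespace, 0 if they differ,
 * and -1 if either path cannot be inspected (for example, when
 * /proc/1/ns/net is read without sufficient privileges).
 */
int same_netns(const char *path_a, const char *path_b)
{
	struct stat a, b;

	/* stat() follows the nsfs link; st_dev + st_ino identify the namespace. */
	if (stat(path_a, &a) != 0 || stat(path_b, &b) != 0)
		return -1;

	return a.st_dev == b.st_dev && a.st_ino == b.st_ino;
}

/*
 * Hypothetical wrapper: is the calling thread in PID 1's network
 * namespace? Namespace membership is per thread, hence thread-self.
 * Assumption: PID 1 here is the host's init, not a container's.
 */
int in_host_netns(void)
{
	return same_netns("/proc/thread-self/ns/net", "/proc/1/ns/net");
}
```

A result of 0 from in_host_netns() immediately before mount() would
correspond to the failing case described above.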

The root cause was confirmed via kprobe tracing on
ip6_dst_lookup_flow(): the net pointer passed to the routing lookup
was the pod's net_ns (0xff367a0125dd5780) instead of init_net
(0xffffffffbda76940). The pod namespace had no route for fd04::/64
(the monitor ClusterIP range), while a userspace Python connect()
from the same host succeeded because it ran in the host namespace.
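
The userspace connect() comparison can also be done in C. This is a
sketch only, not the script used during debugging: probe_ipv6_connect()
is a hypothetical helper, and the address and port passed to it are up
to the caller. It returns a negative errno in the kernel style so the
result can be compared with what libceph logs.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Attempt a blocking TCP connect to an IPv6 literal from the caller's
 * network namespace. Returns 0 on success, -EINVAL if addr is not a
 * valid IPv6 literal, or -errno from socket()/connect().
 */
int probe_ipv6_connect(const char *addr, unsigned short port)
{
	struct sockaddr_in6 sa;
	int fd, ret = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sin6_family = AF_INET6;
	sa.sin6_port = htons(port);
	if (inet_pton(AF_INET6, addr, &sa.sin6_addr) != 1)
		return -EINVAL;

	fd = socket(AF_INET6, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	/* Save errno before close() has a chance to overwrite it. */
	if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) != 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Running the probe against a monitor address once from the host
namespace and once from the pod namespace shows whether the two
namespaces differ in reachability. The exact errno returned to
userspace may differ from the one the kernel client reports.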

Fix this by always using init_net (the host network namespace)
in ceph_messenger_init(). The kernel CephFS client inherently
requires host-level network access to reach Ceph monitors, OSDs,
and MDS daemons. Using the caller's namespace was inherited from
generic socket patterns but is incorrect for a kernel filesystem
client that must survive beyond the lifetime of the mounting
process and its network namespace.

A warning is logged when a mount from a non-init namespace is
detected, to aid debugging.

Observed in production (kernel 6.12.0-1-rt-amd64, Ceph Reef
18.2.5, IPv6-only cluster, ceph-csi v3.13.1):
- Fresh boot of compute-0; ceph-csi mounts CephFS via the kernel client
- All monitor connections fail immediately with EADDRNOTAVAIL
- kprobe confirms the wrong net_ns in ip6_dst_lookup_flow()
- Workaround: umount -l + systemctl restart kubelet
- After restart: mount captures the host namespace and works immediately

Signed-off-by: Ionut Nechita <ionut.nechita@xxxxxxxxxxxxx>
---
net/ceph/messenger.c | 27 ++++++++++++++++++++++++++-
1 file changed, 26 insertions(+), 1 deletion(-)

diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 8165e6a8fe092..a2e8ea6d339c9 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1791,7 +1791,32 @@ void ceph_messenger_init(struct ceph_messenger *msgr,

 	atomic_set(&msgr->stopping, 0);
 	atomic_set(&msgr->addr_notavail_count, 0);
-	write_pnet(&msgr->net, get_net(current->nsproxy->net_ns));
+
+	/*
+	 * Use the initial (host) network namespace instead of the
+	 * caller's current namespace. In containerized environments
+	 * (e.g., Rook-Ceph CSI with forcecephkernelclient=true), the
+	 * mount() syscall may be invoked from a pod's network namespace
+	 * even when the CSI plugin runs with hostNetwork: true (race
+	 * conditions during kubelet restart, pod scheduling, etc.).
+	 *
+	 * If the pod NS is captured here, all kernel ceph sockets will
+	 * be created in that NS, which typically lacks routes to the
+	 * Ceph monitors (e.g., fd04:: ClusterIP addresses). This causes
+	 * permanent EADDRNOTAVAIL on every connection attempt with no
+	 * possibility of recovery short of force-unmount + remount.
+	 *
+	 * The kernel CephFS client always needs host-level network
+	 * access to reach Ceph monitors, OSDs, and MDS daemons, so
+	 * using init_net is the correct choice. The previous behavior
+	 * of capturing current->nsproxy->net_ns was inherited from
+	 * generic socket code but is wrong for a kernel filesystem
+	 * client that must survive beyond the lifetime of the mounting
+	 * process's network namespace.
+	 */
+	if (current->nsproxy->net_ns != &init_net)
+		pr_warn("libceph: mount from non-init network namespace detected, using host namespace instead\n");
+	write_pnet(&msgr->net, get_net(&init_net));
 
 	dout("%s %p\n", __func__, msgr);
 }
--
2.53.0