Find no outgoing routing table entry for CIFS reconnect?

From: Dexuan Cui
Date: Mon Jan 10 2022 - 20:29:31 EST


Hi, all,
I'm investigating a Linux networking issue: inside a Linux container, the
Linux network stack fails to find an outgoing routing table entry for the
CIFS module's TCP request; however, inside the same container, I'm able to
connect to the same CIFS server by "telnet cifs-server 445"! I think the
kernel CIFS module and the userspace "telnet" program should share the
same network namespace in the same container, so they should be using the
same routing table? It's unclear why the CIFS-initiated outgoing TCP
connect fails to find a routing table entry. Does anyone happen to know
of such a bug?
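
For anyone who wants to repeat the userspace side of this comparison without
telnet, here is a minimal sketch of an equivalent check. The helper name
probe_tcp_connect() is hypothetical, and it only accepts a dotted-quad IPv4
address (no name resolution), so "cifs-server" would have to be replaced by
its IP. It returns 0 or a negative errno, so a missing route would be
expected to show up as -ENETUNREACH:

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of any existing tool): try a blocking TCP
 * connect to ipv4:port from the calling process's network namespace.
 * Returns 0 on success or a negative errno value on failure.
 */
int probe_tcp_connect(const char *ipv4, uint16_t port)
{
	struct sockaddr_in addr;
	int fd, ret = 0;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_port = htons(port);

	/* inet_pton() returns 1 only for a valid dotted-quad address. */
	if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1)
		return -EINVAL;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -errno;

	/* Save errno before close() has a chance to overwrite it. */
	if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
		ret = -errno;

	close(fd);
	return ret;
}
```

Calling probe_tcp_connect("10.10.166.38", 445) inside the container would
exercise the same tcp_v4_connect path that telnet does. It says nothing about
the path taken by the CIFS kernel thread.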

I'm unable to reproduce the issue at will, but from time to time a
container suddenly starts to hit it after having worked fine for
several days. The user then complains that a mounted CIFS folder has
become inaccessible due to -ENETUNREACH (-101). Only a reboot works
around the issue temporarily, and the issue may recur later.

The VM kernel is 5.4.0-1064-azure [1], and I don't know whether mainline
has the issue. I debugged the issue using ftrace and bpftrace in a
VM/container that was showing the issue, and the -ENETUNREACH error
happens this way:

tcp_v4_connect
  ip_route_connect
    __ip_route_output_key
      ip_route_output_key_hash
        ip_route_output_key_hash_rcu
          fib_lookup


static inline int fib_lookup(struct net *net, const struct flowi4 *flp,
			     struct fib_result *res, unsigned int flags)
{
	struct fib_table *tb;
	int err = -ENETUNREACH;

	rcu_read_lock();

	tb = fib_get_table(net, RT_TABLE_MAIN);
	if (tb)
		err = fib_table_lookup(tb, flp, res, flags | FIB_LOOKUP_NOREF);

	if (err == -EAGAIN)
		err = -ENETUNREACH;

	rcu_read_unlock();

	return err;
}

The above fib_table_lookup() returned -EAGAIN (-11), which is converted
to -ENETUNREACH.
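
To make the conversion explicit, here is a small userspace model of the
error handling at the end of fib_lookup() as quoted above. The function name
model_fib_lookup() and its two parameters are hypothetical stand-ins: the
first says whether fib_get_table() returned a table, the second is the value
fib_table_lookup() would have returned. This is only a sketch of the control
flow, not kernel code:

```c
#include <errno.h>

/*
 * Userspace model (an assumption-laden sketch) of the error handling in the
 * quoted fib_lookup():
 *   - no main table          -> -ENETUNREACH (the initial value of err)
 *   - table lookup -EAGAIN   -> -ENETUNREACH
 *   - anything else          -> passed through unchanged
 */
int model_fib_lookup(int have_table, int table_lookup_err)
{
	int err = -ENETUNREACH;

	if (have_table)
		err = table_lookup_err;

	if (err == -EAGAIN)
		err = -ENETUNREACH;

	return err;
}
```

So a caller such as tcp_v4_connect cannot tell from -ENETUNREACH alone
whether the table was missing or the lookup itself returned -EAGAIN; only
the tracepoint's "err -11" distinguishes the two.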

The code of fib_table_lookup() is complicated [1] and the pre-defined
tracepoint in the function doesn't reveal why the cifs kernel thread fails
to find an outgoing routing table entry while the telnet program can find
the entry:

cifsd-4809 [001] .... 94040.997416: fib_table_lookup: table 254 oif 0 iif 1 proto 6 0.0.0.0/0 -> 10.10.166.38/445 tos 0 scope 0 flags 0 ==> dev - gw 0.0.0.0/:: err -11
telnet-4195 [003] .... 94041.005634: fib_table_lookup: table 254 oif 0 iif 1 proto 6 0.0.0.0/0 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err 0
telnet-4195 [003] .... 94041.005638: fib_table_lookup: table 254 oif 0 iif 1 proto 6 10.133.162.32/0 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err 0
telnet-4195 [003] .... 94041.005643: fib_table_lookup: table 254 oif 0 iif 1 proto 6 10.133.162.32/41670 -> 10.10.166.38/445 tos 16 scope 0 flags 0 ==> dev eth0 gw 10.133.162.1/:: err
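
When a new repro produces many such lines, comparing fields side by side may
be easier with a small parser. The sketch below pulls the "tos" and "err"
fields out of one fib_table_lookup trace line. The helper name
parse_fib_trace() is hypothetical, and it assumes the text format shown
above. In the lines above, the cifsd lookup shows tos 0 with err -11 while
the telnet lookups show tos 16 with err 0; whether the tos difference
matters is not established here.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: extract the "tos" and "err" fields from one
 * fib_table_lookup trace line in the format shown above.
 * Returns 0 on success, or -1 if either field is missing or has no value
 * (for example, a line cut off right after "err").
 */
int parse_fib_trace(const char *line, int *tos, int *err)
{
	const char *p;

	/* Match on " tos " with surrounding spaces to avoid partial words. */
	p = strstr(line, " tos ");
	if (!p || sscanf(p, " tos %d", tos) != 1)
		return -1;

	p = strstr(line, " err ");
	if (!p || sscanf(p, " err %d", err) != 1)
		return -1;

	return 0;
}
```

Feeding the cifsd line to parse_fib_trace() would give tos 0 and err -11. A
line that ends at "err" with no value is rejected rather than guessed at.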

I was trying to check the input parameters of the related functions using
bpftrace, but unfortunately I lost the repro when the VM was rebooted by
accident.

It would be great to have your insights while I'm waiting for a new repro.

Thanks!
-- Dexuan

[1] https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-azure/+git/bionic/tree/net/ipv4/fib_trie.c?h=Ubuntu-azure-5.4-5.4.0-1064.67_18.04.1#n1312