Re: [REGRESSION] connection timeout with routes to VRF

From: Jan Luebbe
Date: Wed Jul 06 2022 - 14:48:57 EST


On Sun, 2022-06-26 at 21:06 +0100, Mike Manning wrote:
...
> Andy Roulin suggested the same fix to the same problem a few weeks back.
> Let's do it along with a test case in fcnl-test.sh which covers all of
> these vrf permutations.
>
Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
needed when no VRF route leaking has been configured between these: there may be
unintended leaking of packets arriving on a device enslaved to an l3mdev due to
the potential match on an unbound socket.

Thanks for the explanation.

VRF route leaking requires routes to be present for both ingress and egress
VRFs,
the testcase shown only has a route from default to red VRF. The implicit return
path from red to default VRF due to match on unbound socket is no longer
present.


If there is a better configuration that makes this work in the general case
without a change to the kernel, we'd be happy as well.

In our full setup, the outbound TCP connection (from the default VRF) gets a
local IP from the interface enslaved to the VRF. Before 3c82a21f4320, this would
simply work.

How would the return path route from the red VRF to the default VRF look in that
case?

Match on unbound socket in all VRFs and not only in the default VRF should be
possible by setting this option (see
https://www.kernel.org/doc/Documentation/networking/vrf.txt):


Do you mean unbound as in listening socket not bound to an IP with bind()? Or as
in a socket in the default VRF?

sysctl net.ipv4.tcp_l3mdev_accept=1


The sysctl docs sound like this should only apply to listening sockets. In this
case, we have an unconnected outbound socket.

However, for this to work a change similar to the following is needed (I have
shown the change to the macro for consistency with above, it is now an inline
fn):


I can also test on master and only used the macro form only because I wasn't
completely sure how to translate it to the inline function form.

---
 include/net/inet_hashtables.h |   10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -300,9 +300,8 @@
 #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif,
__sdif) \
        (((__sk)->sk_portpair == (__ports))                     &&      \
         ((__sk)->sk_addrpair == (__cookie))                    &&      \
-        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
-         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
-        net_eq(sock_net(__sk), (__net)))
+        net_eq(sock_net(__sk), (__net))                        &&      \
+        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
(__sdif)))
 #else /* 32-bit arch */
 #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
        const int __name __deprecated __attribute__((unused))
@@ -311,9 +310,8 @@
        (((__sk)->sk_portpair == (__ports))             &&              \
         ((__sk)->sk_daddr      == (__saddr))           &&              \
         ((__sk)->sk_rcv_saddr  == (__daddr))           &&              \
-        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
-         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
-        net_eq(sock_net(__sk), (__net)))
+        net_eq(sock_net(__sk), (__net))                &&              \
+        inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif),
(__sdif)))
 #endif /* 64-bit arch */

 /* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need

I can confirm that this gets my testcase working with
net.ipv4.tcp_l3mdev_accept=1.

This is to get the testcase to pass, I will leave it to others to comment on
the testcase validity in terms of testing forwarding using commands on 1 device.

So a network-namespace-based testcase would be preferred? We used the simple
setup because it seemed easier to understand.

The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
by the Vyatta team, who regularly run an extensive test suite for routing
protocols
for VRF functionality incl. all combinations of route leaking between default
and
other VRFs, so there is no known issue in this regard. I will attempt to reach
out
to them so as to advise them of this thread.

Are these testcases public? Perhaps I could use them find a better configuration
that handles our use-case.

Thanks,

Jan