Re: [REGRESSION] connection timeout with routes to VRF

From: Mike Manning
Date: Sun Jun 26 2022 - 16:06:59 EST


On 11/06/2022 17:44, David Ahern wrote:
> On 6/11/22 5:14 AM, Jan Luebbe wrote:
>> Hi,
>>
>> TL;DR: We think we have found a regression in the handling of VRF route leaking
>> caused by "net: allow binding socket in a VRF when there's an unbound socket"
>> (3c82a21f4320).
> This is the 3rd report in the past few months about this commit.
>
> ...
>
>> Our minimized test case looks like this:
>> ip rule add pref 32765 from all lookup local
>> ip rule del pref 0 from all lookup local
>> ip link add red type vrf table 1000
>> ip link set red up
>> ip route add vrf red unreachable default metric 8192
>> ip addr add dev red 172.16.0.1/24
>> ip route add 172.16.0.0/24 dev red
>> ip vrf exec red socat -dd TCP-LISTEN:1234,reuseaddr,fork SYSTEM:"echo connected" &
>> sleep 1
>> nc 172.16.0.1 1234 < /dev/null
>>
> ...
> Thanks for the detailed analysis and reproducer.
>
>> The partial revert
>> diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
>> index 98e1ec1a14f0..41e7f20d7e51 100644
>> --- a/include/net/inet_hashtables.h
>> +++ b/include/net/inet_hashtables.h
>> @@ -310,8 +310,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
>>  #define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
>>         (((__sk)->sk_portpair == (__ports))                     &&      \
>>          ((__sk)->sk_addrpair == (__cookie))                    &&      \
>> -        (((__sk)->sk_bound_dev_if == (__dif))                  ||      \
>> -         ((__sk)->sk_bound_dev_if == (__sdif)))                &&      \
>> +        (!(__sk)->sk_bound_dev_if      ||                              \
>> +          ((__sk)->sk_bound_dev_if == (__dif))                 ||      \
>> +          ((__sk)->sk_bound_dev_if == (__sdif)))               &&      \
>>          net_eq(sock_net(__sk), (__net)))
>>  #else /* 32-bit arch */
>>  #define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
>> @@ -321,8 +322,9 @@ static inline struct sock *inet_lookup_listener(struct net *net,
>>         (((__sk)->sk_portpair == (__ports))             &&              \
>>          ((__sk)->sk_daddr      == (__saddr))           &&              \
>>          ((__sk)->sk_rcv_saddr  == (__daddr))           &&              \
>> -        (((__sk)->sk_bound_dev_if == (__dif))          ||              \
>> -         ((__sk)->sk_bound_dev_if == (__sdif)))        &&              \
>> +        (!(__sk)->sk_bound_dev_if      ||                              \
>> +          ((__sk)->sk_bound_dev_if == (__dif))         ||              \
>> +          ((__sk)->sk_bound_dev_if == (__sdif)))       &&              \
>>          net_eq(sock_net(__sk), (__net)))
>>  #endif /* 64-bit arch */
>>
>> restores the original behavior when applied on v5.18. This doesn't apply
>> directly on master, as the macro was replaced by an inline function in "inet:
>> add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()" (4915d50e300e).
>>
>> I have to admit I don't quite understand 3c82a21f4320, so I'm not sure how to
>> proceed. What would be broken by the partial revert above? Are there better ways
>> to configure routing into the VRF than simply "ip route add 172.16.0.0/24 dev
>> red" that still work?
>>
>> Thanks,
>> Jan
>>
>> #regzbot introduced: 3c82a21f4320
>>
>>
>>
> Andy Roulin suggested the same fix to the same problem a few weeks back.
> Let's do it along with a test case in fcnl-test.sh which covers all of
> these vrf permutations.
>
Reverting 3c82a21f4320 would remove isolation between the default and other VRFs
needed when no VRF route leaking has been configured between these: there may be
unintended leaking of packets arriving on a device enslaved to an l3mdev due to
the potential match on an unbound socket.

VRF route leaking requires routes to be present for both ingress and egress VRFs,
the testcase shown only has a route from default to red VRF. The implicit return
path from red to default VRF due to match on unbound socket is no longer present.

Match on unbound socket in all VRFs and not only in the default VRF should be
possible by setting this option (see
https://www.kernel.org/doc/Documentation/networking/vrf.txt):

sysctl net.ipv4.tcp_l3mdev_accept=1

However, for this to work a change similar to the following is needed (I have
shown the change to the macro for consistency with above, it is now an inline fn):

---
include/net/inet_hashtables.h | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -300,9 +300,8 @@
#define INET_MATCH(__sk, __net, __cookie, __saddr, __daddr, __ports, __dif, __sdif) \
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_addrpair == (__cookie)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
- net_eq(sock_net(__sk), (__net)))
+ net_eq(sock_net(__sk), (__net)) && \
+ inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif), (__sdif)))
#else /* 32-bit arch */
#define INET_ADDR_COOKIE(__name, __saddr, __daddr) \
const int __name __deprecated __attribute__((unused))
@@ -311,9 +310,8 @@
(((__sk)->sk_portpair == (__ports)) && \
((__sk)->sk_daddr == (__saddr)) && \
((__sk)->sk_rcv_saddr == (__daddr)) && \
- (((__sk)->sk_bound_dev_if == (__dif)) || \
- ((__sk)->sk_bound_dev_if == (__sdif))) && \
- net_eq(sock_net(__sk), (__net)))
+ net_eq(sock_net(__sk), (__net)) && \
+ inet_sk_bound_dev_eq((__net), (__sk)->sk_bound_dev_if, (__dif), (__sdif)))
#endif /* 64-bit arch */

/* Sockets in TCP_CLOSE state are _always_ taken out of the hash, so we need


This is to get the testcase to pass, I will leave it to others to comment on
the testcase validity in terms of testing forwarding using commands on 1 device.

The series that 3c82a21f4320 is part of were introduced into the kernel in 2018
by the Vyatta team, who regularly run an extensive test suite for routing protocols
for VRF functionality incl. all combinations of route leaking between default and
other VRFs, so there is no known issue in this regard. I will attempt to reach out
to them so as to advise them of this thread.

Thanks
Mike