Re: [PATCH net-next v19 00/26] Introducing OpenVPN Data Channel Offload

From: Antonio Quartulli
Date: Fri Feb 14 2025 - 08:54:35 EST


On 13/02/2025 20:40, Antonio Quartulli wrote:
On 13/02/2025 16:46, Sabrina Dubroca wrote:
2025-02-13, 12:46:34 +0100, Antonio Quartulli wrote:
On 13/02/2025 00:34, Sabrina Dubroca wrote:
Hello,

2025-02-11, 01:39:53 +0100, Antonio Quartulli wrote:
All minor and major reported problems have been finally addressed.
Big thanks to Sabrina, who took the time to guide me through
converting the peer socket to an RCU pointer.

Something is off (not sure if it's new to this version): if I use
test-tcp.sh to setup a set of interfaces and peers (I stop the test
just after setup to keep the environment alive), then remove all netns
with "ip -all netns delete", I expect all devices to go away, but they
don't. With debug messages enabled I'm seeing some activity from the
module ("tun0: sending keepalive to peer 3" and so on), and
ovpn_net_uninit/ovpn_priv_free never got called.

I can reproduce it. If later I rmmod ovpn I then get all the "Deleting peer"
messages.
So instances are not being purged on netns exit.

Will dive into it.

I think the socket holds a ref on the netns, so it's not getting
destroyed, simply "removed" from iproute's point of view. And the
socket isn't going away as long as it's used by a peer.

After a deep dive I was getting to the same conclusion:
cleanup_net() is never invoked because we lack an invocation of put_net().
sk_alloc() invokes get_net(), so this is the ref that is not being released.


If I delete the peer(s) for the ovpn device and then the netns it was
in, the netns is fully removed, and the ovpn device is gone. Also no
issue if I delete the ovpn device before its netns, then everything is
destroyed as expected.

I'm not sure that can be solved, as least under the current refcount
scheme.

I went back to v12 of the patchset (which is pre-refcount-restructuring) and indeed the problem there doesn't exist.

However, it's unclear to me how in v12 the socket release was happening upon netns delete. Who was triggering that? ovpn still needed to call sockfd_put() in order to let it go.

Will investigate some more and think about a solution.


I may have done something wrong, but today I tried again and I am reproducing this issue also on v8 + 6.11.

I am indeed wondering how it could have ever worked: if we don't delete the peer, we don't detach the socket, hence the following chain:

sockfd_put() -> sk_destructor() -> put_net()

does not happen.

Shouldn't we be notified somehow that the netns is going away?

For example in wireguard/device.c the socket is released in pernet_operations.pre_exit().

But pre_exit() is invoked in cleanup_net(), which is invoked ONLY if the net refcount has reached 0...but how can it be zero before the sockets have been released?

I must be missing something, because this seems to be a reference loop.


I'll keep digging..

Regards,


--
Antonio Quartulli
OpenVPN Inc.