Re: [PATCH net-next v9] l2tp: fix double dst_release() on sk_dst_cache race

From: Paolo Abeni

Date: Thu Jun 04 2026 - 09:40:04 EST

From: AI Reviewer <ai@xxxxxxxxxxx>

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.
Full review at: https://netdev-ai.bots.linux.dev/sashiko/
---
l2tp: fix double dst_release() on sk_dst_cache race

This commit reworks the L2TP transmit path to stop using
inet6_csk_xmit() for IPv6 and to pre-route IPv4 packets via a new
helper, with the goal of avoiding a double dst_release() race between
sk_dst_check() and __sk_dst_check() on sk->sk_dst_cache observed under
pppol2tp traffic. New helpers l2tp_xmit_ipv6() and l2tp_xmit_ipv4()
replace the former inet6_csk_xmit() / ip_queue_xmit() calls in
l2tp_xmit_queue().

A few questions on the new helpers below.

> diff --git a/net/l2tp/l2tp_core.c b/net/l2tp/l2tp_core.c
> index f940914959b1..54e837fec201 100644
> --- a/net/l2tp/l2tp_core.c
> +++ b/net/l2tp/l2tp_core.c
> @@ -1205,19 +1205,114 @@ static int l2tp_build_l2tpv3_header(struct l2tp_session *session, void *buf)
> return bufp - optr;
> }
>
> +#if IS_ENABLED(CONFIG_IPV6)
> +static int l2tp_xmit_ipv6(struct sock *sk, struct sk_buff *skb)
> +{
> + struct ipv6_pinfo *np = inet6_sk(sk);
> + struct inet_sock *inet = inet_sk(sk);
> + struct in6_addr *final_p, final;
> + struct ipv6_txoptions *opt;
> + struct dst_entry *dst;
> + struct flowi6 fl6;
> + int err;

[ ... ]

> + dst = ip6_sk_dst_lookup_flow(sk, &fl6, final_p, true);
> + if (IS_ERR(dst)) {
> + rcu_read_unlock();
> + kfree_skb(skb);
> + return NET_XMIT_DROP;
> + }

[Medium]
Should this error path return a negative errno rather than
NET_XMIT_DROP?

l2tp_xmit_queue() converts the result of this function with:

return err >= 0 ? NET_XMIT_SUCCESS : NET_XMIT_DROP;

NET_XMIT_DROP is 0x01 (a positive value), so the wrapper takes the
err >= 0 branch and returns NET_XMIT_SUCCESS even though the skb has
already been freed via kfree_skb(). l2tp_xmit_skb() then bumps
session/tunnel tx_packets and tx_bytes (rather than tx_errors) for a
packet that never went out.

The IPv4 sibling l2tp_xmit_ipv4() returns -EHOSTUNREACH on the
analogous failure, and the inet6_csk_xmit() being replaced returned
PTR_ERR(dst) (negative) here. Returning PTR_ERR(dst) or -EHOSTUNREACH
would let the wrapper map this to NET_XMIT_DROP and have
l2tp_xmit_skb() count it as an error.

> +
> + skb_dst_set(skb, dst);
> + fl6.daddr = sk->sk_v6_daddr;
> +
> + err = ip6_xmit(sk, skb, &fl6, READ_ONCE(sk->sk_mark),
> + opt, np->tclass,
> + READ_ONCE(sk->sk_priority));
> + rcu_read_unlock();
> + return err;
> +}
> +#endif
> +
> +static int l2tp_xmit_ipv4(struct sock *sk, struct sk_buff *skb, struct flowi *fl)
> +{
> + struct inet_sock *inet = inet_sk(sk);
> + struct net *net = sock_net(sk);
> + struct ip_options_rcu *inet_opt;
> + struct flowi4 *fl4;
> + struct rtable *rt;
> + __u8 tos;
> + int err;
> +
> + rcu_read_lock();
> + inet_opt = rcu_dereference(inet->inet_opt);
> + fl4 = &fl->u.ip4;
> + tos = READ_ONCE(inet->tos);
> +
> + rt = dst_rtable(sk_dst_check(sk, 0));

[High]
Does the cache-hit path leak a reference on the cached dst?

sk_dst_check() ultimately calls sk_dst_get(), which does
rcuref_get(&dst->__rcuref) and returns a dst with a caller-owned +1
reference.

> + if (!rt) {
> + __be32 daddr = inet->inet_daddr;
> +
> + if (inet_opt && inet_opt->opt.srr)
> + daddr = inet_opt->opt.faddr;
> +
> + rt = ip_route_output_ports(net, fl4, sk,
> + daddr, inet->inet_saddr,
> + inet->inet_dport,
> + inet->inet_sport,
> + sk->sk_protocol,
> + tos & INET_DSCP_MASK,
> + READ_ONCE(sk->sk_bound_dev_if));
> + if (IS_ERR(rt)) {
> + rcu_read_unlock();
> + IP_INC_STATS(net, IPSTATS_MIB_OUTNOROUTES);
> + kfree_skb_reason(skb, SKB_DROP_REASON_IP_OUTNOROUTES);
> + return -EHOSTUNREACH;
> + }
> +
> + sk_setup_caps(sk, &rt->dst);
> + }
> +
> + skb_dst_set_noref(skb, &rt->dst);
> + rcu_read_unlock();
> +
> + err = ip_queue_xmit(sk, skb, fl);
> + return err;
> +}

[High]
Continuing on the cache-hit path: skb_dst_set_noref() stores the
pointer with SKB_DST_NOREF and explicitly does not consume a
reference, and ip_queue_xmit()/__ip_queue_xmit() only drop noref-
flagged dsts via skb_dst_drop()/refdst_drop without releasing a
reference. The +1 reference returned by sk_dst_check() is never put.

For comparison, net/ipv4/udp.c:udp_sendmsg() uses the same idiom and
explicitly calls ip_rt_put(rt) at its out: label to release this
reference. The previous behaviour relied on __ip_queue_xmit() using
__sk_dst_check() (the no-ref variant), which avoids this issue. Should
this helper also call ip_rt_put(rt) (or equivalent) on its way out?

Note that sk_dst_reset() drops only the cache's own ref, so the
leaked refs persist past socket close and the cached dst is never
freed, which appears to make this an unbounded leak under user-driven
sendmsg loops.

[Critical]
On the cache-miss path, can the rtable be freed before
__ip_queue_xmit() dereferences it?

ip_route_output_ports() returns rt with a +1 reference. sk_setup_caps()
calls sk_dst_set(), which xchg's sk_dst_cache and dst_releases the old
entry but does not take a new reference on the new rt; the caller's +1
ref is consumed by being installed into the cache slot.

After sk_setup_caps(), rt's only liveness anchor is sk->sk_dst_cache.
The function then does:

skb_dst_set_noref(skb, &rt->dst);
rcu_read_unlock();

err = ip_queue_xmit(sk, skb, fl);

__ip_queue_xmit() takes its own rcu_read_lock(), starting a new
critical section. In the gap between rcu_read_unlock() here and the
new rcu_read_lock() inside __ip_queue_xmit(), an RCU grace period can
complete on this CPU.

A concurrent sk_dst_reset() (from a route update, sockopt, or another
lockless caller observing the dst as obsolete, similar to the
udpv6_sendmsg pattern described in the commit message) can xchg
sk_dst_cache to NULL and dst_release(rt), dropping the refcount 1->0
and queueing the rtable for free via call_rcu_hurry / dst_destroy_rcu.
If the grace period elapses in the rcu_read_unlock window, the rtable
is freed before __ip_queue_xmit() dereferences skb_rtable(skb)
(rt_uses_gateway, rt->dst, ...) at the packet_routed label.

skb_dst_set_noref() requires an unbroken RCU read-side critical
section spanning the entire skb consumption, which is the invariant
__ip_queue_xmit() itself maintains (its rcu_read_lock() spans
skb_dst_set_noref through ip_local_out). Should this helper either
hold rcu_read_lock() across the ip_queue_xmit() call, or take a real
reference (skb_dst_set() with dst_hold(), or skip the noref path) so
that rt cannot be freed during the handoff?
--
This is an AI-generated review.