Re: [PATCH bpf] bpf,tcp: avoid infinite recursion in BPF_SOCK_OPS_HDR_OPT_LEN_CB

From: KaFai Wan

Date: Wed Apr 15 2026 - 16:48:56 EST

On Wed, 2026-04-15 at 11:55 -0700, Martin KaFai Lau wrote:
> On Tue, Apr 14, 2026 at 06:57:00PM +0800, Jiayuan Chen wrote:
> > A BPF_PROG_TYPE_SOCK_OPS program can set BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG
> > to inject custom TCP header options. When the kernel builds a TCP packet,
> > it calls tcp_established_options() to calculate the header size, which
> > invokes bpf_skops_hdr_opt_len() to trigger the BPF_SOCK_OPS_HDR_OPT_LEN_CB
> > callback.
> >
> > If the BPF program calls bpf_setsockopt(TCP_NODELAY) inside this callback,
> > __tcp_sock_set_nodelay() will call tcp_push_pending_frames(), which calls
> > tcp_current_mss(), which calls tcp_established_options() again,
> > re-triggering the same BPF callback. This creates an infinite recursion
> > that exhausts the kernel stack and causes a panic.
> >
> >   BPF_SOCK_OPS_HDR_OPT_LEN_CB
> >     -> bpf_setsockopt(TCP_NODELAY)
> >       -> tcp_push_pending_frames()
> >         -> tcp_current_mss()
> >           -> tcp_established_options()
> >             -> bpf_skops_hdr_opt_len()
> >                 /* infinite recursion */
> >               -> BPF_SOCK_OPS_HDR_OPT_LEN_CB
> >
> > A similar reentrancy issue exists for TCP congestion control, which is
> > guarded by tp->bpf_chg_cc_inprogress. Adopt the same approach: introduce
> > tp->bpf_hdr_opt_len_cb_inprogress, set it before invoking the callback in
> > bpf_skops_hdr_opt_len(), and check it in sol_tcp_sockopt() to reject
> > bpf_setsockopt(TCP_NODELAY) calls that would trigger
> > tcp_push_pending_frames() and cause the recursion.
> >
> > Reported-by: Quan Sun <2022090917019@xxxxxxxxxxxxxxxx>
> > Reported-by: Yinhao Hu <dddddd@xxxxxxxxxxx>
> > Reported-by: Kaiyan Mei <M202472210@xxxxxxxxxxx>
> > Reported-by: Dongliang Mu <dzm91@xxxxxxxxxxx>
> > Closes: https://lore.kernel.org/bpf/d1d523c9-6901-4454-a183-94462b8f3e4e@xxxxxxxxxxxxxxxx/
>
> Thanks for the report and fixes suggested across different threads.
>
> Using has_current_bpf_ctx() to avoid tcp_push_pending_frames() should
> work but it may change the expectation for bpf_setsockopt(TCP_NODELAY).
> e.g. A bpf_tcp_iter does bpf_setsockopt(TCP_NODELAY).
>
> Adding another bit in the tcp_sock is not ideal either. I agree with
> Alexei that it is better to reuse the existing bit if we go down this path.
> We also need to audit more closely if there are cases that two different
> type of bpf progs may call bpf_setsockopt(). e.g.
> bpf_tcp_iter does bpf_setsockopt(TCP_CONGESTION) to switch
> to a bpf_tcp_cc and the new bpf_tcp_cc->init() will also do
> bpf_setsockopt(xxx) which then will be rejected.
>
> Another fix could be, the bpf_setsockopt(TCP_NODELAY) is always broken
> for BPF_SOCK_OPS_HDR_OPT_LEN_CB and BPF_SOCK_OPS_WRITE_HDR_OPT_CB unless
> the bpf prog is doing some maneuver to avoid the recursion. Thus,
> this use case is basically broken as is and I don't see a use case
> for bpf_setsockopt(TCP_NODELAY) when writing header also.
> How about checking the bpf_sock->op, level, and optname in
> bpf_sock_ops_setsockopt() and return -EOPNOTSUPP?

Hi Martin, thanks for the review.

I'm working on the return -EOPNOTSUPP approach. I've finished the fix and a test, and will send the
patch later.

The fix is:

diff --git a/net/core/filter.c b/net/core/filter.c
index fcfcb72663ca..911ff04bca5a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5833,6 +5833,11 @@ BPF_CALL_5(bpf_sock_ops_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
 	if (!is_locked_tcp_sock_ops(bpf_sock))
 		return -EOPNOTSUPP;
 
+	if ((bpf_sock->op == BPF_SOCK_OPS_HDR_OPT_LEN_CB ||
+	     bpf_sock->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB) &&
+	    IS_ENABLED(CONFIG_INET) && level == SOL_TCP && optname == TCP_NODELAY)
+		return -EOPNOTSUPP;
+
 	return _bpf_setsockopt(bpf_sock->sk, level, optname, optval, optlen);
 }
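
To make the intended behaviour easier to review, here is a minimal user-space
sketch of the predicate the hunk above adds. It is not kernel code: the
model_* names are hypothetical, the two op values are assumed to mirror the
sock_ops op enum in include/uapi/linux/bpf.h, and IPPROTO_TCP stands in for
SOL_TCP (same numeric value). The IS_ENABLED(CONFIG_INET) term is left out
because it has no user-space equivalent.

```c
#include <errno.h>
#include <netinet/in.h>   /* IPPROTO_TCP, used here in place of SOL_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY, TCP_MAXSEG */
#include <stdbool.h>

/* Assumption: values mirrored from the uapi sock_ops op enum. */
enum model_sock_ops_op {
	MODEL_HDR_OPT_LEN_CB   = 14,	/* BPF_SOCK_OPS_HDR_OPT_LEN_CB */
	MODEL_WRITE_HDR_OPT_CB = 15,	/* BPF_SOCK_OPS_WRITE_HDR_OPT_CB */
};

/* True for the two callbacks that run while TCP header options are sized or written. */
static bool model_is_hdr_opt_cb(int op)
{
	return op == MODEL_HDR_OPT_LEN_CB || op == MODEL_WRITE_HDR_OPT_CB;
}

/*
 * Hypothetical stand-in for the new check in bpf_sock_ops_setsockopt().
 * Returns -EOPNOTSUPP when the call would be rejected, 0 when it would
 * fall through to _bpf_setsockopt().
 */
static int model_sock_ops_setsockopt_check(int op, int level, int optname)
{
	if (model_is_hdr_opt_cb(op) &&
	    level == IPPROTO_TCP && optname == TCP_NODELAY)
		return -EOPNOTSUPP;

	return 0;
}
```

Under these assumptions, only the TCP_NODELAY case is rejected in the two
header option callbacks. Other TCP options in those callbacks, and
TCP_NODELAY from any other sock_ops op, still reach _bpf_setsockopt().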

--
Thanks,
KaFai