Re: [RFC net] tls: TLS_SW sendfile() stalls at large MSS

From: Eric Dumazet

Date: Thu Jun 04 2026 - 09:16:56 EST


On Wed, Jun 3, 2026 at 11:53 PM WindowsForum.com <admin@xxxxxxxxxxxxxxxx> wrote:
>
> Thanks for testing. The non-reproduction may now be the key data point. My reproducer omitted a precondition my hosts happened to meet: a low net.ipv4.tcp_notsent_lowat. To reproduce, run this first:
>
> sysctl -w net.ipv4.tcp_notsent_lowat=16384
>
> Root cause
> ----------
> The stalling hosts have tcp_notsent_lowat=16384 (local web tuning); the stock default is effectively disabled. A TLS 1.3 record is 16406 bytes (TLS_MAX_PAYLOAD_SIZE 16384 + 22), just above that watermark -- so once tls_sw queues a single completed record, notsent (16406) exceeds the lowat, tcp_stream_memory_free() returns false, and tls_sw parks in sk_stream_wait_memory() holding exactly one corked record (the notsent:16406 + persist state from the original dump). With the default lowat, tls_sw keeps queuing, the MSG_MORE cork flushes at each sendfile() boundary, packets_out stays non-zero, and the persist timer never arms -- which is why stock kernels don't show it.
>
> Three conditions must coincide:
> (a) MSG_MORE forwarded on a completed record -> the sub-MSS record is corked [the bug];
> (b) tcp_notsent_lowat < one TLS record (16406) -> tls_sw blocks after that one record instead of streaming past it [the trigger I'd omitted];
> (c) large MSS -> the record is sub-MSS, so the cork engages [the amplifier].
>
> Confirmed by flipping only that knob: on a stalling host, restoring the default lowat -> 2.89 GiB/s; on a healthy host, setting lowat=16384 -> stalls (~0.0001 GiB/s). Everything that merely correlated (kernel build, congestion control/qdisc, wmem/rmem, tcp_mem, tcp_limit_output_bytes, CPU count, AES-GCM impl) was flip-tested and ruled out.
>
> This doesn't change the proposed fix: clearing MSG_MORE for a full record sends it immediately, so the deadlock can't form regardless of tcp_notsent_lowat.
>
> If you had not submitted your reply I don't think I would have kept testing it - hope this information is useful to the group.
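
The watermark arithmetic in the quoted root cause can be sketched in user space. This is a simplified model under stated assumptions, not the kernel code: the helper names are hypothetical, the 22-byte overhead is the figure given in the message, and the check mirrors only the "unsent bytes below the low-water mark" comparison described above, ignoring every other condition the kernel evaluates.

```c
#include <stdbool.h>
#include <stdint.h>

/* Figures taken from the quoted message. */
#define TLS_MAX_PAYLOAD_SIZE   16384u
#define TLS13_RECORD_OVERHEAD  22u

/* Hypothetical helper: on-wire size of one full TLS 1.3 record. */
uint32_t tls13_record_len(void)
{
    return TLS_MAX_PAYLOAD_SIZE + TLS13_RECORD_OVERHEAD;
}

/*
 * Simplified model of the notsent_lowat test: the stream counts as
 * writable only while the unsent byte count is below the watermark.
 */
bool stream_memory_free(uint32_t notsent_bytes, uint32_t notsent_lowat)
{
    return notsent_bytes < notsent_lowat;
}
```

With one queued record, stream_memory_free(tls13_record_len(), 16384) is false (16406 is not below 16384), so the writer waits. With the stock default, modeled here as UINT32_MAX, the same call is true and the writer keeps queuing.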


I see no reason to use such a small net.ipv4.tcp_notsent_lowat.

The recommended/practical value for this sysctl is 2MB to reach line rate
on modern NICs.

tcp sendmsg() has an skb granularity; an skb packs around 64KB of
payload (this might be bigger if BIG TCP is enabled),
so 16KB is clearly asking for trouble.
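
For applications that cannot change the system-wide sysctl, the same watermark can be set per socket with the TCP_NOTSENT_LOWAT socket option. The following is a minimal sketch, not something proposed in this thread: the wrapper names are hypothetical, and the fallback definition assumes the Linux value of the option for older libc headers.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_NOTSENT_LOWAT
#define TCP_NOTSENT_LOWAT 25   /* assumed Linux value, for older headers */
#endif

/* Hypothetical wrapper: set the per-socket unsent low-water mark in bytes. */
int set_notsent_lowat(int fd, unsigned int bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT,
                      &bytes, sizeof(bytes));
}

/* Hypothetical wrapper: read the per-socket value back (0 if never set). */
int get_notsent_lowat(int fd, unsigned int *bytes)
{
    socklen_t len = sizeof(*bytes);

    return getsockopt(fd, IPPROTO_TCP, TCP_NOTSENT_LOWAT, bytes, &len);
}
```

A caller would use set_notsent_lowat(fd, 2u * 1024 * 1024) on the sending socket to apply the 2MB figure mentioned above; a non-zero per-socket value takes precedence over net.ipv4.tcp_notsent_lowat for that socket.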

>
>
> On Wed, Jun 3, 2026 at 11:12 PM Jiayuan Chen <jiayuan.chen@xxxxxxxxx> wrote:
> >
> >
> > On 6/4/26 1:19 AM, Mike Fara wrote:
> > > Hi,
> > >
> > > Software-kTLS (TLS_SW) TX over sendfile()/splice() drops to the TCP
> > > persist-timer cadence (tens of KB/s, with individual sendfile() calls blocking
> > > for tens of seconds) when the path MSS is large -- e.g. loopback (MSS 65483) or
> > > jumbo frames. At a typical 1448-byte MSS it does not occur. Plain TCP
> > > sendfile() on the same path is unaffected, and kTLS write() (no splice) is
> > > unaffected, so it is specific to TLS_SW + the splice/sendfile path.
> > >
> > > It triggers only on large-MSS paths with software kTLS (no NIC TLS offload), so
> > > it is a niche path -- but it is a clean, reproducible multi-order-of-magnitude
> > > cliff, so it seems worth a look. Reproduces on current mainline. CCing David
> > > Howells as the author of the 2023 sendpage->MSG_SPLICE_PAGES splice_to_socket()
> > > rework referenced below, and Eric/Paolo as this is as much a TCP-corking
> > > interaction as a TLS one.
> > >
> > > Environment
> > > -----------
> > > - net/tls TLS_SW (no NIC offload; ethtool tls-hw-tx-offload: off [fixed]).
> > > - AES-GCM; gcm(aes) resolves to generic-gcm-vaes-avx512.
> > >
> > > Reproducer (no OpenSSL/handshake; TLS_TX programmed with a fixed key, the
> > > receiver discards ciphertext, like tools/testing/selftests/net/tls.c):
> > >
> > > cc -O2 -Wall -o ktls_sendfile_stall ktls_sendfile_stall.c
> > > ./ktls_sendfile_stall # default loopback MSS (65483)
> > > ./ktls_sendfile_stall 1448 # clamp sender MSS via TCP_MAXSEG
> > >
> > > Observed (loopback, single box):
> > >
> > > MSS=default sent= 4.0 MiB in 52.08s => 0.0001 GiB/s (stalled)
> > > MSS=1448 sent= 2048.0 MiB in 1.65s => 1.2106 GiB/s
> > >
> > > i.e. ~four orders of magnitude; at the default MSS a single sendfile() blocks
> > > for tens of seconds. For contrast, on the same loopback path:
> > >
> > > plain TCP sendfile() (no TLS ULP): 7.87 GiB/s
> > > kTLS write() (TLS_SW, no splice, 2 GiB): 1.99 GiB/s
> >
> >
> > I tested your ktls_sendfile_stall.c under stable 6.6 and upstream, but
> > both of them work correctly.
> >
> > ~/code/tmp$ ./ktls_sendfile_stall 1448
> > MSS=1448 sent= 2048.0 MiB in 2.32s => 0.8603 GiB/s
> > ~/code/tmp$ ./ktls_sendfile_stall 1448
> > MSS=1448 sent= 2048.0 MiB in 2.37s => 0.8439 GiB/s
> > ~/code/tmp$ ./ktls_sendfile_stall
> > MSS=default sent= 2048.0 MiB in 1.64s => 1.2204 GiB/s
> > ~/code/tmp$ ./ktls_sendfile_stall
> > MSS=default sent= 2048.0 MiB in 1.70s => 1.1737 GiB/s
> > ~/code/tmp$ ./ktls_sendfile_stall 1448
> > MSS=1448 sent= 2048.0 MiB in 2.33s => 0.8570 GiB/s
> >
> > ~/code/tmp$ ethtool -k lo | grep "tls-hw-tx-offload"
> > tls-hw-tx-offload: off [fixed]
> >