Re: [PATCH,net-next] tcp: Add TCP ROCCET congestion control module.

From: Neal Cardwell

Date: Tue Apr 07 2026 - 17:48:58 EST


On Sun, Apr 5, 2026 at 3:51 AM Tim Fuechsel <t.fuechsel@xxxxxx> wrote:
>
> TCP ROCCET is an extension of TCP CUBIC that improves its overall
> performance. By its mode of function, CUBIC causes bufferbloat while
> it tries to detect the available throughput of a network path. This is
> particularly a problem with large buffers in mobile networks. A more
> detailed description and analysis of this problem caused by TCP CUBIC
> can be found in [1].

Thanks for posting this patch. I agree that improving the bufferbloat
caused by CUBIC and similar algorithms is an important area for work.

> TCP ROCCET addresses this problem by adding two
> additional metrics to detect congestion (queueing and bufferbloat)
> on a network path.

Normally I think of bufferbloat as excessive queueing. Thus normally I
would think of queueing and bufferbloat as essentially the same metric.
So it seems confusing for this sentence to claim that these are two
additional metrics rather than one.

Furthermore, these mentions of "queueing" and "bufferbloat" are the
last appearances of those words in the entire patch. It's unclear to
the reader how your high-level description in this sentence connects
with the algorithm or the code.

Please clarify in the commit description what you mean by "two
additional metrics to detect congestion (queueing and bufferbloat)",
how you define "queueing", how you define "bufferbloat", and how the
algorithm measures and uses these metrics.

> TCP ROCCET achieves better performance than CUBIC
> and BBRv3, by maintaining similar throughput while reducing the latency.

Please reference figures in the paper and mention specific concrete
numerical examples of latency reductions to quantify these statements.

> In addition, TCP ROCCET does not have fairness issues when sharing a
> link with TCP CUBIC and BBRv3.

Can you please elaborate on this statement here? AFAICT from figures 7
and 8 in https://arxiv.org/pdf/2510.25281 it seems ROCCET is
essentially starved by CUBIC when sharing a bottleneck with CUBIC when
the bottleneck has 2*BDP or more of buffering. AFAICT it sounds like
ROCCET does have "fairness issues when sharing a link with TCP CUBIC"?

> A paper that evaluates the performance
> and function of TCP ROCCET has already been peer-reviewed and will be
> presented at the WONS 2026 conference. A draft of this paper can be
> found here [2].
>
> [1] https://doi.org/10.1109/VTC2023-Fall60731.2023.10333357
> [2] https://arxiv.org/abs/2510.25281

Thanks for the links to the paper on ROCCET. This is very helpful.

> Signed-off-by: Lukas Prause <lukas.prause@xxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Tim Fuechsel <t.fuechsel@xxxxxx>
> ---
> net/ipv4/Kconfig | 11 +
> net/ipv4/Makefile | 1 +
> net/ipv4/tcp_roccet.c | 686 ++++++++++++++++++++++++++++++++++++++++++
> net/ipv4/tcp_roccet.h | 60 ++++
> 4 files changed, 758 insertions(+)
> create mode 100644 net/ipv4/tcp_roccet.c
> create mode 100644 net/ipv4/tcp_roccet.h
>
> diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
> index 21e5164e30db..33625111c7f0 100644
> --- a/net/ipv4/Kconfig
> +++ b/net/ipv4/Kconfig
> @@ -663,6 +663,17 @@ config TCP_CONG_CDG
> delay gradients." In Networking 2011. Preprint:
> http://caia.swin.edu.au/cv/dahayes/content/networking2011-cdg-preprint.pdf
>
> +config TCP_CONG_ROCCET
> + tristate "ROCCET TCP"
> + default n
> + help
> + TCP ROCCET is a sender-side only modification of the TCP CUBIC
> + protocol stack that optimizes the performance of TCP congestion

s/TCP CUBIC protocol stack/TCP CUBIC congestion control algorithm/

> + control. Especially for networks with large buffers (wireless,
> + cellular networks), TCP ROCCET has improved performance by maintaining
> + similar throughput as CUBIC while reducing the latency.
> + For more information, see: https://arxiv.org/abs/2510.25281

nit: AFAICT there's an extra space in front of the word "For".

> +
> config TCP_CONG_BBR
> tristate "BBR TCP"
> default n
> diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
> index 7f9f98813986..82ed7989dcb3 100644
> --- a/net/ipv4/Makefile
> +++ b/net/ipv4/Makefile
> @@ -45,6 +45,7 @@ obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
> obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
> obj-$(CONFIG_INET_RAW_DIAG) += raw_diag.o
> obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
> +obj-$(CONFIG_TCP_CONG_ROCCET) += tcp_roccet.o
> obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
> obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
> obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
> diff --git a/net/ipv4/tcp_roccet.c b/net/ipv4/tcp_roccet.c
> new file mode 100644
> index 000000000000..b0ec3053182f
> --- /dev/null
> +++ b/net/ipv4/tcp_roccet.c
> @@ -0,0 +1,686 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * TCP ROCCET: An RTT-Oriented CUBIC Congestion Control
> + * Extension for 5G and Beyond Networks
> + *
> + * TCP ROCCET is a new TCP congestion control
> + * algorithm suited for current cellular 5G NR beyond networks.
> + * It extends the kernel default congestion control CUBIC
> + * and improves its performance, and additionally solves an
> + * unwanted side effects of CUBIC’s implementation.

Please specify what side effect or side effects ROCCET is claiming to
solve (presumably bufferbloat?).

> + * ROCCET uses its own Slow Start, called LAUNCH, where loss
> + * is not considered as a congestion event.

Expressed in isolation like this, that sounds potentially dangerous.
Please mention what signal(s) ROCCET uses to exit slow start if it's
not using loss.

In addition, from reading the code AFAICT the connection does use loss
to exit slow start (see my remarks below in this message). So AFAICT
this summary seems inaccurate, or at least misleading?

> +static __always_inline void update_min_rtt(struct sock *sk)
> +{
> + struct roccettcp *ca = inet_csk_ca(sk);
> + u32 now = jiffies_to_usecs(tcp_jiffies32);
> +
> + if (now - ca->curr_min_rtt_timed.time >
> + ROCCET_RTT_LOOKBACK_S * USEC_PER_SEC &&
> + calculate_min_rtt) {
> + u32 new_min_rtt = max(ca->curr_rtt, 1);
> + u32 old_min_rtt = ca->curr_min_rtt_timed.rtt;
> +
> + u32 interpolated_min_rtt =
> + (new_min_rtt * roccet_min_rtt_interpolation_factor +
> + old_min_rtt *
> + (100 - roccet_min_rtt_interpolation_factor)) /
> + 100;
> +
> + ca->curr_min_rtt_timed.rtt = interpolated_min_rtt;
> + ca->curr_min_rtt_timed.time = now;
> + }

If no lower RTT is found for 10 seconds, the algorithm interpolates
the `min_rtt` upwards towards the current RTT.

+ If the path is persistently congested (e.g., a large buffer is
constantly full), the `min_rtt` baseline will drift up.

+ This makes the algorithm less sensitive to queueing delay over
time, potentially defeating the purpose of reducing bufferbloat in the
long run. Contrast this with BBR, which actively drains the queue
(using the ProbeRTT mechanism) to try to find the true physical
minimum RTT.

Can you please add a comment explaining why the ROCCET algorithm takes
this approach, and how the algorithm expects to avoid queues that
ratchet ever higher?

> +/* Update ack rate sampled by 100ms.
> + */
> +static __always_inline void update_ack_rate(struct sock *sk)
> +{
> + struct roccettcp *ca = inet_csk_ca(sk);
> + u32 now = jiffies_to_usecs(tcp_jiffies32);
> + u32 interval = USEC_PER_MSEC * 100;
> +
> + if ((u32)(now - ca->ack_rate.last_rate_time) >= interval) {
> + ca->ack_rate.last_rate_time = now;
> + ca->ack_rate.last_rate = ca->ack_rate.curr_rate;
> + ca->ack_rate.curr_rate = ca->ack_rate.cnt;
> + ca->ack_rate.cnt = 0;
> + } else {
> + ca->ack_rate.cnt += 1;
> + }

Here, `cnt` is incremented by `1` on every call, regardless of the
`acked` value (number of packets ACKed in this event).

+ This measures ACK frequency rather than data delivery rate.

+ This approach is highly sensitive to receiver behavior like
Delayed ACKs, LRO (Large Receive Offload), or GRO (Generic Receive
Offload), where one ACK event might acknowledge many packets.

+ To measure rate, I would suggest accumulating bytes ACKed (or
packets ACKed) rather than just counting the number of ACK events.


> + if (ca->bw_limit.next_check == 0)
> + ca->bw_limit.next_check = now + 5 * ca->curr_rtt;
> +
> + ca->bw_limit.sum_cwnd += tcp_snd_cwnd(tp);
> + ca->bw_limit.sum_acked += acked;
> +
> + if (ca->bw_limit.next_check < now) {

This comparison (ca->bw_limit.next_check < now) does not properly
handle wrapping of the 32-bit timestamps. You probably want to
subtract the two numbers and look at the result, since subtraction
will handle the wrapping. Please see how tcp_cubic uses tcp_jiffies32
for examples of how to do this.

> + /* We send more data as we got acked in the last 5 RTTs */

This comment seems to have a typo; presumably it intends to say: "We
sent significantly more data than we got acked in the last 5 RTTs".

> + if ((ca->bw_limit.sum_cwnd * 100) / ca->bw_limit.sum_acked >=
> + ack_rate_diff_ca)
> + bw_limit_detect = 1;

AFAICT this logic for updating and using ca->bw_limit.sum_cwnd appears
to be mathematically flawed for its stated purpose:

+ `sum_cwnd` is accumulated on every ACK event. Over a period of 5
RTTs, if we assume continuous sending at window size $W$, the number
of ACK events is roughly proportional to $W$. Thus, `sum_cwnd` will be
roughly $5 * num_acks_per_round * W$.

+ `sum_acked` accumulates the number of packets ACKed (`acked`). Over
5 RTTs of continuous sending, this will simply be roughly the number
of packets ACKed, which is roughly $5 * W$ (if the flow is not
application-limited).

+ The quantity `sum_cwnd * 100 / sum_acked` will therefore be roughly
$(5 * num_acks_per_round * W) * 100 / (5 * W) = num_acks_per_round *
100$, not a measure of bandwidth limitation (it does not tell you if
you are really sending more data than is being ACKed).

+ With the default `ack_rate_diff_ca` of `200`, this condition will
become true for $sum_cwnd * 100 / sum_acked >= 200$, i.e.
$num_acks_per_round * 100 >= 200$. So AFAICT we expect this condition
to be true if there are 2 or more ACKs in a round trip. This makes
`bw_limit_detect` effectively a no-op or always-on trigger rather than
a true detector of queue growth or bandwidth limits.

If you want to really check whether the connection is sending
significantly more data than is being ACKed, then AFAICT you need to
address the following issues:

+ A cwnd is a per-round-trip number, not a per-ACK number (as it is
treated here).

+ Application-limited flows do not always send a full cwnd worth of
data (as the flow is assumed to do here).

+ Data sent is out of phase by one round trip with data ACKed, so if a
connection is growing its sending rate by a factor of X per round trip
then we expect the data sent in a round trip to be X times greater
than the data ACKed in that round trip even if the bottleneck
bandwidth is not saturated yet. So if you want to compare data sent vs
data ACKed, you need to keep this in mind.

Furthermore, AFAICT the ack_rate_diff_ca parameter used by this
algorithm differs massively from the value described in the paper. The
paper says: "If the amount of incoming ACKs over 5 RTTs deviates more
than 20 % from the cum_cwnd over the same time period". AFAICT
ack_rate_diff_ca is 200, thus this code checks for a 200% deviation,
not a 20% deviation.

Did the experiments in the paper use the approach documented in the
paper, or the approach documented in this code? They are very
different, AFAICT.


> +
> + /* reset struct and set next end of period */
> + ca->bw_limit.sum_cwnd = 1;
> +
> + /* set to 1 to avoid division by zero */
> + ca->bw_limit.sum_acked = 1;

Both of these are incorrect ways to reset these fields. Sums should be
reset to 0. To avoid division by zero, check for a denominator of 0
before the division.

> +__bpf_kfunc static u32 roccettcp_recalc_ssthresh(struct sock *sk)
> +{
> + const struct tcp_sock *tp = tcp_sk(sk);
> + struct roccettcp *ca = inet_csk_ca(sk);
> +
> + if (ignore_loss)
> + return tcp_snd_cwnd(tp);

Having a module parameter to ignore loss in this way makes it too easy
for users to cause excessive congestion. I would urge you to remove
that module parameter. Researchers can add that sort of mechanism in
their own code for research.

> +
> + /* Don't exit slow start if loss occurs. */
> + if (tcp_in_slow_start(tp))
> + return tcp_snd_cwnd(tp);

This comment seems incorrect. If roccettcp_recalc_ssthresh() is called
from tcp_init_cwnd_reduction() then AFAICT ssthresh will be set to
cwnd. This should cause tcp_in_slow_start() (which returns
tcp_snd_cwnd(tp) < tp->snd_ssthresh) to return false. So the flow
should no longer be in slow start. So AFAICT the flow has actually
exited slow start.

AFAICT what the comment means to say is something like: "When loss
occurs in slow start, exit slow start but do not decrease cwnd." Is
that what you mean to say?

If so, that sounds dangerous, by itself in isolation.

+ In general, in a loss-based algorithm like CUBIC, ignoring loss in
Slow Start is extremely dangerous. Slow Start is designed to probe
capacity exponentially; if it causes losses, it usually means it has
significantly overshot the available bandwidth.

+ By returning the current `cwnd` as the new `ssthresh`, the algorithm
will not back off properly on loss during Slow Start, potentially
causing massive congestion or severe unfairness to other flows.

Can you please add a comment explaining why you feel this
roccettcp_recalc_ssthresh() behavior is safe, and what it is trying to
achieve, at a high level?

Thanks,
neal