Re: [PATCH] can: j1939: fix wrong rx timeout for CTS hold messages

From: Hölzl, Alexander

Date: Thu Apr 23 2026 - 09:45:10 EST

Am 23.04.2026 um 15:07 schrieb Oleksij Rempel:

On Thu, Apr 23, 2026 at 11:35:27AM +0200, Hölzl, Alexander wrote:

Hello Oleksij,
thank you for your quick review!

Am 23.04.2026 um 05:50 schrieb Oleksij Rempel:

Hi Alexander,

On Tue, Apr 21, 2026 at 05:31:54PM +0200, Alexander Hölzl wrote:

In J1939 segmented transport, a CTS message with data byte 2 set to zero is interpreted as a hold message.
This instructs the transmitter of the segmented message to hold the connection open but to delay sending.
According to the J1939-21 standard, section 5.10.2.4 the timeout T4 after which an held open session is invalidated is
1050 ms, not 550 as implemented currently.
The 550 ms are problematic if a device uses hold messages and assumes it can wait for more than 550 ms before it has
to resend the hold message.

This patch changes the T4 timeout used in the implementation from 550 ms to 1050.

Signed-off-by: Alexander Hölzl <alexander.hoelzl@xxxxxxx>

LGTM. Thank you!

Acked-by: Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx>

Sashico detected one more potential issue, not related to this patch:
https://sashiko.dev/#/patchset/20260421153152.87772-3-alexander.hoelzl%40gmx.net

If you have time, can you please verify it?

I just tried it and to be honest it seems that holds are fundamentally
broken currently. I don't think there is any way to restart normal
communication as soon as a hold has been received.

When I send a hold with byte 3 set to FF and try to resume from sequence
number 1 I get an abort with reason 08 which is "Duplicate sequence number"
according to the spec:
(000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
(000.001166) can0 18ECF931 [8] 11 00 FF FF FF 00 AB 00
(000.101138) can0 18ECF931 [8] 11 02 01 FF FF 00 AB 00
(000.000685) can0 18EC31F9 [8] FF 08 FF FF FF 00 AB 00

The same happens when setting byte 3 to 01:
(000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
(000.001077) can0 18ECF931 [8] 11 00 01 FF FF 00 AB 00
(000.100910) can0 18ECF931 [8] 11 02 01 FF FF 00 AB 00
(000.000657) can0 18EC31F9 [8] FF 08 FF FF FF 00 AB 00

Setting it to 0 is disallowed as well and the transmission is cancelled
immediatley with error 05 which is "Maximum retransmit request limit
reached.":
(000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
(000.000941) can0 18ECF931 [8] 11 00 00 FF FF 00 AB 00
(000.000645) can0 18EC31F9 [8] FF 05 FF FF FF 00 AB 00

There is a check at the beggining of j1939_xtp_rx_cts_one for duplicate
sequence numbers which targets byte 0, so the command type byte, and checks
that it is not equal to the last command.

if (session->last_cmd == dat[0]) {
err = J1939_XTP_ABORT_DUP_SEQ;
goto out_session_cancel;
}

This means it is impossible to handle two directly succeeding CTS which
would be necessary to escape the hold....

The easiest way to fix this would probably be to move the check for a hold
message all the way to the top of j1939_xtp_rx_cts_one and if a hold message
has been received just set the rx-timeout timer and then bail?

From a quick lock, it sounds plausible. Will you send a patch?

Hm... we needs tests, preferably in kernel source to avoid regressions.

would it be possible to implement is on top of kunit tests?
https://lore.kernel.org/all/20260420152228.581421-1-o.rempel@xxxxxxxxxxxxxx/

It looks like there is more user space friendly testing used:
https://lore.kernel.org/all/20260419144600.GA4091724@chcpu16/

Sure I'll write a patch and tests!