Re: [PATCH] can: j1939: fix wrong rx timeout for CTS hold messages
From: Oleksij Rempel
Date: Thu Apr 23 2026 - 09:14:37 EST
On Thu, Apr 23, 2026 at 11:35:27AM +0200, Hölzl, Alexander wrote:
>
> Hello Oleksij,
> thank you for your quick review!
>
> Am 23.04.2026 um 05:50 schrieb Oleksij Rempel:
> > Hi Alexander,
> >
> > On Tue, Apr 21, 2026 at 05:31:54PM +0200, Alexander Hölzl wrote:
> > > In J1939 segmented transport, a CTS message with data byte 2 set to zero is interpreted as a hold message.
> > > This instructs the transmitter of the segmented message to hold the connection open but to delay sending.
> > > According to the J1939-21 standard, section 5.10.2.4 the timeout T4 after which an held open session is invalidated is
> > > 1050 ms, not 550 as implemented currently.
> > > The 550 ms are problematic if a device uses hold messages and assumes it can wait for more than 550 ms before it has
> > > to resend the hold message.
> > >
> > > This patch changes the T4 timeout used in the implementation from 550 ms to 1050.
> > >
> > > Signed-off-by: Alexander Hölzl <alexander.hoelzl@xxxxxxx>
> >
> > LGTM. Thank you!
> >
> > Acked-by: Oleksij Rempel <o.rempel@xxxxxxxxxxxxxx>
> >
> >
> > Sashico detected one more potential issue, not related to this patch:
> > https://sashiko.dev/#/patchset/20260421153152.87772-3-alexander.hoelzl%40gmx.net
> >
> > If you have time, can you please verify it?
> I just tried it and to be honest it seems that holds are fundamentally
> broken currently. I don't think there is any way to restart normal
> communication as soon as a hold has been received.
>
> When I send a hold with byte 3 set to FF and try to resume from sequence
> number 1 I get an abort with reason 08 which is "Duplicate sequence number"
> according to the spec:
> (000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
> (000.001166) can0 18ECF931 [8] 11 00 FF FF FF 00 AB 00
> (000.101138) can0 18ECF931 [8] 11 02 01 FF FF 00 AB 00
> (000.000685) can0 18EC31F9 [8] FF 08 FF FF FF 00 AB 00
>
> The same happens when setting byte 3 to 01:
> (000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
> (000.001077) can0 18ECF931 [8] 11 00 01 FF FF 00 AB 00
> (000.100910) can0 18ECF931 [8] 11 02 01 FF FF 00 AB 00
> (000.000657) can0 18EC31F9 [8] FF 08 FF FF FF 00 AB 00
>
> Setting it to 0 is disallowed as well and the transmission is cancelled
> immediatley with error 05 which is "Maximum retransmit request limit
> reached.":
> (000.000000) can0 18EC31F9 [8] 10 0A 00 02 02 00 AB 00
> (000.000941) can0 18ECF931 [8] 11 00 00 FF FF 00 AB 00
> (000.000645) can0 18EC31F9 [8] FF 05 FF FF FF 00 AB 00
>
> There is a check at the beggining of j1939_xtp_rx_cts_one for duplicate
> sequence numbers which targets byte 0, so the command type byte, and checks
> that it is not equal to the last command.
>
> if (session->last_cmd == dat[0]) {
> err = J1939_XTP_ABORT_DUP_SEQ;
> goto out_session_cancel;
> }
>
> This means it is impossible to handle two directly succeeding CTS which
> would be necessary to escape the hold....
>
> The easiest way to fix this would probably be to move the check for a hold
> message all the way to the top of j1939_xtp_rx_cts_one and if a hold message
> has been received just set the rx-timeout timer and then bail?
>From a quick lock, it sounds plausible. Will you send a patch?
Hm... we needs tests, preferably in kernel source to avoid regressions.
would it be possible to implement is on top of kunit tests?
https://lore.kernel.org/all/20260420152228.581421-1-o.rempel@xxxxxxxxxxxxxx/
It looks like there is more user space friendly testing used:
https://lore.kernel.org/all/20260419144600.GA4091724@chcpu16/
--
Pengutronix e.K. | |
Steuerwalder Str. 21 | http://www.pengutronix.de/ |
31137 Hildesheim, Germany | Phone: +49-5121-206917-0 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |