Re: [PATCH 1/1] can: m_can: Control tx flow to avoid message stuck

From: Markus Schneider-Pargmann
Date: Wed Jan 15 2025 - 10:23:09 EST


On Mon, Jan 13, 2025 at 03:58:09PM +0000, Mohan, Subramanian wrote:
> Hi @Markus Schneider-Pargmann,
>
> > -----Original Message-----
> > From: Markus Schneider-Pargmann <msp@xxxxxxxxxxxx>
> > Sent: Thursday, January 9, 2025 9:14 PM
> > To: Mohan, Subramanian <subramanian.mohan@xxxxxxxxx>
> > Cc: rcsekar@xxxxxxxxxxx; davem@xxxxxxxxxxxxx; edumazet@xxxxxxxxxx;
> > kuba@xxxxxxxxxx; pabeni@xxxxxxxxxx; balbi@xxxxxxxxxx; Tan, Raymond
> > <raymond.tan@xxxxxxxxx>; jarkko.nikula@xxxxxxxxxxxxxxx; linux-
> > can@xxxxxxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > linux@xxxxxxxxxxxxxxx; lst@xxxxxxxxxxxxxx; Hahn, Matthias
> > <matthias.hahn@xxxxxxxxx>; Chinnadurai, Srinivasan
> > <srinivasan.chinnadurai@xxxxxxxxx>
> > Subject: Re: [PATCH 1/1] can: m_can: Control tx flow to avoid message stuck
> >
> > Hi,
> >
> > On Wed, Jan 08, 2025 at 02:31:12PM +0530,
> > subramanian.mohan@xxxxxxxxx wrote:
> > > From: Subramanian Mohan <subramanian.mohan@xxxxxxxxx>
> > >
> > > The prolonged testing of passing can messages between two Elkhartlake
> > > platforms resulted in message stuck i.e Message did not receive at
> > > receiver side
> >
> > Can you please describe the reason for the stuck messages in your commit
> > message? I am reading this but I don't understand why this happens or why
> > your proposed solution helps.
>
> Let me describe problem bit more:
> We are using 2 different Python Scripts(client and server) on both of the Elkhart lake connected systems.
> The "server" script sends out messages with Arbitration ID's, and then waits for a response. If the Arbitration ID is different than the
> one expected or no message arrives it logs an error.
> The "client" script listens for messages, and depending on the Arbitration ID received it sends a message with a specific Arbitration ID back.
> We have deployed both the scripts in 2 different systems and triggered the testing
> If any message is lost/stuck then the "server" - Script will log an error.
> The Message stuck corresponds over here, whenever the server sends out message and waits for reply, we wont me getting the reply message
> On the server side. Even though time slice increase in scripts did not help. On further debugging enabling TX/TEFN impacts the processing load.
> To overcome this we disabled the TX/TEFN interrupt once processed and enable it back in the TX start xmit function.

Sorry, I still don't understand what is happening in the driver/hardware
that requires the interrupt to be disabled and enabled again later. Is
it causing interrupt_handler runs that are not associated with a real
interrupt? Is it causing real interrupts without anything being sent?
Does the clearing of the interrupts not work?

>
> >
> > >
> > > Contolling TX i.e TEFN bit helped to resolve the message stuck issue.
> > >
> > > The current solution is enhanced/optimized from the below patch:
> > > https://lore.kernel.org/lkml/20230623051124.64132-1-kumari.pallavi@int
> > > el.com/T/
> > >
> > > Setup used to reproduce the issue:
> > >
> > > +---------------------+ +----------------------+
> > > |Intel ElkhartLake | |Intel ElkhartLake |
> > > | +--------+ | | +--------+ |
> > > | |m_can 0 | |<=======>| |m_can 0 | |
> > > | +--------+ | | +--------+ |
> > > +---------------------+ +----------------------+
> > >
> > > Steps to be run on the two Elkhartlake HW:
> > > 1)Bus-Rate is 1 MBit/s
> > > 2)Busload during the test is about 40% 3)we initialize the CAN with
> > > following commands 4)ip link set can0 txqueuelen 100/1024/2048 5)ip
> > > link set can0 up type can bitrate 1000000
> > >
> > > Python scripts are used send and receive the can messages between the
> > > EHL systems.
> > >
> > > Signed-off-by: Hahn Matthias <matthias.hahn@xxxxxxxxx>
> > > Signed-off-by: Subramanian Mohan <subramanian.mohan@xxxxxxxxx>
> > > ---
> > > drivers/net/can/m_can/m_can.c | 11 +++++++++--
> > > 1 file changed, 9 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/can/m_can/m_can.c
> > > b/drivers/net/can/m_can/m_can.c index 97cd8bbf2e32..0a2c9a622842
> > > 100644
> > > --- a/drivers/net/can/m_can/m_can.c
> > > +++ b/drivers/net/can/m_can/m_can.c
> > > @@ -1220,7 +1220,7 @@ static void m_can_coalescing_update(struct
> > > m_can_classdev *cdev, u32 ir) static int
> > > m_can_interrupt_handler(struct m_can_classdev *cdev) {
> > > struct net_device *dev = cdev->net;
> > > - u32 ir = 0, ir_read;
> > > + u32 ir = 0, ir_read, new_interrupts;
> > > int ret;
> > >
> > > if (pm_runtime_suspended(cdev->dev)) @@ -1283,6 +1283,9 @@
> > static
> > > int m_can_interrupt_handler(struct m_can_classdev *cdev)
> > > ret = m_can_echo_tx_event(dev);
> > > if (ret != 0)
> > > return ret;
> > > +
> > > + new_interrupts = cdev->active_interrupts &
> > ~(IR_TEFN);
> > > + m_can_interrupt_enable(cdev, new_interrupts);
> >
> > Here is a theoretical situation of two messages being sent. The first is being
> > sent and handled in this interrupt handler. Then it would disable the TEFN bit
> > right? If the second message wasn't done sending yet, how would it ever call
> > the interrupt handler if the interrupt is disabled?
>
> With this patch we are controlling only TEFN/TX interrupt bit, rest of the interrupts remains unaffected.
> Since We are enabling/disabling TEFN bit only, interrupt handler will be called normally with other interrupts.

I am worried about the next TEFN interrupt here. I am currently not
sure, can the mcan driver for non-peripheral setups have multiple
messages being sent at the same time? If that is possible then what
happens to the other messages that are in flight when you disable TEFN?

Best
Markus

>
> >
> > Also you are disabling this interrupt here regardless of the type of mcan device
> > and also regardless of the coalescing state. In the transmit part you are only
> > enabling it for non-peripheral devices. For peripheral mcan devices this would
> > also introduce an additional two transfers per transmit.
>
> TEFN bit enabling/disabling applies only to non-peripheral device.
> While disabling the TEFN bit in interrupt handler, we will add the check for non-peripheral device before disabling it(V2).
> On the coalescing state, The snapshot is already taken while entering the interrupt handler.
>
> >
> > In which situations is this really necessary? Does it help to implement
> > coalescing for non-peripheral devices?
> This helps in heavy load/traffic conditions
> Not exactly sure on the coalescing part.
>
> Thanks,
> Subbu
>
> >
> > Best
> > Markus
> >
> > > }
> > > }
> > >
> > > @@ -1989,6 +1992,7 @@ static netdev_tx_t m_can_start_xmit(struct
> > sk_buff *skb,
> > > struct m_can_classdev *cdev = netdev_priv(dev);
> > > unsigned int frame_len;
> > > netdev_tx_t ret;
> > > + u32 new_interrupts;
> > >
> > > if (can_dev_dropped_skb(dev, skb))
> > > return NETDEV_TX_OK;
> > > @@ -2008,8 +2012,11 @@ static netdev_tx_t m_can_start_xmit(struct
> > > sk_buff *skb,
> > >
> > > if (cdev->is_peripheral)
> > > ret = m_can_start_peripheral_xmit(cdev, skb);
> > > - else
> > > + else {
> > > + new_interrupts = cdev->active_interrupts | IR_TEFN;
> > > + m_can_interrupt_enable(cdev, new_interrupts);
> > > ret = m_can_tx_handler(cdev, skb);
> > > + }
> > >
> > > if (ret != NETDEV_TX_OK)
> > > netdev_completed_queue(dev, 1, frame_len);
> > > --
> > > 2.35.3
> > >

Attachment: signature.asc
Description: PGP signature