Re: [PATCH 2/2] serial: qcom-geni: Fix qcom_geni_serial_stop_tx_fifo() while xfer

From: Doug Anderson
Date: Thu May 30 2024 - 18:51:13 EST


Hi,

On Fri, May 24, 2024 at 5:17 PM Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
>
> 1. It appears that M_GP_LENGTH can still advance after the FIFO
> becomes 0, which is extra proof that the transfer is still happening
> even though the FIFO says it's done. Presumably we could keep track of
> how many bytes we have enqueued into the FIFO for this command and
> then compare. As I was trying to do this, though, I noticed another
> option...
>
> 2. It appears that instead of "cancelling" the current command we can
> just issue a new 0-byte transfer and wait for the 0-byte transfer to
> be "done". This causes geni to give us back a "M_CMD_OVERRUN"
> interrupt, but that's fine and we can ignore it. That interrupt just
> says "hey, you gave me a command before the previous one was done" but
> it does seem to properly accept the new command and it doesn't drop
> any bytes.
>
> ...it turns out that we (apparently) already have been using option #2
> to interrupt a transfer without dropping bytes. When the UART is
> shared between an agetty and the kernel console this happens all the
> time. In qcom_geni_serial_console_write() we'll issue a new command
> before finishing a current one and then re-issue the current command
> with any remaining bytes. So not only should this be safe but it's
> already tested to work.
>
> I'll need to spend a little more time on this to really confirm it
> works as I expect and I'll send up a v2 using approach #2.

Just to connect the dots more, I did more testing and option #2
totally didn't work. The console/kdb stuff was working mostly through
ugly fragile hacks. Polling GP_LENGTH seemed like the only option,
though when I dug more I found that this wasn't the right place to do
it anyway...


> Also note that while spending more time on this I found _yet another_
> bug, this one more serious. My original testing was done on kernel 6.6
> (with stable backports) and I just did confirmation on mainline.
> That's why I didn't see this new bug originally. ...but this time I
> spent more time testing on mainline. It turns out that the recent
> patches for kfifo on mainline have badly broken geni serial.
> Specifically, if you just do "cat /var/log/messages" and then "Ctrl-C"
> the machine will hard lockup! Yikes! This is yet another side effect
> of the geni "packet"-based protocol biting us (so related to the
> problems in ${SUBJECT}, but not identical). Whenever we setup a TX
> transfer we set it up for the number of bytes in the queue at the
> time. If that number goes down then we're in trouble. Specifically, it
> can be noted that:
> * When we start transmitting we look at the current queue size, setup
> a transfer, and store "tx_remaining".
> * Whenever there's space in the FIFO we add bytes and remove them from
> the queue and "tx_remaining".
> * We don't ever expect bytes to disappear from the queue. I think in
> the old code if this happened we're just transfer some bogus bytes.
> Now we'll loop in qcom_geni_serial_send_chunk_fifo() because
> uart_fifo_out() will keep returning 0.
>
> I'll try to take a gander at that, too...

I spent _far_ too long poking at this and I've finally come up with a
v2 that _I think_ fixes everything. For reference:

https://lore.kernel.org/r/20240530224603.730042-1-dianders@xxxxxxxxxxxx