Re: [PATCH v2 6/7] serial: qcom-geni: Fix suspend while active UART xfer
From: Doug Anderson
Date: Tue Jun 04 2024 - 12:04:50 EST
Hi,
On Fri, May 31, 2024 at 8:13 AM Ilpo Järvinen
<ilpo.jarvinen@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, 30 May 2024, Douglas Anderson wrote:
>
> > On devices using Qualcomm's GENI UART it is possible to get the UART
> > stuck such that it no longer outputs data. Specifically, I could
> > reproduce this problem by logging in via an agetty on the debug serial
> > port (which was _not_ used for kernel console) and running:
> > cat /var/log/messages
> > ...and then (via an SSH session) forcing a few suspend/resume cycles.
> >
> > Digging into this showed a number of problems that are all related.
> >
> > The root of the problems was with qcom_geni_serial_stop_tx_fifo()
> > which is called as part of the suspend process. Specific problems with
> > that function:
> > - When we cancel an in-progress "tx" command it doesn't appear to
> > fully drain the FIFO. That meant qcom_geni_serial_tx_empty()
> > continued to report that the FIFO wasn't empty. The
> > qcom_geni_serial_start_tx_fifo() function didn't re-enable
> > interrupts in this case so we'd never start transferring again.
> > - We cancelled the current "tx" command but we forgot to zero out
> > "tx_remaining". This confused logic elsewhere in the driver
> > - From experimentation, it appears that cancelling the "tx" command
> > could drop some of the queued up bytes. While maybe not the end of
> > the world, it doesn't seem like we should be dropping bytes when
> > stopping the FIFO, which is defined more of a "pause".
> >
> > One idea to fix the above would be to add FIFO draining to
> > qcom_geni_serial_stop_tx_fifo(). However, digging into the
> > documentation in serial_core.h for stop_tx() makes this seem like the
> > wrong choice. Specifically stop_tx() is called with local interrupts
> > disabled. Waiting for a FIFO (which might be 64 bytes big) to drain at
> > 115.2 kbps doesn't seem like a wise move.
> >
> > Ideally qcom_geni_serial_stop_tx_fifo() would be able to pause the
> > transmitter, but nothing in the documentation for the GENI UART makes
> > me believe that is possible.
> >
> > Given the lack of better choices, we'll change
> > qcom_geni_serial_stop_tx_fifo() to simply disable the
> > TX_FIFO_WATERMARK interrupt and call it a day. This seems OK as per
> > the serial core docs since stop_tx() is supposed to stop transferring
> > bytes "as soon as possible" and there doesn't seem to be any possible
> > way to stop transferring sooner. As part of this, get rid of some of
> > the extra conditions on qcom_geni_serial_start_tx_fifo() which simply
> > weren't needed and are now getting in the way. It's always fine to
> > turn the interrupts on if we want to receive and it'll be up to the
> > IRQ handler to turn them back off if somehow they're not needed. This
> > works fine.
> >
> > Unfortunately, doing just the above change causes new/different
> > problems with suspend/resume. Now if you suspend while an active
> > transfer is happening you can find that after resume time you're no
> > longer receiving UART interrupts at all. It appears to be important to
> > drain the FIFO and send a "cancel" command if the UART is active to
> > avoid this. Since we've already decided that
> > qcom_geni_serial_stop_tx_fifo() shouldn't be doing this, let's add the
> > draining / cancelling logic to the shutdown() call where it should be
> > OK to delay a bit. This is called as part of the suspend process via
> > uart_suspend_port().
> >
> > Finally, with all of the above, the test case where we're spamming the
> > UART with data and going through suspend/resume cycles doesn't kill
> > the UART and doesn't drop bytes.
> >
> > NOTE: though I haven't gone back and validated on ancient code, it
> > appears from code inspection that many of these problems have existed
> > since the start of the driver. In the very least, I could reproduce
> > the problems on vanilla v5.15. The problems don't seem to reproduce
> > when using the serial port for kernel console output and also don't
> > seem to reproduce if nothing is being printed to the console at
> > suspend time, so this is presumably why they were not noticed until
> > now.
>
> Hi,
>
> This was quite tiring to read. :-) It's has lots of useful information but
> it could be structured better.
>
> Could you try to rewrite this entire description so that it's easier to
> find the problem and final solution information from it. Start with those
> two things, and in that part, try to avoid detouring to extra branches you
> took while finding and solving the problem.
>
> You can place how the problem can be reproduced after you've described the
> root cause & final solution first. Extra information why some other
> approaches do not work is also useful information, but please place it
> after the final solution has been covered first.
>
> Also, try to avoid I/you/we, use imperative tone.
Sure. I'll try. It's always a tradeoff between providing too much
information and not providing enough. In general I find that providing
the thought process can help someone else who is likely going to go
through the same thing as they're trying to understand the patch, but
I agree it can also be overwhelming.
Sure. I've attempted to use the imperative tone when possible. In
general (unless my understanding is flawed) it's not possible to use
imperative when explaining to the reader how the hardware/driver works
or what the problem is and (IMO) we shouldn't fully remove these types
of explanations from the commit message. When describing what the
patch actually does, though, I've tried to make sure it's in
imperative form. If you have wording changes on v3 then please suggest
specific changes.
-Doug