Re: [PATCH v2] iopoll: use udelay() for initial polling

From: David Laight

Date: Fri May 29 2026 - 07:20:25 EST

On Thu, 28 May 2026 22:59:25 -0700
Peter Collingbourne <peter@xxxxxxxxx> wrote:

> On Fri, May 22, 2026 at 7:51 AM Mark Brown <broonie@xxxxxxxxxx> wrote:
> >
> > On Thu, May 21, 2026 at 10:03:28AM +0300, Jani Nikula wrote:
> > > On Wed, 20 May 2026, Peter Collingbourne <peter@xxxxxxxxx> wrote:
> >
> > > > That's what I had in v1; we decided this approach would better handle
> > > > misbehaving devices.
> > > > https://lore.kernel.org/all/20260517150253.031dec09@pumpkin/
> >
> > > I think the problem with trying to adapt to everything within
> > > read_poll_timeout() is that every step like this adapts to a *specific*
> > > use case, and once it gets specific enough, it's no longer usable to
> > > other scenarios.
> >
> > > Having to reimplement the whole thing in drivers is much worse than
> > > having to do two calls.
> >
> > > Could a staggered approach work?
> >
> > > ret = read_poll_timeout_atomic("short delay/timeout")
> > > if (ret)
> > > ret = read_poll_timeout("longer delay/timeout")
> >
> > > Then you have better control of the behaviour in the driver, instead of
> > > adapting a generic function to a specific use case.
> >
> > If we're doing that it still seems like it'd be better to have a helper
> > than just takes the two timeouts and does the right thing with them. My
> > original feedback here was that we should have helpers which encourage
> > good patterns, and TBH I expect there's probably a bunch of scenarios
> > where some fixed cutoff would be a worthwile improvement for very little
> > effort on the part of users.
>
> Thinking about it some more:
>
> Do we really want to udelay() for the initial busy wait? Assuming that
> we've decided to spend some time burning cycles, I think we may as
> well use those cycles to check the condition so that we have a chance
> of an early exit.
>
> As an experiment, I open coded it in the SPI NAND driver:
>
> diff --git a/drivers/spi/spi-mem.c b/drivers/spi/spi-mem.c
> index a09371a075d2..cce34ada7611 100644
> --- a/drivers/spi/spi-mem.c
> +++ b/drivers/spi/spi-mem.c
> @@ -1005,6 +1005,20 @@ int spi_mem_poll_status(struct spi_mem *mem,
> usleep_range((initial_delay_us >> 2) + 1,
> initial_delay_us);
>
> + {
> + ktime_t __start_time = ktime_get();
> + u64 __delay_timeout_us = 10;
> + ktime_t __delay_timeout =
> + ktime_add_us(__start_time, __delay_timeout_us);
> + while (ktime_compare(ktime_get(), __delay_timeout) <
> + 0) {
> + ret = spi_mem_read_status(mem, op, &status);
> + if (ret < 0)
> + return ret;
> + if ((status & mask) == match)
> + return 0;
> + }
> + }
> ret = read_poll_timeout(spi_mem_read_status, read_status_ret,
> (read_status_ret || ((status)
> & mask) == match),
> polling_delay_us, timeout_ms *
> 1000, false, mem,
>
> And it did in fact improve performance on my target compared to status
> quo and even v2.
>
> With hrtimers enabled, timing the UBI scan, 3 runs of each:
> - With no changes: 0.784s, 0.779s, 0.777s
> - With my v2: 0.687s, 0.686s, 0.687s
> - With the above patch: 0.665s, 0.665s, 0.665s
>
> So 3 options:
> 1. Take my v2 and change read_poll_timeout() to do the check during
> the busy wait.
> 2. Adjust read_poll_timeout_atomic() to busy wait using the check
> instead of udelay(), then change the SPI NAND driver to do the 2
> calls.
> 3. Introduce a new macro that does the 2 calls, and have the SPI NAND
> driver use it.

There is also the (difficult to measure) impact on overall system performance.
The versions that sleep let the cpu run other processes (or go to a low
power state).
Looping reading the status will load whatever bus handles the requests
between the cpu and spi interface logic (might be PCIe for example), that
could slow down accesses to other devices from other cpu.

I think I remember someone saying that the spi hardware interface normally
generates an interrupt when the request completes?
So for spi this is only fall-back code for a few systems.

Although I just looked at some code I have for doing bulk writes to an SPI
memory device (to write an fpga image).
That does about 35000 back to back writes.
The whole lot is in userspace - the driver just lets it mmap() the PCIe
registers. And the PCIe slave just converts 32bit accesses into 8 4-bit ones.
(I really should have supported read/write of AVX registers to reduce the
latency of reads - but reads are only really used for verify and I got the
CRC64 logic working in the end.)
The code has this comment:
/* Wait for the write - typically 0.6ms (max 5ms).
* In spite of the datasheet values, I'm seeing 200us writes. */
It waits 200us and then polls every 50us for 2 seconds.

-- David

>
> I don't have a strong opinion, but my preference would be either 1 or
> 3. I agree with Mark that it would make the preferred usage code
> easier to write (and to read), as there would be no need to duplicate
> the argument list or compute the timeouts correctly in every caller.
>
> Peter