Re: [PATCH v4 0/2] Improve the performance of bitmap_find_next_zero_area_off()

From: John Stultz

Date: Mon Jun 08 2026 - 22:15:34 EST

On Mon, Jun 8, 2026 at 6:06 PM Yury Norov <yury.norov@xxxxxxxxx> wrote:
> On Mon, Jun 08, 2026 at 05:54:20PM -0400, Yury Norov wrote:
> > On Mon, Jun 01, 2026 at 05:42:32PM +0800, Yi Sun wrote:
> > > Test code has been added to PATCH v2.
> > > No new APIs were introduced.
> > >
> > > Testing with the test code showed a performance improvement
> > > of approximately 70%.
> >
> > No, it's not. Your numbers show approximately 50% improvement for
> > the dense case, and approximately 2% slowdown for the sparse case.
> >
> > > Test result(random):
> > >        orig_ns  orig_cnt  orig_average  new_ns  new_cnt  new_average  ratio
> > > test1  1388885  1154      1203          462923  1308     353          70.7%
> > > test2  1393616  1324      1052          736193  1212     607          42.3%
> > > test3  1391693  1216      1144          735808  1260     583          49%
> > > test4  1393231  1275      1092          742731  1402     529          51.6%
> > > test5  1390731  1260      1103          737231  1274     578          47.6%
> > >
> > > Test result(sparse):
> > >        orig_ns  orig_cnt  orig_average  new_ns   new_cnt  new_average  ratio
> > > test1  4496077  322477    13            2419462  322480   7            46.2%
> > > test2  7514731  322482    23            5785808  322476   17           26.1%
> > > test3  7490692  322493    23            7654423  322483   23           0%
> > > test4  7474500  322469    23            7628230  322483   23           0%
> > > test5  7452692  322481    23            7663116  322478   23           0%
> >
> > The numbers look quite inconsistent. The first measurements are
> > significantly faster for almost all experiments. In the 'new sparse'
> > case the first run is 4 times faster than the others. And the ratio
> > 0% is simply wrong.
> >
> > Please run the test on real hardware, not virtualized. Please
> > build the test in, so that it's executed at boot time, or make sure
> > you're not running anything in parallel, like a GUI or networking.
> >
> > I gave your code a brief test on my qemu, and I have 43% improvement
> > in the dense case, with p-value 0.001; and -8% for sparse bitmap,
> > with the p-value 0.044, still significant.
> >
> > Overall not bad. But if some critical user actually has a sparse bitmap,
> > they'll be disappointed. There aren't that many actual users of the
> > function. For v5, can you CC at least those from the non-driver part?
> >
> > (The ARM GIC counts as non-driver, I believe.)
>
> OK, I traced the cma_alloc(), which calls the bitmap function through
> cma_range_alloc(), and the numbers are looking really strong:
>
> Metric                 Before       After        Change
> Trace span             194.0 ms     87.1 ms      -55.1%
> Total CMA alloc time   48.46 ms     16.11 ms     -66.8%
> Avg alloc latency      184.94 us    61.49 us     -66.8%
> Median alloc latency   73.72 us     20.59 us     -72.1%
> p90 alloc latency      329.76 us    55.63 us     -83.1%
> p99 alloc latency      1866.76 us   859.83 us    -53.9%
> Max alloc latency      4821.91 us   2324.41 us   -51.8%
>
> By request size:
>
> Request     Before Avg   After Avg   Change
> 1 page      79.68 us     34.47 us    -56.7%
> 256 pages   285.50 us    87.30 us    -69.4%
>
> I ran it on qemu, but the numbers are so impressive that I believe
> they will be reproduced on bare metal.
>
> The tracing command is:
>
>     sudo trace-cmd record \
>         -o cma-dmabuf.dat \
>         -b 65536 \
>         -e cma:cma_alloc_start \
>         -e cma:cma_alloc_finish \
>         -e cma:cma_alloc_busy_retry \
>         -e cma:cma_release \
>         -- kselftest/dmabuf-heaps/dmabuf-heap
>
> Can you run it on your side before sending v5, and share your results?
>
> Adding John Stultz, the test author.
>
> Hi John.
>
> This series improves the underlying bitmap_find_next_zero_area_off()
> significantly for an average bitmap, but shows a ~8% slowdown for sparse
> bitmaps. With your CMA allocator test, the results are even stronger
> than with the synthetic benchmark, and there seemingly are no
> drawbacks.
>
> Can you comment on the results and maybe reproduce them on your side?
> Are you or anyone else aware of any other useful tests for the CMA
> allocator? How important is the sparse bitmap case overall?
>

Pulling in TJ who has more recent context here.

thanks
-john