Re: [patch V3 12/18] posix-timers: Improve hash table performance

From: Frederic Weisbecker
Date: Tue Mar 11 2025 - 09:44:55 EST


On Sat, Mar 08, 2025 at 05:48:38PM +0100, Thomas Gleixner wrote:
> Eric and Ben reported a significant performance bottleneck on the global
> hash, which is used to store posix timers for lookup.
>
> Eric tried to do a lockless validation of a new timer ID before trying to
> insert the timer, but that does not solve the problem.
>
> For the non-contended case this is a pointless exercise and for the
> contended case this extra lookup just creates enough interleaving that all
> tasks can make progress.
>
> There are actually two real solutions to the problem:
>
> 1) Provide a per process (signal struct) xarray storage
>
> 2) Implement a smarter hash like the one in the futex code
>
> #1 works perfectly fine for most cases, but the fact that CRIU enforces a
> linearly increasing timer ID when restoring timers makes this problematic.
>
> It's easy enough to create a sparse timer ID space, which quickly adds up
> to a large chunk of memory consumed for the xarray. 2048 timers with an ID
> offset of 512 consume more than one megabyte of memory for the xarray
> storage.
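
[ Back-of-the-envelope check of that figure, assuming the usual 64-slot
  struct xa_node of roughly 576 bytes on a 64-bit kernel: with IDs spaced
  512 apart, each timer lands in its own leaf node, so 2048 timers need
  about 2048 * 576 bytes ~= 1.1 MB of leaf nodes alone, plus a handful of
  interior nodes on top. ]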
>
> #2 The main advantage of the futex hash is that it uses per hash bucket
> locks instead of a global hash lock. Aside from that, it is scaled
> according to the number of CPUs at boot time.
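
[ For illustration only, a minimal sketch of such a boot-time scaled hash
  with per-bucket locks, loosely modeled on what futex_init() does. All
  names and the 512-buckets-per-CPU scale factor below are invented for
  this sketch and are not taken from the actual patch: ]

#include <linux/cpumask.h>
#include <linux/hash.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/log2.h>
#include <linux/sched/signal.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

/* One lock per bucket instead of one global hash lock */
struct timer_hash_bucket {
	spinlock_t        lock;
	struct hlist_head head;
};

static struct timer_hash_bucket *timer_buckets;
static unsigned int timer_hash_bits;

/* Hash on (signal_struct, timer id), like the existing global hash does */
static struct timer_hash_bucket *hash_bucket(const struct signal_struct *sig,
					     unsigned int id)
{
	return &timer_buckets[hash_32(hash32_ptr(sig) ^ id, timer_hash_bits)];
}

static int __init posixtimer_hash_init(void)
{
	/* Scale the table with the number of possible CPUs at boot time */
	unsigned long i, size = roundup_pow_of_two(512 * num_possible_cpus());

	timer_buckets = kvmalloc_array(size, sizeof(*timer_buckets), GFP_KERNEL);
	if (!timer_buckets)
		panic("posix-timers: Cannot allocate hash table\n");

	timer_hash_bits = ilog2(size);
	for (i = 0; i < size; i++) {
		spin_lock_init(&timer_buckets[i].lock);
		INIT_HLIST_HEAD(&timer_buckets[i].head);
	}
	return 0;
}
core_initcall(posixtimer_hash_init);

[ kvmalloc_array() merely stands in for whatever allocator the real patch
  uses; the relevant parts are the spinlock embedded in each bucket and the
  num_possible_cpus() based sizing. ]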
>
> Experiments with artificial benchmarks have shown that a scaled hash with
> per bucket locks comes pretty close to the xarray performance, and in some
> scenarios it performs better.
>
> Test 1:
>
> A single process creates 20000 timers and afterwards invokes
> timer_getoverrun(2) on each of them:
>
>               mainline       Eric    newhash     xarray
> create           23 ms      23 ms       9 ms       8 ms
> getoverrun       14 ms      14 ms       5 ms       4 ms
>
> Test 2:
>
> A single process creates 50000 timers and afterwards invokes
> timer_getoverrun(2) on each of them:
>
>               mainline       Eric    newhash     xarray
> create           98 ms     219 ms      20 ms      18 ms
> getoverrun       62 ms      62 ms      10 ms       9 ms
>
> Test 3:
>
> A single process creates 100000 timers and afterwards invokes
> timer_getoverrun(2) on each of them:
>
>               mainline       Eric    newhash     xarray
> create          313 ms     750 ms      48 ms      33 ms
> getoverrun      261 ms     260 ms      20 ms      14 ms
>
> Eric's changes add quite some overhead in the create() path due to the
> double list walk, as the main issue according to perf is the list walk
> itself. With 100k timers each hash bucket contains ~200 timers, which in
> the worst case all need to be inspected. The same problem applies to
> getoverrun(), where the lookup has to walk through the hash bucket to find
> the timer it is looking for.
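
[ The ~200 figure matches the current hash size: assuming the mainline
  global hash has 512 buckets and the timers distribute evenly, 100000 /
  512 is about 195 timers per bucket. ]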
>
> The scaled hash obviously reduces hash collisions and lock contention
> significantly. This becomes more prominent with concurrency.
>
> Test 4:
>
> A process creates 63 threads and all threads wait on a barrier before
> each instance creates 20000 timers and afterwards invokes
> timer_getoverrun(2) on each of them. The threads are pinned on
> separate CPUs to achieve maximum concurrency. The numbers are the
> average times per thread:
>
>               mainline       Eric    newhash     xarray
> create       180239 ms   38599 ms     579 ms     813 ms
> getoverrun     2645 ms    2642 ms      32 ms       7 ms
>
> Test 5:
>
> A process forks 63 times and all forks wait on a barrier before each
> instance creates 20000 timers and afterwards invokes
> timer_getoverrun(2) on each of them. The processes are pinned on
> separate CPUs to achieve maximum concurrency. The numbers are the
> average times per process:
>
>               mainline       Eric    newhash     xarray
> create       157253 ms   40008 ms      83 ms      60 ms
> getoverrun     2611 ms    2614 ms      40 ms       4 ms
>
> So clearly the reduction of lock contention with Eric's changes makes a
> significant difference for the create() loop, but it does not mitigate the
> problem of long list walks. This is plainly visible on the getoverrun()
> side, because that path is purely dominated by the lookup itself. Once the
> timer is found, the syscall just reads from the timer structure, with no
> other locks or code paths involved, and returns.
>
> The reason for the difference between the thread and the fork case for the
> new hash and the xarray is that both suffer from contention on
> sighand::siglock and the xarray suffers additionally from contention on the
> xarray lock on insertion.
>
> The only case where the reworked hash slightly outperforms the xarray is a
> tight loop which creates and deletes timers.
>
> Test 6:
>
> A process creates 63 threads and all threads wait on a barrier before
> each instance runs a loop which creates and deletes a timer 100000
> times in a row. The threads are pinned on separate CPUs to achieve
> maximum concurrency. The numbers are the average times per thread:
>
>               mainline       Eric    newhash     xarray
> loop           5917 ms    5897 ms    5473 ms    7846 ms
>
> Test 7:
>
> A process forks 63 times and all forks wait on a barrier before each
> instance runs a loop which creates and deletes a timer 100000
> times in a row. The processes are pinned on separate CPUs to achieve
> maximum concurrency. The numbers are the average times per process:
>
>               mainline       Eric    newhash     xarray
> loop           5137 ms    7828 ms     891 ms     872 ms
>
> In both tests there is not much contention on the hash, but the ucount
> accounting for the signal and, in the thread case, the sighand::siglock
> contention (plus the xarray locking) dominate the overhead.
>
> As the memory consumption of the xarray in the sparse ID case is
> significant, the scaled hash with per bucket locks seems to be the better
> overall option. While the xarray has faster lookup times for a large number
> of timers, the actual syscall usage, which requires the lookup, is not an
> extreme hotpath. Most applications utilize signal delivery, and all syscalls
> except timer_getoverrun(2) are anything but cheap.
>
> So implement a scaled hash with per bucket locks, which offers the best
> tradeoff between performance and memory consumption.
>
> Reported-by: Eric Dumazet <edumazet@xxxxxxxxxx>
> Reported-by: Benjamin Segall <bsegall@xxxxxxxxxx>
> Signed-off-by: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

Acked-by: Frederic Weisbecker <frederic@xxxxxxxxxx>