Re: [PATCH] uprobes: reduce contention on uprobes_tree access

From: Jonathan Haslam
Date: Thu Apr 04 2024 - 06:46:13 EST


> > Things to note about the results:
> >
> > - The results are slightly variable so don't get too caught up on
> >   individual thread counts - it's the trend that is important.
> > - In terms of throughput with this specific benchmark, a *very* macro
> >   view is that the RW spinlock provides 40-60% more throughput than
> >   the spinlock, and the per-CPU RW semaphore provides in the order of
> >   50-100% more throughput than the spinlock.
> > - This doesn't fully reflect the large reduction in latency that we
> >   have seen in application-based measurements. However, it does
> >   demonstrate that even the trivial change of going to an RW spinlock
> >   provides significant benefits.
>
> This is probably because trig-uprobe-nop creates a single uprobe that
> is triggered on many CPUs, while in production we have *many* uprobes
> running on many CPUs. In this benchmark, besides contention on
> uprobes_treelock, we are also hammering on other per-uprobe locks
> (register_rwsem, and if you don't have the [0] patch applied locally,
> another filter lock, filter->rwlock, is taken each time). There is
> also atomic refcounting going on which, when you have the same uprobe
> hit across all CPUs at the same time, will cause a bunch of cache line
> bouncing.
>
> So yes, it's understandable that in production you see an even larger
> effect from optimizing uprobes_treelock than in this micro-benchmark.
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace.git/commit/?h=probes/for-next&id=366f7afd3de31d3ce2f4cbff97c6c23b6aa6bcdf

Thanks for the reply and the thoughts on this, Andrii. Yes, I do have
the filter->rwlock fix applied but, as you say, there are no doubt
other effects at play here, as is to be expected in such a synthetic
workload. I'm pleased with the outcomes, though, as they show a good
result even if it is at the lower end of what I expected.
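
For anyone following along, the shape of the RW spinlock change is
tiny: take uprobes_treelock for read on the lookup path and for write
on insert/delete. Very roughly (a sketch, not the literal patch):

    static DEFINE_RWLOCK(uprobes_treelock);    /* was DEFINE_SPINLOCK() */

    static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
    {
            struct uprobe *uprobe;

            read_lock(&uprobes_treelock);
            uprobe = __find_uprobe(inode, offset);  /* rb-tree walk */
            read_unlock(&uprobes_treelock);

            return uprobe;
    }

    static void delete_uprobe(struct uprobe *uprobe)
    {
            write_lock(&uprobes_treelock);
            rb_erase(&uprobe->rb_node, &uprobes_tree);
            write_unlock(&uprobes_treelock);
    }

The per-CPU RW semaphore variant is the same shape with
percpu_down_read()/percpu_up_read() and percpu_down_write()/
percpu_up_write(): reader accounting becomes per-CPU and so cheaper,
at the cost of a much more expensive writer side.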

The results also show that pursuing an RCU solution is definitely
worthwhile, but the write penalty is brutal if it means a full
synchronize_rcu()! Should be fun.
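
The direction I have in mind looks roughly like this (very much a
sketch - the uprobe lifetime/refcounting rules need real care):

    static struct uprobe *find_uprobe(struct inode *inode, loff_t offset)
    {
            struct uprobe *uprobe;

            rcu_read_lock();
            uprobe = __find_uprobe(inode, offset);
            /*
             * Real code must take a reference here, before leaving the
             * read-side critical section, or the uprobe can be freed
             * out from under us.
             */
            rcu_read_unlock();

            return uprobe;
    }

    static void delete_uprobe(struct uprobe *uprobe)
    {
            write_lock(&uprobes_treelock);  /* writers still serialise */
            rb_erase(&uprobe->rb_node, &uprobes_tree);
            write_unlock(&uprobes_treelock);

            synchronize_rcu();  /* the brutal bit: waits a full grace period */
            kfree(uprobe);
    }

call_rcu() (or SRCU) would take the grace-period wait off the writer's
back, but it makes the lifetime rules hairier.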

> > for num_threads in {1..20}
> > do
> >         sudo ./bench -p $num_threads trig-uprobe-nop | grep Summary
>
> Just want to mention the -a (affinity) option that you can pass to
> the bench tool; it will pin each thread to its own CPU. It generally
> makes tests more uniform by eliminating CPU migration variability.

Thanks for pointing that flag out!
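I'll fold it into the loop above for the next run, i.e.:

    sudo ./bench -a -p $num_threads trig-uprobe-nop | grep Summary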

Jon.

>
> > done
> >
> >
> > spinlock
> >
> > Summary: hits 1.453 ± 0.005M/s ( 1.453M/prod)
> > Summary: hits 2.087 ± 0.005M/s ( 1.043M/prod)
> > Summary: hits 2.701 ± 0.012M/s ( 0.900M/prod)
>
> I also wanted to point out that the first measurement (1.453M/s in
> the first row) is the total throughput across all threads, while the
> value in parentheses (e.g. 0.900M/prod) is the average throughput per
> thread. So this M/prod value is the most interesting one in this
> benchmark, where we assess the effect of reducing contention.
>
> > Summary: hits 1.917 ± 0.011M/s ( 0.479M/prod)
> > Summary: hits 2.105 ± 0.003M/s ( 0.421M/prod)
> > Summary: hits 1.615 ± 0.006M/s ( 0.269M/prod)
>
> [...]