Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

From: Andrii Nakryiko
Date: Mon Jul 08 2024 - 13:47:49 EST


On Thu, Jul 4, 2024 at 8:44 AM Paul E. McKenney <paulmck@xxxxxxxxxx> wrote:
>
> On Thu, Jul 04, 2024 at 11:15:59AM +0200, Peter Zijlstra wrote:
> > On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
> >
> > > 2. More tactically, RCU protection seems like the best way forward. We
> > > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > > flavor (whatever that is, I have no idea).
> > >
> > > Now, RCU Tasks Trace were specifically designed for least overhead
> > > hotpath (reader side) performance, at the expense of slowing down much
> > > rarer writers. My microbenchmarking does show at least 5% difference.
> > > Both flavors can handle sleepable uprobes waiting for page faults.
> > > Tasks Trace flavor is already used for tracing in the BPF realm,
> > > including for sleepable uprobes and works well. It's not going away.
> >
> > I need to look into this new RCU flavour and why it exists -- for
> > example, why can't SRCU be improved to gain the same benefits. This is
> > what we've always done, improve SRCU.
>
> Well, it is all software. And I certainly pushed SRCU hard. If I recall
> correctly, it took them a year to convince me that they needed something
> more than SRCU could reasonably be convinced to do.
>
> The big problem is that they need to be able to hook a simple BPF program
> (for example, count the number of calls with given argument values) on
> a fastpath function on a system running in production without causing
> the automation to decide that this system is too slow, thus whacking it
> over the head. Any appreciable overhead is a no-go in this use case.
> It is not just that the srcu_read_lock() function's smp_mb() call would
> disqualify SRCU, its other added overhead would as well. Plus this needs
> RCU Tasks Trace CPU stall warnings to catch abuse, and SRCU doesn't
> impose any limits on readers (how long to set the stall time?) and
> doesn't track tasks.
>
> > > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > > haven't seen a single argument why. Please provide that, or let's
> > > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > > of what Tasks Trace flavor was designed for.
> >
> > Because I actually know SRCU, and because it provides a local scope.
> > It isolates the unregister waiters from other random users. I'm not
> > going to use this funky new flavour until I truly understand it.
>
> It is only a few hundred lines of code on top of the infrastructure
> that also supports RCU Tasks and RCU Tasks Rude. If you understand
> SRCU and preemptible RCU, there will be nothing exotic there, and it is
> simpler than Tree SRCU, to say nothing of preemptible RCU. I would be
> more than happy to take you through it if you would like, but not before
> this coming Monday.
>
> > Also, we actually want two scopes here, there is no reason for the
> > consumer unreg to wait for the retprobe stuff.
>
> I don't know that the performance requirements for userspace retprobes are
> as severe as for function-call probes -- on that, I must defer to Andrii.

uretprobes are just as important (performance-wise and just in term of
functionality), as they are often used simultaneously (e.g., to time
some user function or capture input args and make decision whether to
log them based on return value). uretprobes are inherently slower
(because they are entry probe + some extra bookkeeping and overhead),
but we should do the best we can to ensure they are as performant as
possible


> To your two-scopes point, it is quite possible that SRCU could be used
> for userspace retprobes and RCU Tasks Trace for the others. It certainly
> seems to me that SRCU would be better than explicit reference counting,
> but I could be missing something. (Memory footprint, perhaps? Though
> maybe a single srcu_struct could be shared among all userspace retprobes.
> Given the time-bounded reads, maybe stall warnings aren't needed,
> give or take things like interrupts, preemption, and vCPU preemption.
> Plus it is not like it would be hard to figure out which read-side code
> region was at fault when the synchronize_srcu() took too long.)
>
> Thanx, Paul
>
> > > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > > batched register/unregister APIs, so we can amortize sync_rcu cost
> > > during deregistration. Can we please agree on that as well? This is
> > > the main goal of this patch set and I'd like to land it before working
> > > further on changing and improving the rest of the locking schema.
> >
> > See my patch here:
> >
> > https://lkml.kernel.org/r/20240704084524.GC28838@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > I don't think it needs to be more complicated than that.
> >
> > > I won't be happy about it, but just to move things forward, I can drop
> > > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > > beneficial but not essential for batched APIs work. But if you force
> > > me to do that, please state clearly your reasons/arguments.
> >
> > The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> > actually need the stronger serialisation that rwlock (any flavour)
> > provide. It is a prime candidate for RCU, and I think you'll find plenty
> > papers / articles (by both Paul and others) that show that RCU scales
> > better.
> >
> > As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> > has.
> >
> > The reason I'm not keen on that refcount thing was initially because I
> > did not understand the justification for it, but worse, once I did read
> > your justification, your very own numbers convinced me that the refcount
> > is fundamentally problematic, in any way shape or form.
> >
> > > No one had yet pointed out why refcounting is broken
> >
> > Your very own numbers point out that refcounting is a problem here.
> >
> > > and why percpu RW semaphore is bad.
> >
> > Literature and history show us that RCU -- where possible -- is
> > always better than any reader-writer locking scheme.
> >
> > > 4. Another tactical thing, but an important one. Refcounting schema
> > > for uprobes. I've replied already, but I think refcounting is
> > > unavoidable for uretprobes,
> >
> > I think we can fix that, I replied here:
> >
> > https://lkml.kernel.org/r/20240704083152.GQ11386@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
> >
> > > and current refcounting schema is
> > > problematic for batched APIs due to race between finding uprobe and
> > > there still being a possibility we'd need to undo all that and retry
> > > again.
> >
> > Right, I've not looked too deeply at that, because I've not seen a
> > reason to actually change that. I can go think about it if you want, but
> > meh.