Re: [PATCH v2 00/12] uprobes: add batched register/unregister APIs and per-CPU RW semaphore

From: Andrii Nakryiko
Date: Mon Jul 08 2024 - 13:49:22 EST


On Thu, Jul 4, 2024 at 2:16 AM Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>
> On Wed, Jul 03, 2024 at 02:33:06PM -0700, Andrii Nakryiko wrote:
>
> > 2. More tactically, RCU protection seems like the best way forward. We
> > got hung up on SRCU vs RCU Tasks Trace. Thanks to Paul, we also
> > clarified that RCU Tasks Trace has nothing to do with Tasks Rude
> > flavor (whatever that is, I have no idea).
> >
> > Now, RCU Tasks Trace were specifically designed for least overhead
> > hotpath (reader side) performance, at the expense of slowing down much
> > rarer writers. My microbenchmarking does show at least 5% difference.
> > Both flavors can handle sleepable uprobes waiting for page faults.
> > Tasks Trace flavor is already used for tracing in the BPF realm,
> > including for sleepable uprobes and works well. It's not going away.
>
> I need to look into this new RCU flavour and why it exists -- for
> example, why can't SRCU be improved to gain the same benefits. This is
> what we've always done, improve SRCU.

Yes, that makes sense, in principle. But if it takes too much time to
improve SRCU, I'd say it's reasonable to use the faster solution until
it can be unified (if at all, of course).

>
> > Now, you keep pushing for SRCU instead of RCU Tasks Trace, but I
> > haven't seen a single argument why. Please provide that, or let's
> > stick to RCU Tasks Trace, because uprobe's use case is an ideal case
> > of what Tasks Trace flavor was designed for.
>
> Because I actually know SRCU, and because it provides a local scope.
> It isolates the unregister waiters from other random users. I'm not
> going to use this funky new flavour until I truly understand it.
>
> Also, we actually want two scopes here, there is no reason for the
> consumer unreg to wait for the retprobe stuff.
>

Uprobe attachment/detachment (i.e., register/unregister) is a very
rare operation. Its performance doesn't really matter in the great
scheme of things. In the sense that whether it takes 1, 10, or 200
milliseconds is immaterial compared to uprobe/uretprobe triggering
performance. The only important thing is that it doesn't take multiple
seconds and minutes (or even hours, if we do synchronize_rcu
unconditionally after each unregister) to attach/detach 100s/1000s+
uprobes.

I'm just saying this is the wrong target to optimize for if we just
ensure that it's reasonably performant in the face of multiple uprobes
registering/unregistering. (so one common SRCU scope for
registration/unregistration is totally fine, IMO)


> > 3. Regardless of RCU flavor, due to RCU protection, we have to add
> > batched register/unregister APIs, so we can amortize sync_rcu cost
> > during deregistration. Can we please agree on that as well? This is
> > the main goal of this patch set and I'd like to land it before working
> > further on changing and improving the rest of the locking schema.
>
> See my patch here:
>
> https://lkml.kernel.org/r/20240704084524.GC28838@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> I don't think it needs to be more complicated than that.

Alright, I'll take a closer look this week and will run it through my
tests and benchmarks, thanks for working on this and sending it out!

>
> > I won't be happy about it, but just to move things forward, I can drop
> > a) custom refcounting and/or b) percpu RW semaphore. Both are
> > beneficial but not essential for batched APIs work. But if you force
> > me to do that, please state clearly your reasons/arguments.
>
> The reason I'm pushing RCU here is because AFAICT uprobes doesn't
> actually need the stronger serialisation that rwlock (any flavour)
> provide. It is a prime candidate for RCU, and I think you'll find plenty
> papers / articles (by both Paul and others) that show that RCU scales
> better.
>
> As a bonus, you avoid that horrific write side cost that per-cpu rwsem
> has.
>
> The reason I'm not keen on that refcount thing was initially because I
> did not understand the justification for it, but worse, once I did read
> your justification, your very own numbers convinced me that the refcount
> is fundamentally problematic, in any way shape or form.
>
> > No one had yet pointed out why refcounting is broken
>
> Your very own numbers point out that refcounting is a problem here.

Yes, I already agreed on avoiding refcounting if possible. The
question above was why the refcounting I added was broken by itself.
But it's a moot point (at least for now), let me go look at your
patches.

>
> > and why percpu RW semaphore is bad.
>
> Literature and history show us that RCU -- where possible -- is
> always better than any reader-writer locking scheme.
>
> > 4. Another tactical thing, but an important one. Refcounting schema
> > for uprobes. I've replied already, but I think refcounting is
> > unavoidable for uretprobes,
>
> I think we can fix that, I replied here:
>
> https://lkml.kernel.org/r/20240704083152.GQ11386@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
>
> > and current refcounting schema is
> > problematic for batched APIs due to race between finding uprobe and
> > there still being a possibility we'd need to undo all that and retry
> > again.
>
> Right, I've not looked too deeply at that, because I've not seen a
> reason to actually change that. I can go think about it if you want, but
> meh.

Ok, let's postpone that if we can get away with just sync/nosync
uprobe_unregister.