Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task

From: Andrii Nakryiko
Date: Fri Dec 06 2024 - 13:00:38 EST


On Fri, Dec 6, 2024 at 6:07 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
>
> On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:
>
> SNIP
>
> > +static void free_ret_instance(struct uprobe_task *utask,
> > + struct return_instance *ri, bool cleanup_hprobe)
> > +{
> > + unsigned seq;
> > +
> > if (cleanup_hprobe) {
> > enum hprobe_state hstate;
> >
> > @@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
> > hprobe_finalize(&ri->hprobe, hstate);
> > }
> >
> > - kfree(ri->extra_consumers);
> > - kfree_rcu(ri, rcu);
> > + /*
> > + * At this point return_instance is unlinked from utask's
> > + * return_instances list and this has become visible to ri_timer().
> > + * If seqcount now indicates that ri_timer's return instance
> > + * processing loop isn't active, we can return ri into the pool of
> > + * to-be-reused return instances for future uretprobes. If ri_timer()
> > + * happens to be running right now, though, we fallback to safety and
> > + * just perform RCU-delated freeing of ri.
> > + */
> > + if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
> > + /* immediate reuse of ri without RCU GP is OK */
> > + ri_pool_push(utask, ri);
>
> should the push be limitted somehow? I wonder you could make uprobes/consumers
> setup that would allocate/push many of ri instances that would not be freed
> until the process exits?

So I'm just relying on the existing MAX_URETPROBE_DEPTH limit that is
enforced by prepare_uretprobe anyways. But yes, we can have up to 64
instances in ri_pool.

I did consider cleaning this up from ri_timer() (that would be a nice
properly, because ri_timer fires after 100ms of inactivity), and my
initial version did use lockless llist for that, but there is a bit of
a problem: llist doesn't support popping single iter from the list
(you can only atomically take *all* of the items) in lockless way. So
my implementation had to swap the entire list, take one element out of
it, and then put N - 1 items back. Which, when there are deep chains
of uretprobes, would be quite an unnecessary CPU overhead. And I
clearly didn't want to add locking anywhere in this hot path, of
course.

So I figured that at the absolute worst case we'll just keep
MAX_URETPROBE_DEPTH items in ri_pool until the task dies. That's not
that much memory for a small subset of tasks on the system.

One more idea I explored and rejected was to limit the size of ri_pool
to something smaller than MAX_URETPROBE_DEPTH, say just 16. But then
there is a corner case of high-frequency long chain of uretprobes up
to 64 depth, then returning through all of them, and then going into
the same set of functions again, up to 64. So depth oscillates between
0 and full 64. In this case this ri_pool will be causing allocation
for the majority of those invocations, completely defeating the
purpose.

So, in the end, it felt like 64 cached instances (worst case, if we
actually ever reached such a deep chain) would be acceptable.
Especially that commonly I wouldn't expect more than 3-4, actually.

WDYT?

>
> jirka
>
> > + } else {
> > + /* we might be racing with ri_timer(), so play it safe */
> > + ri_free(ri);
> > + }
> > }
> >
> > /*

[...]