Re: [PATCH perf/core 4/4] uprobes: reuse return_instances between multiple uretprobes within task

From: Jiri Olsa
Date: Fri Dec 06 2024 - 19:37:12 EST


On Fri, Dec 06, 2024 at 10:00:16AM -0800, Andrii Nakryiko wrote:
> On Fri, Dec 6, 2024 at 6:07 AM Jiri Olsa <olsajiri@xxxxxxxxx> wrote:
> >
> > On Thu, Dec 05, 2024 at 04:24:17PM -0800, Andrii Nakryiko wrote:
> >
> > SNIP
> >
> > > +static void free_ret_instance(struct uprobe_task *utask,
> > > + struct return_instance *ri, bool cleanup_hprobe)
> > > +{
> > > + unsigned seq;
> > > +
> > > if (cleanup_hprobe) {
> > > enum hprobe_state hstate;
> > >
> > > @@ -1897,8 +1923,22 @@ static void free_ret_instance(struct return_instance *ri, bool cleanup_hprobe)
> > > hprobe_finalize(&ri->hprobe, hstate);
> > > }
> > >
> > > - kfree(ri->extra_consumers);
> > > - kfree_rcu(ri, rcu);
> > > + /*
> > > + * At this point return_instance is unlinked from utask's
> > > + * return_instances list and this has become visible to ri_timer().
> > > + * If seqcount now indicates that ri_timer's return instance
> > > + * processing loop isn't active, we can return ri into the pool of
> > > + * to-be-reused return instances for future uretprobes. If ri_timer()
> > > + * happens to be running right now, though, we fall back to safety and
> > > + * just perform RCU-delayed freeing of ri.
> > > + */
> > > + if (raw_seqcount_try_begin(&utask->ri_seqcount, seq)) {
> > > + /* immediate reuse of ri without RCU GP is OK */
> > > + ri_pool_push(utask, ri);
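
(for context, IIUC the writer side this seqcount check pairs with is the
return_instance processing loop in ri_timer(), something along the lines of:

	write_seqcount_begin(&utask->ri_seqcount);
	/* walk utask->return_instances and expire each ri->hprobe */
	...
	write_seqcount_end(&utask->ri_seqcount);

so raw_seqcount_try_begin() above only succeeds when that loop is not
currently running)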
> >
> > should the push be limited somehow? I wonder if you could make an
> > uprobes/consumers setup that would allocate/push many ri instances that
> > would not be freed until the process exits?
>
> So I'm just relying on the existing MAX_URETPROBE_DEPTH limit that is
> enforced by prepare_uretprobe() anyway. But yes, we can have up to 64
> instances in ri_pool.
>
> I did consider cleaning this up from ri_timer() (that would be a nice
> property, because ri_timer fires after 100ms of inactivity), and my
> initial version did use a lockless llist for that, but there is a bit of
> a problem: llist doesn't support popping a single item from the list in a
> lockless way (you can only atomically take *all* of the items). So my
> implementation had to swap the entire list, take one element out of it,
> and then put N - 1 items back. Which, when there are deep chains of
> uretprobes, would be quite an unnecessary CPU overhead. And I clearly
> didn't want to add locking anywhere in this hot path, of course.
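
so IIUC a lockless single-item pop on top of llist would have to look
something like this (just to illustrate the N - 1 re-add overhead; the
ri_llist_pool/llnode names are made up):

	static struct return_instance *ri_pool_pop(struct uprobe_task *utask)
	{
		struct llist_node *first, *rest, *last;

		/* no lockless "pop one item", so take the whole list... */
		first = llist_del_all(&utask->ri_llist_pool);
		if (!first)
			return NULL;

		rest = first->next;
		if (rest) {
			/* ...then walk to the tail just to push N - 1 nodes back */
			for (last = rest; last->next; last = last->next)
				;
			llist_add_batch(rest, last, &utask->ri_llist_pool);
		}

		return container_of(first, struct return_instance, llnode);
	}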
>
> So I figured that, in the absolute worst case, we'll just keep
> MAX_URETPROBE_DEPTH items in ri_pool until the task dies. That's not
> that much memory for a small subset of tasks on the system.
>
> One more idea I explored and rejected was to limit the size of ri_pool
> to something smaller than MAX_URETPROBE_DEPTH, say just 16. But then
> there is a corner case: a high-frequency, long chain of uretprobes up to
> depth 64, then returning through all of them, and then going into the
> same set of functions again, up to 64. So the depth oscillates between 0
> and the full 64. In this case ri_pool would be causing an allocation for
> the majority of those invocations, completely defeating its purpose.
>
> So, in the end, it felt like 64 cached instances (worst case, if we
> actually ever reached such a deep chain) would be acceptable.
> Especially since, commonly, I wouldn't expect more than 3-4, actually.
>
> WDYT?

ah ok, there's the MAX_URETPROBE_DEPTH limit per task, 64 should be fine
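
fwiw, for anyone else following along, that limit is the existing depth
check in prepare_uretprobe(), roughly (paraphrasing, not the exact code):

	if (utask->depth >= MAX_URETPROBE_DEPTH) {
		printk_ratelimited(KERN_INFO
			"uprobe: omit uretprobe due to nestedness limit\n");
		return;
	}

with MAX_URETPROBE_DEPTH defined as 64, so ri_pool is bounded by the same
64 entries per task.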

thanks,
jirka

>
> >
> > jirka
> >
> > > + } else {
> > > + /* we might be racing with ri_timer(), so play it safe */
> > > + ri_free(ri);
> > > + }
> > > }
> > >
> > > /*
>
> [...]