Re: [PATCH 2/2] trace/kprobe: Remove limit on kretprobe maxactive

From: Naveen N. Rao
Date: Thu Jun 17 2021 - 12:20:28 EST

Masami Hiramatsu wrote:
On Tue, 15 Jun 2021 23:11:27 +0530
"Naveen N. Rao" <naveen.n.rao@xxxxxxxxxxxxxxxxxx> wrote:

Masami Hiramatsu wrote:
> On Mon, 14 Jun 2021 23:33:29 +0530
> "Naveen N. Rao" <naveen.n.rao@xxxxxxxxxxxxxxxxxx> wrote:
>
>> We currently limit maxactive for a kretprobe to 4096 when registering
>> the same through tracefs. The comment indicates that this is done so as
>> to keep list traversal reasonable. However, we don't ever iterate over
>> all kretprobe_instance structures. The core kprobes infrastructure also
>> imposes no such limitation.
>>
>> Remove the limit from the tracefs interface. This limit is easy to hit
>> on large cpu machines when tracing functions that can sleep.
>>
>> Reported-by: Anton Blanchard <anton@xxxxxxxxxx>
>> Signed-off-by: Naveen N. Rao <naveen.n.rao@xxxxxxxxxxxxxxxxxx>
>
> OK, but I don't like to just remove the limit (since it can cause
> memory shortage easily.)
> Can't we make it configurable? I don't mean Kconfig, but
> tracefs/options/kretprobe_maxactive, or kprobes's debugfs knob.
>
> Hmm, maybe debugfs/kprobes/kretprobe_maxactive will be better since
> it can limit both trace_kprobe and kprobes itself.

I don't think it is good to put a new tunable in debugfs -- we don't have any kprobes tunable there, so this adds a dependency on debugfs which shouldn't be necessary.

/proc/sys/debug/ may be a better fit since we already have the kprobes-optimization flag there to disable optprobes, though I'm not sure if a new sysctl file is agreeable.


But, I'm not too sure this really is a problem. Maxactive is a user _opt-in_ feature which needs to be explicitly added to an event definition. In that sense, isn't this already a tunable?
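To make the opt-in concrete: maxactive is requested by putting a number right after the 'r' in the event definition, following the r[MAXACTIVE][:[GRP/]EVENT] form described in the kprobetrace documentation. Below is a minimal Python sketch of building such a definition. The helper name, the event name and the probed symbol are made up for illustration, and the 4096 check only mirrors the tracefs limit discussed in this thread.

```python
# Illustrative helper (not part of any existing tool): build a kretprobe
# definition string of the kind written to tracefs' kprobe_events file.
TRACEFS_MAXACTIVE_LIMIT = 4096  # limit enforced through tracefs, per this thread


def build_kretprobe_event(symbol, event, maxactive=None, fetchargs=()):
    """Return a definition such as 'r4096:myretprobe do_sys_open $retval'."""
    if maxactive is not None:
        if maxactive <= 0:
            raise ValueError("maxactive must be positive")
        if maxactive > TRACEFS_MAXACTIVE_LIMIT:
            raise ValueError("maxactive exceeds the tracefs limit of %d"
                             % TRACEFS_MAXACTIVE_LIMIT)
    # Omitting maxactive leaves the choice to the kernel's default.
    prefix = "r" if maxactive is None else "r%d" % maxactive
    parts = ["%s:%s" % (prefix, event), symbol]
    parts.extend(fetchargs)
    return " ".join(parts)


if __name__ == "__main__":
    print(build_kretprobe_event("do_sys_open", "myretprobe"))
    print(build_kretprobe_event("do_sys_open", "myretprobe", 4096, ["$retval"]))
```

A user who never writes the number never opts in, which is why the default value matters so much for pre-canned tools.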

Let me explain the background of the limitation.

Thanks for the background on this.

Maxactive currently has no limit in the kprobe kernel module API,
because the kernel module developer must take care of the max memory
usage (and they can).

But the tracefs user may NOT have enough information about what
happens if they pass something like 10M for maxactive (it will consume
around 500MB kernel memory for one kretprobe).

Ok, thinking more about this...

Right now, the only way for a user to notice that kretprobe maxactive is an issue is by looking at kprobe_profile. This is not even possible when using a bcc tool, which uses perf_event_open(). It took the reporting team some effort to work out that the weird results they were getting when tracing were caused by the default value used for kretprobe maxactive, and then that 4096 was the hard limit through tracefs.

So, IMO, anyone using any existing bcc tool, or a pre-canned perf script will not even be able to identify this as a problem to begin with... at least, not without some effort.

To address this, as a first step, we should probably consider parsing kprobe_profile and printing a warning with 'perf' if we detect a non-zero miss count for a probe -- both a regular probe, as well as a retprobe.

If we do this, the nice thing with kprobe_profile is that the probe miss count is available, and can serve as a good way to decide what a more reasonable maxactive value should be. This should help prevent users from trying with arbitrary maxactive values.
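As a rough illustration of the kind of check I have in mind, here is a minimal Python sketch that parses kprobe_profile and warns about probes with a non-zero miss count. It assumes the three-column layout (event name, hits, misses) described in the kprobetrace documentation and a tracefs mount at /sys/kernel/tracing; the function names are hypothetical and this is not how perf does or would implement it.

```python
import sys

# Assumed tracefs mount point; older setups expose the same file under
# /sys/kernel/debug/tracing/ instead.
KPROBE_PROFILE = "/sys/kernel/tracing/kprobe_profile"


def parse_kprobe_profile(text):
    """Parse kprobe_profile contents into (event, hits, misses) tuples.

    Each line is expected to hold three whitespace-separated columns:
    event name, hit count, miss count.
    """
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        name, hits, misses = fields
        try:
            entries.append((name, int(hits), int(misses)))
        except ValueError:
            continue  # skip lines whose counters are not integers
    return entries


def missed_probes(entries):
    """Return only the entries with a non-zero miss count."""
    return [entry for entry in entries if entry[2] > 0]


def warn_on_misses(text, out=sys.stderr):
    """Write one warning per probe that missed events; return how many did."""
    missed = missed_probes(parse_kprobe_profile(text))
    for name, hits, misses in missed:
        out.write("warning: probe %s missed %d events (%d hits)\n"
                  % (name, misses, hits))
    return len(missed)


if __name__ == "__main__":
    try:
        with open(KPROBE_PROFILE) as profile:
            warn_on_misses(profile.read())
    except OSError as err:
        sys.stderr.write("cannot read %s: %s\n" % (KPROBE_PROFILE, err))
```

The same check covers regular probes and retprobes, since both show up in the file with their own miss counter.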

For perf_event_open(), perhaps we can introduce an ioctl to query the probe miss count.

To avoid such trouble, I had set the 4096 limitation for the maxactive
parameter. Of course 4096 may not be enough for some use-cases. I'm open
to expanding it (e.g. to 32k, wouldn't that be enough?), but removing the
limitation may easily cause OOM trouble.

Do you have suggestions for how we can determine a better limit? As you point out in the other email, there could very well be 64k or more processes on a large machine. Since the primary concern is memory usage, we probably need to decide this based on total memory. But, memory usage will vary depending on system load...
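For a sense of scale, here is a back-of-the-envelope Python sketch. It assumes roughly 50 bytes per kretprobe instance, which is simply what the 10M -> ~500MB estimate quoted above implies; the real per-instance size depends on the structure layout and on the probe's data size, so treat the numbers as illustrative only. The function names are made up.

```python
# Per-instance cost implied by the estimate quoted in this thread:
# 10M instances ~= 500MB, i.e. roughly 50 bytes each. This is an
# assumption for illustration, not a measured sizeof().
ASSUMED_BYTES_PER_INSTANCE = 50


def kretprobe_memory_bytes(maxactive,
                           bytes_per_instance=ASSUMED_BYTES_PER_INSTANCE):
    """Approximate memory reserved for one kretprobe's instances."""
    return maxactive * bytes_per_instance


def max_maxactive_for_budget(budget_bytes,
                             bytes_per_instance=ASSUMED_BYTES_PER_INSTANCE):
    """Largest maxactive that fits within a given memory budget."""
    return budget_bytes // bytes_per_instance


if __name__ == "__main__":
    for maxactive in (4096, 32 * 1024, 64 * 1024, 10 * 1000 * 1000):
        kib = kretprobe_memory_bytes(maxactive) / 1024.0
        print("maxactive=%-9d ~%.0f KiB" % (maxactive, kib))
```

Under that assumption, 4096 instances come to about 200 KiB and 64k instances to a little over 3 MiB per kretprobe, so the concern is really about values several orders of magnitude beyond the process count.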

Perhaps we can start by making the maxactive limit a tunable with a default value of 4096, with the understanding that users will be careful when bumping up this value. Hopefully, scripts won't simply start writing into this file ;)

If we can feed back the probe miss count, tools should be able to guide users on what would be a reasonable maxactive value to use.
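One possible shape for such guidance, as a hedged Python sketch: the miss count only says how many events were dropped, not how many instances were needed at peak, so the policy below (keep the value if nothing was missed, otherwise double it, capped at the limit) is purely an assumption of mine and not something any existing tool does. The function name and the default limit of 4096 are likewise illustrative.

```python
def suggest_maxactive(current, misses, limit=4096):
    """Suggest the next maxactive to try, given the observed miss count.

    Hypothetical policy: keep the current value when nothing was missed,
    otherwise double it. The result never exceeds the configured limit.
    """
    if current <= 0 or limit <= 0:
        raise ValueError("current and limit must be positive")
    candidate = current if misses <= 0 else current * 2
    return min(candidate, limit)


if __name__ == "__main__":
    # A probe registered with maxactive=512 that missed events.
    print(suggest_maxactive(512, misses=37))
    # Same probe once the limit has been raised by the administrator.
    print(suggest_maxactive(4096, misses=37, limit=65536))
```

A tool could repeat this after each tracing run until the miss count stays at zero or the limit is reached, at which point it can tell the user that the limit itself is the bottleneck.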