Re: [PATCH v13 3/3] binder: add transaction latency tracer
From: Greg Kroah-Hartman
Date: Fri Nov 13 2020 - 10:44:38 EST
On Wed, Nov 11, 2020 at 11:59:05PM +0800, Frankie Chang wrote:
> On Wed, 2020-11-11 at 16:12 +0100, Greg Kroah-Hartman wrote:
> > On Wed, Nov 11, 2020 at 11:03:06PM +0800, Frankie Chang wrote:
> > > On Wed, 2020-11-11 at 08:34 +0100, Greg Kroah-Hartman wrote:
> > > > > - The reason for printing the related information to the
> > > > > kernel log rather than the trace buffer is that some
> > > > > abnormal transactions may have been pending since long ago,
> > > > > so they could not be recorded because of the limited buffer
> > > > > size.
> > > >
> > > > Don't abuse the kernel information log for stuff that is just normal
> > > > operations. What is wrong with using the trace buffers here? That's
> > > > what they are designed for from what I can tell.
> > > >
> > > As mentioned before, the limited recording window is the reason why
> > > we don't just use trace here.
> >
> > What limitation?
> >
> > > In some long-running stability tests, such as MTBF,
> >
> > What is "MTBF"?
> >
> Mean time between failures (MTBF) is the predicted elapsed time between
> inherent failures of a mechanical or electronic system, during normal
> system operation.
>
> And we use an MTBF script to run long stress tests to
> make sure there is no problem with our product's stability.
Ok, great.
> > > the exception is caused by a series of interacting transactions.
> > > Some abnormal transactions may have been pending since long ago, so
> > > they could not be recorded because of the limited buffer size.
> >
> > How long of a time is this? If they are pending, only when the timeout
> > happens is the trace logged, right?
> >
> > Again, please do not abuse the kernel log for this, that is not what it
> > is for.
> >
> Hmm.. Do you mean we should print the timeout log to the trace buffer
> instead of the kernel log?
Yes.
> The reason we don't do this is that we would need to enable these trace
> events and enable tracing every time before testing. But our testing
> script can lead to a device reboot, and then it continues testing after
> the reboot. The reboot would disable these trace events, and we could
> not get the timeout logs which happen after the reboot.
When you reboot you enable tracing, that's not an issue, and then you
will see the messages. Nothing is going to work across reboots.
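
For reference, here is a minimal sketch (hypothetical event name and
fields, and leaving out the usual trace header boilerplate) of reporting
the timeout through a tracepoint so that it lands in the trace buffer
instead of the kernel log:

	#undef TRACE_SYSTEM
	#define TRACE_SYSTEM binder

	#include <linux/tracepoint.h>

	/*
	 * Hypothetical event: report a transaction that exceeded the
	 * latency threshold via the trace buffer instead of printk().
	 */
	TRACE_EVENT(binder_txn_latency_timeout,
		TP_PROTO(int from_pid, int to_pid, u32 elapsed_ms),
		TP_ARGS(from_pid, to_pid, elapsed_ms),
		TP_STRUCT__entry(
			__field(int, from_pid)
			__field(int, to_pid)
			__field(u32, elapsed_ms)
		),
		TP_fast_assign(
			__entry->from_pid = from_pid;
			__entry->to_pid = to_pid;
			__entry->elapsed_ms = elapsed_ms;
		),
		TP_printk("transaction from %d to %d took %u ms",
			  __entry->from_pid, __entry->to_pid,
			  __entry->elapsed_ms)
	);

The driver would call trace_binder_txn_latency_timeout() where it prints
today, and the event can be turned back on right after a reboot with the
trace_event= kernel command-line parameter (for example
trace_event=binder:binder_txn_latency_timeout) or by writing to
/sys/kernel/tracing/events/binder/enable, so a mid-test reboot does not
lose the later timeouts.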
> > > > > +config BINDER_TRANSACTION_LATENCY_TRACKING
> > > > > + tristate "Android Binder transaction tracking"
> > > > > + help
> > > > > + Used to track abnormal binder transactions that exceed a latency
> > > > > + threshold. When a transaction is completed or freed, it is checked
> > > > > + for whether it ran over the threshold.
> > > > > + If so, the detailed info is printed out.
> > > >
> > > > Why is this a separate module? Who will ever want this split out?
> > > >
> > > The reason we split this out as a separate module is that we adopted
> > > the recommendations discussed previously in PATCH v1.
> > >
> > > This way all of this tracing code is in-kernel but outside of binder.c.
> >
> > Putting it in a single file is fine, but what is the benefit of doing it
> > in a separate file? Doesn't it waste more codespace this way?
> >
> Yeah, but I think a separate file may be more manageable.
Sorry, I mean, "why put this in a separate kernel module", not file.
File is fine.
> > > > > +/*
> > > > > + * The reason for setting binder_txn_latency_threshold to 2 sec
> > > > > + * is that most timeout aborts are greater than or equal to 2 sec.
> > > > > + * Making it configurable lets all users determine which
> > > > > + * threshold is more suitable.
> > > > > + */
> > > > > +static uint32_t binder_txn_latency_threshold = 2;
> > > > > +module_param_named(threshold, binder_txn_latency_threshold,
> > > > > + uint, 0644);
> > > >
> > > > Again, this isn't the 1990's, please do not add module parameters if at
> > > > all possible.
> > > >
> > >
> > > Is there a recommended method here?
> > > We followed the approach in binder.c, so we did not know that this
> > > method was unsuitable.
> >
> > Look at the individual binder instances. That is what trace should be
> > on/off for, not for all binder instances in the system at the same time.
> >
> But our testing script is not for specific binder instances; it includes
> several test cases and exercises several processes.
Then turn it on for all binder instances.
> So the reason we trace all binder instances is that we cannot
> predict which process will hit a timed-out transaction.
That's fine, but provide the ability to do this on a per-instance basis,
as that is what binder now supports. Making it "global" is a big
regression from the way binder was changed recently to support different
instances.
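
As a rough sketch of what that could look like, assuming the latency
code gets a hook when each binder device/instance is set up (the struct
and function names below are hypothetical; only debugfs_create_u32() is
the real API):

	#include <linux/debugfs.h>
	#include <linux/types.h>

	/* Hypothetical per-instance state kept by the latency tracker. */
	struct binder_latency_instance {
		u32 threshold_sec;	/* report transactions slower than this */
		struct dentry *debugfs_dir;
	};

	/*
	 * Called when a binder device/instance is registered: expose the
	 * threshold under that instance's debugfs directory, so each
	 * binderfs instance can be tuned (or left alone) independently
	 * instead of using one global module parameter.
	 */
	static void binder_latency_init_instance(struct binder_latency_instance *inst,
						 struct dentry *instance_dir)
	{
		inst->threshold_sec = 2;	/* default from the patch */
		inst->debugfs_dir = instance_dir;
		debugfs_create_u32("txn_latency_threshold", 0644,
				   instance_dir, &inst->threshold_sec);
	}

Whether this ends up in debugfs or as a binderfs feature file is a
separate question; the point is that the threshold and the on/off
control attach to the instance, not to the module.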
thanks,
greg k-h