Re: [PATCH v13 3/3] binder: add transaction latency tracer

From: Frankie Chang
Date: Wed Nov 11 2020 - 10:59:14 EST


On Wed, 2020-11-11 at 16:12 +0100, Greg Kroah-Hartman wrote:
> On Wed, Nov 11, 2020 at 11:03:06PM +0800, Frankie Chang wrote:
> > On Wed, 2020-11-11 at 08:34 +0100, Greg Kroah-Hartman wrote:
> > > > - The reason why printing the related information to
> > > > kernel information log but not trace buffer is that
> > > > some abnormal transactions may be pending for a long
> > > > time ago, they could not be recorded due to buffer
> > > > limited.
> > >
> > > Don't abuse the kernel information log for stuff that is just normal
> > > operations. What is wrong with using the trace buffers here? That's
> > > what they are designed for from what I can tell.
> > >
> > As mentioned before, time limitation of recording is the reason why we
> > don't just use trace here.
>
> What limitation?
>
> > In some long time stability test, such as MTBF,
>
> What is "MTBF"?
>
Mean time between failures (MTBF) is the predicted elapsed time between
inherent failures of a mechanical or electronic system, during normal
system operation.

And we use MTBF script to run long time stress test to
make sure our product stability is no problem.

> > the exception is caused by a series of transactions interaction.
> > Some abnormal transactions may be pending for a long time ago, they
> > could not be recorded due to buffer limited.
>
> How long of a time is this? If they are pending, only when the timeout
> happens is the trace logged, right?
>
> Again, please do not abuse the kernel log for this, that is not what it
> is for.
>
Hmm..Do you mean that make timeout log print to trace buffer but not
kernel log?

The reason We don't do this is that we need to enable these trace events
and enable trace everytimes before testing. But our testing script could
lead to device reboot, and then it continue testing after reboot. The
reboot would make these trace events disable, and we cannot get the
timeout log which happen after reboot.

> > > > +config BINDER_TRANSACTION_LATENCY_TRACKING
> > > > + tristate "Android Binder transaction tracking"
> > > > + help
> > > > + Used for track abnormal binder transaction which is over threshold,
> > > > + when the transaction is done or be free, this transaction would be
> > > > + checked whether it executed overtime.
> > > > + If yes, printing out the detailed info.
> > >
> > > Why is this a separate module? Who will ever want this split out?
> > >
> > The reason we split out a separate module is that we adopted the
> > previously discussed recommendations in PATCH v1.
> >
> > This way all of this tracing code is in-kernel but outside of binder.c.
>
> Putting it in a single file is fine, but what does this benifit doing it
> in a separate file? Doesn't it waste more codespace this way?
>
Yeah, but I think separate file may be more manageable.

> > > > +/*
> > > > + * The reason setting the binder_txn_latency_threshold to 2 sec
> > > > + * is that most of timeout abort is greater or equal to 2 sec.
> > > > + * Making it configurable to let all users determine which
> > > > + * threshold is more suitable.
> > > > + */
> > > > +static uint32_t binder_txn_latency_threshold = 2;
> > > > +module_param_named(threshold, binder_txn_latency_threshold,
> > > > + uint, 0644);
> > >
> > > Again, this isn't the 1990's, please do not add module parameters if at
> > > all possible.
> > >
> >
> > Is any recommended method here?
> > Because we refer to the method in binder.c, we don't know if this method
> > is not suitable.
>
> Look at the individual binder instances. That is what trace should be
> on/off for, not for all binder instances in the system at the same time.
>
But our testing script is not for specific binder instances, it includes
several testing and tests for several processes.

So the reason we trace for all binder instances is that we cannot
predict which process would happen timeout transaction.

Many thanks
Frankie Chang