Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

From: Greg KH
Date: Sat Sep 10 2016 - 07:18:49 EST


On Fri, Sep 09, 2016 at 10:39:32AM -0700, Todd Kjos wrote:
> On Fri, Sep 9, 2016 at 8:44 AM, Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> > On Fri, Sep 09, 2016 at 08:17:44AM -0700, Todd Kjos wrote:
> >> From: Todd Kjos <tkjos@xxxxxxxxxxx>
> >>
> >> In Android systems, the display pipeline relies on low
> >> latency binder transactions and is therefore sensitive to
> >> delays caused by contention for the global binder lock.
> >> Jank is significantly reduced by disabling preemption
> >> while the global binder lock is held.
> >
> > What is the technical definition of "Jank"? :)
>
> I'll rephrase in the next version to "dropped or delayed frames".

Heh, thanks :)

Also in the next version can you fix the errors found by the 0-day build
bot?

> >> This patch was originated by Riley Andrews <riandrews@xxxxxxxxxxx>
> >> with tweaks and forward-porting by me.
> >>
> >> Originally-from: Riley Andrews <riandrews@xxxxxxxxxxx>
> >> Signed-off-by: Todd Kjos <tkjos@xxxxxxxxxxx>
> >> ---
> >> drivers/android/binder.c | 194 +++++++++++++++++++++++++++++++++++------------
> >> 1 file changed, 146 insertions(+), 48 deletions(-)
> >>
> >> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
> >> index 16288e7..c36e420 100644
> >> --- a/drivers/android/binder.c
> >> +++ b/drivers/android/binder.c
> >> @@ -379,6 +379,7 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
> >> struct files_struct *files = proc->files;
> >> unsigned long rlim_cur;
> >> unsigned long irqs;
> >> + int ret;
> >>
> >> if (files == NULL)
> >> return -ESRCH;
> >> @@ -389,7 +390,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
> >> rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE);
> >> unlock_task_sighand(proc->tsk, &irqs);
> >>
> >> - return __alloc_fd(files, 0, rlim_cur, flags);
> >> + preempt_enable_no_resched();
> >> + ret = __alloc_fd(files, 0, rlim_cur, flags);
> >> + preempt_disable();
> >> +
> >> + return ret;
> >> }
> >>
> >> /*
> >> @@ -398,8 +403,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
> >> static void task_fd_install(
> >> struct binder_proc *proc, unsigned int fd, struct file *file)
> >> {
> >> - if (proc->files)
> >> + if (proc->files) {
> >> + preempt_enable_no_resched();
> >> __fd_install(proc->files, fd, file);
> >> + preempt_disable();
> >> + }
> >> }
> >>
> >> /*
> >> @@ -427,6 +435,7 @@ static inline void binder_lock(const char *tag)
> >> {
> >> trace_binder_lock(tag);
> >> mutex_lock(&binder_main_lock);
> >> + preempt_disable();
> >> trace_binder_locked(tag);
> >> }
> >>
> >> @@ -434,8 +443,65 @@ static inline void binder_unlock(const char *tag)
> >> {
> >> trace_binder_unlock(tag);
> >> mutex_unlock(&binder_main_lock);
> >> + preempt_enable();
> >> +}
> >> +
> >> +static inline void *kzalloc_nopreempt(size_t size)
> >> +{
> >> + void *ptr;
> >> +
> >> + ptr = kzalloc(size, GFP_NOWAIT);
> >> + if (ptr)
> >> + return ptr;
> >> +
> >> + preempt_enable_no_resched();
> >> + ptr = kzalloc(size, GFP_KERNEL);
> >> + preempt_disable();
> >
> > Doesn't the allocator retry if the first one fails anyway? Why not
> > GFP_NOIO or GFP_ATOMIC? Have you really hit the second GFP_KERNEL
> > usage?
>
> I suspect we have hit the second, since we do get into cases where
> direct reclaim is needed. I can't confirm since I haven't instrumented
> this case. As you say, if we use GFP_ATOMIC instead, maybe we
> wouldn't, but even then I'd be concerned that we could deplete the
> memory reserved for atomic. The general idea of trying for a fast,
> nowait allocation and then enabling preempt for the rare potentially
> blocking allocation seems reasonable, doesn't it?

Yes it is, so much so that I think there's a generic kernel function for
it already. Adding in the linux-mm mailing list to be told that I'm
wrong about this :)

> >> + return ptr;
> >> +}
> >> +
> >> +static inline long copy_to_user_nopreempt(void __user *to,
> >> + const void *from, long n)
> >> +{
> >> + long ret;
> >> +
> >> + preempt_enable_no_resched();
> >> + ret = copy_to_user(to, from, n);
> >> + preempt_disable();
> >> + return ret;
> >> +}
> >> +
> >> +static inline long copy_from_user_nopreempt(void *to,
> >> + const void __user *from,
> >> + long n)
> >> +{
> >> + long ret;
> >> +
> >> + preempt_enable_no_resched();
> >> + ret = copy_from_user(to, from, n);
> >> + preempt_disable();
> >> + return ret;
> >> }
> >>
> >> +#define get_user_nopreempt(x, ptr) \
> >> +({ \
> >> + int __ret; \
> >> + preempt_enable_no_resched(); \
> >> + __ret = get_user(x, ptr); \
> >> + preempt_disable(); \
> >> + __ret; \
> >> +})
> >> +
> >> +#define put_user_nopreempt(x, ptr) \
> >> +({ \
> >> + int __ret; \
> >> + preempt_enable_no_resched(); \
> >> + __ret = put_user(x, ptr); \
> >> + preempt_disable(); \
> >> + __ret; \
> >> +})
> >
> > Any reason some of these are #defines and some are static inline
> > functions?
>
> Not that I know of. I'll change to static inline for the next version.
>
> >
> > Anyway, these all seem a bit strange to me, what type of latency spikes
> > are you seeing that these changes resolve? Shouldn't that be an issue
> > with the scheduler more than just the binder driver?
>
> Well, it certainly is partially a scheduler issue in the sense that we
> are using SCHED_NORMAL for latency sensitive tasks that comprise the
> display pipeline (surfaceflinger, UI threads, Render threads). This
> means that a task being preempted might have to wait for a while
> before getting scheduled again. It would be nice to use SCHED_FIFO for
> these, but at the moment we can't do that since that opens us up to
> user code DOS'ing cores. There are some ideas being investigated here,
> but that's a ways off.
>
> It is also partially a binder issue since there is a single global
> mutex that is acquired for any binder transaction. There are periods
> where the IPC rate is very high. Most of those binder calls are not so
> latency sensitive that a little contention for the binder lock
> matters, but the binder calls between RenderThreads and Surfaceflinger
> are sensitive since there are frame deadlines that need to be met. The
> more time spent holding the binder lock, the higher probability for
> contention and resulting delays which can cause us to miss the
> deadline for the frame.
>
> Without this change, it is common for the task holding the mutex be
> preempted, creating a backlog of tasks waiting on the mutex. We
> measured the lock being held for over 400us regularly in these cases,
> greatly increasing the likelyhood of contention for the mutex. Since
> there a several binder calls in the pipeline as part of drawing a
> frame, this was observed to induce more than 10ms of frame delay and
> sometimes as high as 20ms which translates directly to missed frames.
> This change fixed that issue.

This single-mutex problem has always bothered me about binder, and it's
just going to get worse with the upcoming changes in the Android systems
to rely more on binder.

So perhaps that should be fixed instead of messing around with
scheduling issues? Messing with the scheduler is just going to cause
more and more problems over time, and make it harder to have to "tune" a
system for good "jank" given the horrid hacks the vendors already do to
the kernel scheduler that aren't upstream :(

Can we break the mutex up a bit? That seems like the best solution in
the long run (becides moving to a bus1-based-solution which is probably
much longer term...)

> > I don't know of any other driver or IPC that does this type of thing
> > with the scheduler in order to make things "go faster", so it feels
> > wrong to me, and is probably why we don't have global functions like
> > put_user_nopreempt() :)
>
> I suspect that for most folks, "go faster" means "increase
> throughput". In this case "go faster" means "decrease frame render
> times" (or, more precisely "meet frame deadlines"). Maybe binder is
> special.

Yes, binder is unique and special, just like everyone else :)

"meet frame deadlines" is a very good and specific measurement, so you
are doing well there, but messing with the preempt logic to try to
achieve this seems like an odd solution when the lock seems like a
better long-term issue to be solved.

> > And is enabling and disabling preemption around single byte copies
> > to/from userspace really a good idea? That seems like a lot of overhead
> > you are now adding to your "fastpath" that you need to go even faster.
>
> Don't the user copy functions assume that preemption is enabled?

Yes. Well, usually, but not always, they are messy :)

> If it is safe to avoid the toggling here, then I'd agree that it isn't
> a "good idea". The overhead of this toggle is in the noise compared a
> task being preempted while holding the binder lock.

Yes, but you are doing this just for single character read/writes, the
set-up/tear-down for doing that seems huge, shouldn't binder look to
make those be a bit more "batched"? That might improve the throughput
as well, as it would reduce the amount of time the global mutex was
held.

thanks,

greg k-h