Re: [PATCH] android: binder: Disable preemption while holding the global binder lock

From: Todd Kjos
Date: Fri Sep 09 2016 - 13:39:40 EST


On Fri, Sep 9, 2016 at 8:44 AM, Greg KH <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Sep 09, 2016 at 08:17:44AM -0700, Todd Kjos wrote:
>> From: Todd Kjos <tkjos@xxxxxxxxxxx>
>>
>> In Android systems, the display pipeline relies on low
>> latency binder transactions and is therefore sensitive to
>> delays caused by contention for the global binder lock.
>> Jank is significantly reduced by disabling preemption
>> while the global binder lock is held.
>
> What is the technical definition of "Jank"? :)

I'll rephrase in the next version to "dropped or delayed frames".

>
>>
>> This patch was originated by Riley Andrews <riandrews@xxxxxxxxxxx>
>> with tweaks and forward-porting by me.
>>
>> Originally-from: Riley Andrews <riandrews@xxxxxxxxxxx>
>> Signed-off-by: Todd Kjos <tkjos@xxxxxxxxxxx>
>> ---
>> drivers/android/binder.c | 194 +++++++++++++++++++++++++++++++++++------------
>> 1 file changed, 146 insertions(+), 48 deletions(-)
>>
>> diff --git a/drivers/android/binder.c b/drivers/android/binder.c
>> index 16288e7..c36e420 100644
>> --- a/drivers/android/binder.c
>> +++ b/drivers/android/binder.c
>> @@ -379,6 +379,7 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
>> struct files_struct *files = proc->files;
>> unsigned long rlim_cur;
>> unsigned long irqs;
>> + int ret;
>>
>> if (files == NULL)
>> return -ESRCH;
>> @@ -389,7 +390,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
>> rlim_cur = task_rlimit(proc->tsk, RLIMIT_NOFILE);
>> unlock_task_sighand(proc->tsk, &irqs);
>>
>> - return __alloc_fd(files, 0, rlim_cur, flags);
>> + preempt_enable_no_resched();
>> + ret = __alloc_fd(files, 0, rlim_cur, flags);
>> + preempt_disable();
>> +
>> + return ret;
>> }
>>
>> /*
>> @@ -398,8 +403,11 @@ static int task_get_unused_fd_flags(struct binder_proc *proc, int flags)
>> static void task_fd_install(
>> struct binder_proc *proc, unsigned int fd, struct file *file)
>> {
>> - if (proc->files)
>> + if (proc->files) {
>> + preempt_enable_no_resched();
>> __fd_install(proc->files, fd, file);
>> + preempt_disable();
>> + }
>> }
>>
>> /*
>> @@ -427,6 +435,7 @@ static inline void binder_lock(const char *tag)
>> {
>> trace_binder_lock(tag);
>> mutex_lock(&binder_main_lock);
>> + preempt_disable();
>> trace_binder_locked(tag);
>> }
>>
>> @@ -434,8 +443,65 @@ static inline void binder_unlock(const char *tag)
>> {
>> trace_binder_unlock(tag);
>> mutex_unlock(&binder_main_lock);
>> + preempt_enable();
>> +}
>> +
>> +static inline void *kzalloc_nopreempt(size_t size)
>> +{
>> + void *ptr;
>> +
>> + ptr = kzalloc(size, GFP_NOWAIT);
>> + if (ptr)
>> + return ptr;
>> +
>> + preempt_enable_no_resched();
>> + ptr = kzalloc(size, GFP_KERNEL);
>> + preempt_disable();
>
> Doesn't the allocator retry if the first one fails anyway? Why not
> GFP_NOIO or GFP_ATOMIC? Have you really hit the second GFP_KERNEL
> usage?

I suspect we have hit the second, since we do get into cases where
direct reclaim is needed. I can't confirm since I haven't instrumented
this case. As you say, if we use GFP_ATOMIC instead, maybe we
wouldn't, but even then I'd be concerned that we could deplete the
memory reserved for atomic. The general idea of trying for a fast,
nowait allocation and then enabling preempt for the rare potentially
blocking allocation seems reasonable, doesn't it?

>> +
>> + return ptr;
>> +}
>> +
>> +static inline long copy_to_user_nopreempt(void __user *to,
>> + const void *from, long n)
>> +{
>> + long ret;
>> +
>> + preempt_enable_no_resched();
>> + ret = copy_to_user(to, from, n);
>> + preempt_disable();
>> + return ret;
>> +}
>> +
>> +static inline long copy_from_user_nopreempt(void *to,
>> + const void __user *from,
>> + long n)
>> +{
>> + long ret;
>> +
>> + preempt_enable_no_resched();
>> + ret = copy_from_user(to, from, n);
>> + preempt_disable();
>> + return ret;
>> }
>>
>> +#define get_user_nopreempt(x, ptr) \
>> +({ \
>> + int __ret; \
>> + preempt_enable_no_resched(); \
>> + __ret = get_user(x, ptr); \
>> + preempt_disable(); \
>> + __ret; \
>> +})
>> +
>> +#define put_user_nopreempt(x, ptr) \
>> +({ \
>> + int __ret; \
>> + preempt_enable_no_resched(); \
>> + __ret = put_user(x, ptr); \
>> + preempt_disable(); \
>> + __ret; \
>> +})
>
> Any reason some of these are #defines and some are static inline
> functions?

Not that I know of. I'll change to static inline for the next version.

>
> Anyway, these all seem a bit strange to me, what type of latency spikes
> are you seeing that these changes resolve? Shouldn't that be an issue
> with the scheduler more than just the binder driver?

Well, it certainly is partially a scheduler issue in the sense that we
are using SCHED_NORMAL for latency sensitive tasks that comprise the
display pipeline (surfaceflinger, UI threads, Render threads). This
means that a task being preempted might have to wait for a while
before getting scheduled again. It would be nice to use SCHED_FIFO for
these, but at the moment we can't do that since that opens us up to
user code DOS'ing cores. There are some ideas being investigated here,
but that's a ways off.

It is also partially a binder issue since there is a single global
mutex that is acquired for any binder transaction. There are periods
where the IPC rate is very high. Most of those binder calls are not so
latency sensitive that a little contention for the binder lock
matters, but the binder calls between RenderThreads and Surfaceflinger
are sensitive since there are frame deadlines that need to be met. The
more time spent holding the binder lock, the higher probability for
contention and resulting delays which can cause us to miss the
deadline for the frame.

Without this change, it is common for the task holding the mutex be
preempted, creating a backlog of tasks waiting on the mutex. We
measured the lock being held for over 400us regularly in these cases,
greatly increasing the likelyhood of contention for the mutex. Since
there a several binder calls in the pipeline as part of drawing a
frame, this was observed to induce more than 10ms of frame delay and
sometimes as high as 20ms which translates directly to missed frames.
This change fixed that issue.

>
> I don't know of any other driver or IPC that does this type of thing
> with the scheduler in order to make things "go faster", so it feels
> wrong to me, and is probably why we don't have global functions like
> put_user_nopreempt() :)

I suspect that for most folks, "go faster" means "increase
throughput". In this case "go faster" means "decrease frame render
times" (or, more precisely "meet frame deadlines"). Maybe binder is
special.

>
> And is enabling and disabling preemption around single byte copies
> to/from userspace really a good idea? That seems like a lot of overhead
> you are now adding to your "fastpath" that you need to go even faster.

Don't the user copy functions assume that preemption is enabled? If it
is safe to avoid the toggling here, then I'd agree that it isn't a
"good idea". The overhead of this toggle is in the noise compared a
task being preempted while holding the binder lock.

>
> And finally, I'm guessing this has passed the binder test suite that is
> out there for testing binder changes? If so, can you please mention it
> in the changelog text?

Yes, it did. Will mention in the next version.

>
> thanks,
>
> greg k-h

-Todd