Re: [PATCH v3 1/6] ANDROID: binder: add support for RT prio inheritance.

From: Martijn Coenen
Date: Thu Nov 16 2017 - 04:18:13 EST

Thanks, Peter, for looking at this; more comments inline.

On Wed, Nov 15, 2017 at 2:01 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>> + * 1) binder supports a "minimum node priority", meaning that all transactions
>> + * into a node must run at this priority at a minimum. This means that the
>> + * desired priority for handling a transaction is not necessarily equal to
>> + * the priority of the caller.
> Isn't that the same as running everything in that node (whatever that
> is) at that given prio?

Not for synchronous transactions; in that case it's max(min_node_prio,
caller_prio). One reason min_node_prio exists is that for certain
critical nodes, we don't want calls into them to run at a very low
priority, because the implementation of those calls may take locks that
are contended. I realize one possible solution to that is to use PI
mutexes in userspace, but Android doesn't support that today.

>> + * 2) binder supports asynchronous transactions, where the caller is not blocked
>> + * on transaction completion; so, it also can't be blocked on an rt_mutex.
> This is not a theoretically sound thing to do... For a system to be
> real-time it not only needs to guarantee forward progress but also have
> bounded runtime.
> Elevating another task's priority to your own and then not blocking
> effectively injects time. At the very least this time should be taken
> into consideration when doing runtime analysis of the system.

In case of asynchronous transactions, we never inherit the priority of
the caller. We run at max(default_prio, min_node_prio). For the
default node configuration, this means that we don't change priority
at all for async transactions (they all run at nice 0). But for nodes
that have a minimum priority set, we want calls into the node to run
at that priority instead.

> Also, since it's async, why do we care? The original task is still making
> progress. Only at the point where we start to wait for completion does
> it make sense to boost. Otherwise you end up having to charge one task
> double time.

Yeah, so this is actually not about the caller; it's about certain
critical nodes that need to process events at a high priority. For
example, consider a node that receives sensor events using async
transactions. It's imperative that the threads handling these events
run at real-time priority from the moment they wake up. Binder's
thread model has a generic threadpool for the entire process; threads
aren't dedicated to specific nodes. So to avoid such a thread being
preempted, we need to raise its priority even before waking it up, and
the only place to do it is from the caller.

>> + * 3) similarly, there may not necessarily be a thread waiting for
>> + * transactions at the time the call is made, so we don't know who to proxy-
>> + * lock the lock for.
> Non-issue; you have the exact same problem now. Your interface takes a
> @task argument, so you have to have a @task either way around.

But in my current implementation once a thread does call in for more
work and finds the work pending, it will raise its own priority. I
don't see how this is possible with rt_mutex, because the thread is
not yet available at the moment the caller goes to sleep.

>> + * 4) binder supports nested transactions, where A can call into B, and B can
>> + * call back into A before returning a reply to the original transaction.
>> + * This means that if A is blocked on an rt_mutex B holds, B must first wake
>> + * up A to handle a new transaction, and only then can it proxy-lock and try
>> + * to acquire the new rt_mutex. This leaves a race condition where B
>> + * temporarily runs at its original priority.
> That doesn't sound like something a well formed (RT) program should do
> in the first place.

What do you mean specifically - nested transactions?

> You could play horrible games with lock ownership, but YUCK!!
>> + * 5) rt_mutex does not currently support PI for CFS tasks.
> Neither do you.. just inheriting the nice value is not correct for a WFQ
> style scheduler.

I don't think we're trying to make this fit perfectly with WFQ - we're
just trying to improve the latency on Android devices where it counts,
by telling the scheduler which threads are important. Without these
patches, we would often see binder threads receiving sensor events get
preempted in the kernel, and the delay would cause issues with things
like video stabilization, etc.

> What you're proposing is a bunch of make-work-ish hacks on a system that
> isn't designed for RT.

What parts of the system aren't designed for RT? I'm trying to
understand whether it's something in the way binder currently works -
e.g. the use of asynchronous transactions and generic threadpools - or
something in this patchset in particular. How should we deal with
high-priority sensor events that go over binder IPC, in your opinion?