Re: RFC for a new Scheduling policy/class in the Linux-kernel

From: Ted Baker
Date: Thu Jul 16 2009 - 19:13:43 EST


On Thu, Jul 16, 2009 at 09:17:09AM +0200, Henrik Austad wrote:

> My understanding of PEP is that when B executes through the
> A-proxy, B will consume parts of A's resources until the lock is
> freed. This makes sense when A and B run on different CPUs and
> B is moved (temporarily) to CPU#A. If B were to use its own
> budget when running here, once A resumes execution and exhausts
> its entire budget, you can have over-utilization on that CPU
> (and under-util on CPU#B).

To be sure we are using A and B the same way here:
B is holding a lock.
A wants that lock.
A grants its priority to B until B releases the lock.
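
That is, classic priority inheritance. A minimal sketch of the
mechanism in C (the names and the bigger-number-is-higher-priority
convention are mine, not the kernel's rt-mutex code):

    #include <stddef.h>

    struct task {
        int prio;       /* current (possibly inherited) priority */
        int base_prio;  /* the task's own priority */
    };

    struct pi_lock {
        struct task *owner;
    };

    /* A (the waiter) grants its priority to B (the lock owner). */
    static void pi_boost(struct pi_lock *lk, struct task *waiter)
    {
        if (lk->owner && lk->owner->prio < waiter->prio)
            lk->owner->prio = waiter->prio;
    }

    /* When B releases the lock, it drops back to its own priority. */
    static void pi_restore(struct pi_lock *lk)
    {
        lk->owner->prio = lk->owner->base_prio;
        lk->owner = NULL;
    }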

How to account for the usage does not seem to have a perfect
solution. That is, you can't get around the fact that either:

(1) B is using some of its time at the wrong priority (priority inversion)

or

(2) B is using some of A's time to do B's work (the right
priority, but the wrong task)

The right way to resolve this conflict seems to depend a lot on
where B runs, as well as whether you are managing budget per-CPU
(partitioned model) or managing it globally (free migration
model).

1) In a global scheduling model, it does not matter where B runs.
We want to charge B's critical section to B, since our
schedulability analysis is based on the processor usage of each
task. It would be broken if A could be charged random bits of
time for the execution of other tasks. So, we must charge B.

2) In a partitioned scheduling model, we worry about the
CPU utilization of each processor. We have several cases, depending
on where B runs when it runs with A's priority:

2.1) If B gets to run on A's processor, we have a problem
with processor overload unless we charge the usage to A. This
is still a bad thing, though, since we then need to over-provision
A to allow for this.

2.2) If B runs on its own processor, we want to use B's budget.

2.3) A third variation is to run all the critical sections for A
and B on a separate synchronization processor. In that case, the
synchronization processor needs its own budget, and in effect the
critical sections of tasks A and B become aperiodic tasks in
their own right.
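
Expressed as code, the charging choice might look like the sketch
below. The names and the enum are mine, purely to summarize the
cases; case 2.3, which gives the synchronization processor its own
budget, is omitted:

    enum sched_model { MODEL_GLOBAL, MODEL_PARTITIONED };

    struct task {
        int cpu;  /* the CPU whose budget this task draws on */
    };

    /*
     * Whose budget pays while B runs a critical section with A's
     * inherited priority?  "cpu" is where B actually executes.
     */
    static struct task *charge_target(enum sched_model model,
                                      struct task *a, struct task *b,
                                      int cpu)
    {
        if (model == MODEL_GLOBAL)
            return b;      /* case 1: per-task analysis */
        if (cpu == a->cpu)
            return a;      /* case 2.1: protect A's processor */
        return b;          /* case 2.2: B on its own processor */
    }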

> AFAIK, there is no such thing as preemption-overhead charging
> to a task's budget in the kernel today. This time simply
> vanishes and must be compensated for when running a task through
> the acceptance stage (say, only 95% util per CPU or some such).

I don't see that anything can or does "vanish". Consider this
sequence of events on one processor:

1. task A is running, and calls scheduler (synchronously or maybe
via IRQ, say from timer)

2. scheduler computes how much time A has used recently, and charges
it to A's budget

The overhead of the IRQ handler and scheduler is therefore
charged to A.

3. scheduler switches to B

4. B calls scheduler

5. scheduler computes how much time B has used recently,
and charges it to B's budget

The rest of the overhead of the context switch from A to B,
including cache misses and page faults immediately following
step 3, is now effectively charged to B.
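
In code, that accounting step (2 and 5 above) might look like the
following sketch, loosely patterned on what CFS's update_curr()
does. The names here are illustrative, not actual kernel interfaces:

    typedef unsigned long long u64;

    struct task {
        u64 budget_ns;    /* remaining budget in this period  */
        u64 last_ran_ns;  /* timestamp of the last accounting */
    };

    /*
     * Called on every scheduler entry, before switching away from
     * "curr".  Everything since the previous call -- IRQ handler,
     * scheduler, cache effects after the last switch -- lands on
     * whichever task was running, so no time vanishes.
     */
    static void charge_current(struct task *curr, u64 now_ns)
    {
        u64 delta = now_ns - curr->last_ran_ns;

        curr->budget_ns = (delta < curr->budget_ns) ?
                          curr->budget_ns - delta : 0;
        curr->last_ran_ns = now_ns;
    }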

> > Back to the original question of who should be charged for
> > the actual critical section.

> That depends on where you want to run the tasks. If you want to
> migrate B to CPU#A, A should be charged. If you run B on CPU#B,
> then B should be charged (for the exact same reasoning A should
> be charged in the first case).

As I mentioned above, it also depends on whether you are using
a partitioned or global scheduling policy for your analysis.

> The beauty of PEP is that enabling B to run is very easy. In
> the case where B runs on CPU#B, B must be updated statically so
> that the scheduler will trigger on the new priority. In PEP,
> this is done automatically when A is picked. One solution to
> this would be to migrate A to CPU#B and insert A into the
> runqueue there. However, then you add more overhead by moving
> the task around instead of just 'borrowing' the task_struct.
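
As I understand that 'borrowing', the pick path would look roughly
like the sketch below (invented names and helpers; the real patch
will differ in the details):

    struct pi_lock;

    struct task {
        struct pi_lock *blocked_on;  /* lock we wait for, or NULL */
    };

    struct pi_lock {
        struct task *owner;
    };

    struct runqueue;

    /* Assumed helper: the usual highest-priority choice. */
    struct task *highest_prio_task(struct runqueue *rq);

    /*
     * Proxy execution: pick A as usual; if A is blocked on a lock,
     * run the lock owner (B) in A's place, following the chain if
     * B is itself blocked.  B's task_struct is borrowed through
     * A's place in the runqueue; B is not migrated.
     */
    static struct task *pick_next(struct runqueue *rq)
    {
        struct task *t = highest_prio_task(rq);

        while (t && t->blocked_on)
            t = t->blocked_on->owner;

        return t;
    }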

> > From the schedulability analysis point of view, B is getting
> > higher priority time than it normally would be allowed to execute,
> > potentially causing priority inversion (a.k.a. "interference" or
> > "blocking") to a higher priority task D (which does not even share
> > a need for the lock that B is holding) that would otherwise run on
> > the same processor as B. Without priority inheritance this kind
> of interference would not happen. So, we are benefiting A at the
> > expense of D. In the analysis, we can either allow for all such
> > interference in a "blocking term" in the analysis for D, or we
> > might call it "preemption" in the analysis of D and charge it to A
> > (if A has higher priority than D). Is the latter any better?

> If D has higher priority than A, then neither A nor B (with the
> locks held) should be allowed to run before D.

Yes, but I only meant D has higher priority than B. A may have
higher priority than D, but with multiple processors we can't just
focus on the one highest-priority task. We need to look at the
(number of CPUs) highest-priority tasks, or we may end
up with our system behaving as if it had only one processor.

This is a key difference with multiprocessor scheduling,
and also with processor "share" scheduling.

Expediting the highest-priority task is *not* always the
best strategy. Assuming that each task has some amount of slack,
there is no great benefit in completing a high-priority task
early, if that means causing other tasks to miss their deadlines.

That is, if by delaying the highest-priority task a little bit
on one processor you can keep more processors busy,
you may be able to successfully schedule a set of tasks that
could not be scheduled to meet deadlines otherwise.

(Here is a small example of how this happens, in case the point
is not clear without one.)
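
Take two CPUs and three tasks, all with deadlines at t=2: X and Y
each need 1 unit of CPU time, and Z needs 2 units. If X and Y have
the highest priorities and take both CPUs at t=0, Z cannot start
until t=1 and gets only 1 of its 2 units by its deadline. Delay Y
a little instead: run Z on CPU#1 over [0,2), X on CPU#2 over [0,1),
then Y on CPU#2 over [1,2), and all three deadlines are met.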

> Yes, no task should be allowed to run more than the budget, but
> that requires B to execute *only* on CPU#B.

As mentioned above, that depends on whether you are
applying a global or partitioned model. With a global
scheduling model the budget is not per-CPU.

> On the other hand, one could say that if you run PEP and B is
> executed on CPU#A, and A then exhausts its budget, you could
> blame A as well, as lock-contention is a common problem and it's
> not only the kernel's fault. Do we need perfect or best-effort
> lock-resolving?

Yes. For application locks, with application-managed partitioned
scheduling, you can blame the application designer for building
in cross-CPU contention.

For kernel (hidden from the application) locks, it is a harder
problem.

I don't think there is a perfect solution for the partitioned
scheduling model. As mentioned above, you get to choose between B
causing short-term priority inversion for other tasks on B's
processor (but not using more than its budget on the average), or
B causing budget over-run for A on A's processor.

Ted