Re: OSDL Bug 3770

From: Nick Piggin
Date: Sat Dec 18 2004 - 04:44:54 EST

Loic Domaigne wrote:

Hello Nick!
Hello NPTL Mailing List!

Hello Loic! Thanks for the interesting mail.

I'm CCing lkml and Ingo with this, because I wouldn't feel comfortable vetoing
this myself.

lkml: We're discussing the fact that on SMP machines, our realtime scheduling
policies are per-CPU only. This caused a problem where a high priority task on
one CPU caused all lower priority tasks on that CPU to be starved, while tasks
on another CPU with the same low priority were able to run.

Ah, the problem is that the driver thread has a higher priority than the
worker threads, so when the driver goes into an infinite loop waiting, the
workers on its CPU are starved. Workers on other CPUs are still able to be
scheduled, however.

Although POSIX legally permits such an implementation of the realtime policies on SMP machines, this implementation is clearly *NOT* REASONABLE.

Well, I haven't done much in the realtime area... but nobody has complained until now.

The reason is extremely simple: the application *CANNOT* necessarily know that it will get stuck behind a higher-priority thread (even though it could have run on another CPU if the scheduler had decided otherwise). That is *NOT* something you can program in a deterministic fashion in such a "realtime" environment.

You could use CPU binding. I'd argue that this may be nearly a requirement for
any realtime system of significant complexity on an SMP system.

*But*, notice that the program in question did not run fine on UP and randomly fail
on SMP; rather, it would not work on a single processor AT ALL.

"Realtime" is put in quotes. I am speaking here of soft realtime, that is, an environment in which task scheduling follows a specific deterministic order. I am not speaking about hard realtime, which has additional timing constraints. Following that definition, we can say that Linux offers (soft) "realtime".

> The driver really needs to sleep, use a mutex, use a lower priority,
> or something in order for it to work.

NO! It is not the responsibility of the application to fix that behaviour! We can in our case because 'we know', but some applications don't!!!

That's a bit hand-wavy ;) but I don't dismiss it out of hand because, as I said,
I'm not so familiar with this area. I would be interested in an example of some
application where this matters, and which absolutely can't use any synchronisation.

The mistake made here is interesting. When you have a pool of servers, you can proceed in two ways to serve the clients:

(1) make a FIFO queue for each server. When a client arrives, it
chooses the queue that is the shortest.

(2) make a single FIFO queue for all servers. All clients are
    queued, and when a server is done it takes the first client
    waiting in that big queue.

Queuing theory shows that (2) is better, exactly for the reason we have here. With (1), the clients in a queue can get stuck if the corresponding server is blocked by one client. With (2), a server blocked by one client doesn't prevent the other clients from being served by the other servers.

But that model is flawed for SMP scheduling. If it were that easy, we might have a
single queue for _all_ tasks.

The main problem is the cost of synchronisation and cacheline sharing. A secondary
problem is that of CPU affinities: moving a task to another CPU nearly always has
some non-zero cost in terms of cache (and, on NUMA, memory) efficiency.

Our old global-queue scheduler was basically crap for more than 4 CPUs. We could give
RT tasks a global queue with little impact on non-RT workloads (in fact, I think
early iterations of the 2.6 scheduler trialled this)... but let's not cripple the
RT apps that do the right thing (and need scalability).

Another problem is that scheduling may not be O(1) anymore, if you have CPU affinity
bindings in place.

To summarise, I believe that if per-CPU RT queues are allowed by POSIX, then we
want to go with the sanest possible implementation, and force any broken apps to
fix themselves... let's not cave in now :)


A historical note: the USA implemented (2) in offices, supermarkets and the like long before Europe, because in Europe customers were convinced that model (2) took more time, since the queue was longer.
