Re: [patch, 2.6.11-rc2] sched: RLIMIT_RT_CPU_RATIO feature

From: Jack O'Quin
Date: Wed Jan 26 2005 - 19:47:33 EST


Ingo Molnar <mingo@xxxxxxx> writes:

> - exported the current RT-average value to /proc/stat (it's the last
> field in the cpu lines)
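
(For reference, this is roughly how a userland monitor could pick up
that number.  A minimal sketch, assuming the patch really does append
the RT average as the last column of the aggregate "cpu" line; on an
unpatched kernel it just prints whatever counter happens to be last:)

    /* Sketch only: print the last field of the "cpu" line in
     * /proc/stat, which the proposed patch uses for the RT average. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/stat", "r");

        if (!f)
            return 1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "cpu ", 4) == 0) {
                char *last = strrchr(line, ' ');
                if (last)
                    printf("RT average field: %s", last + 1);
                break;
            }
        }
        fclose(f);
        return 0;
    }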

> e.g. the issue Con and others raised: privileged tasks. By default, the
> root user will likely have a 0 rlimit, for compatibility. _But_ i can
> easily imagine users wanting to put in a safety limit even for
> root-owned RT tasks by making the rlimit 98% or so. Nonprivileged users
> would have this rlimit at say 20% in a typical desktop distribution.

That seems rather small. CPU starvation is not generally much of a
problem on desktop machines. If that (single) user wants to eat up
70% or 80% of the CPU, that's not likely to be a problem. Mac OS X
allows 90% on its desktop systems.

>> So, what does it mean for a task to say "I'm in the 60% range"? That
>> it individually will never use more than 60% of any one CPU? Or, that
>> it and several other associated tasks will never use more than that?
>
> it means that it will only be allowed to run if the CPU utilization of
> all RT tasks (eligible to run) does not exceed 60%.

But how would people use it?

There are many RT application scenarios I don't know about, and
probably others I'm forgetting at the moment. But I do have at
least one user-oriented view of the problem. Perhaps others will
mention additional scenarios based on their own experience.

My initial reaction is that the priority-based SCHED_FIFO and
SCHED_RR scheduling we're providing (i.e. POSIX realtime) really only
supports one set of at least loosely cooperating RT threads. For
multiple, non-cooperating RT subsystems, one needs something more like
a deadline scheduler. That would be a nice project for someone to
work on, but we are definitely *not* asking for it here.

Priority-based RT thread cooperation generally involves a "realtime
pyramid". A thread with a short trigger latency needs to run at high
priority, near the top of the pyramid. It must do its job quickly to
avoid delaying the lower-priority threads. So, high priority
generally imposes a tight pathlength restriction. Threads at the base
of the pyramid can run longer, but have longer trigger latencies due
to the number and worst-case pathlengths of the layers above them.
You can even think of IRQ handlers as occupying "special" hardware
scheduled priorities at the apex of the pyramid.
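
To make that concrete, here is a rough sketch (priorities invented for
illustration only) of how one cooperating process typically lays out
such a pyramid with the standard POSIX calls:

    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <string.h>

    static void *trigger_thread(void *arg) { /* short, latency-critical */ return NULL; }
    static void *worker_thread(void *arg)  { /* longer, less urgent     */ return NULL; }

    /* Create a SCHED_FIFO thread at the given priority (needs RT
     * privilege to succeed). */
    static int create_rt_thread(pthread_t *t, void *(*fn)(void *), int prio)
    {
        pthread_attr_t attr;
        struct sched_param sp = { .sched_priority = prio };
        int err;

        pthread_attr_init(&attr);
        pthread_attr_setinheritsched(&attr, PTHREAD_EXPLICIT_SCHED);
        pthread_attr_setschedpolicy(&attr, SCHED_FIFO);
        pthread_attr_setschedparam(&attr, &sp);
        err = pthread_create(t, &attr, fn, NULL);
        if (err)
            fprintf(stderr, "pthread_create: %s\n", strerror(err));
        pthread_attr_destroy(&attr);
        return err;
    }

    int main(void)
    {
        pthread_t trig, work;

        /* Near the apex: must keep its pathlength short. */
        if (create_rt_thread(&trig, trigger_thread, 70) == 0)
            pthread_join(trig, NULL);
        /* Nearer the base: may run longer, sees more latency. */
        if (create_rt_thread(&work, worker_thread, 50) == 0)
            pthread_join(work, NULL);
        return 0;
    }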

But rlimits are per-process, not per-thread (right??). So, the whole
RT subsystem is going to end up needing to share a single, overall
RLIMIT_RT_CPU value. This makes rlimits a rather poor fit for the
purpose. I could imagine limiting CPU use per priority level (not per
process), with the kernel helping to enforce the realtime priority
pyramid directly, restricting higher priorities to fewer cycles. But,
I have no well thought out proposal like that (as yet).
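
For what it's worth, reading the proposed limit from userland would
presumably look something like the sketch below. The
RLIMIT_RT_CPU_RATIO name is taken from the patch and is not in stock
headers, hence the guard; treat the whole thing as an assumption about
the eventual interface.

    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
    #ifdef RLIMIT_RT_CPU_RATIO
        struct rlimit rl;

        /* One value for the whole process; every thread shares it. */
        if (getrlimit(RLIMIT_RT_CPU_RATIO, &rl) == 0)
            printf("RT CPU ratio: soft %lu%%, hard %lu%%\n",
                   (unsigned long)rl.rlim_cur,
                   (unsigned long)rl.rlim_max);
    #else
        printf("RLIMIT_RT_CPU_RATIO not defined; headers lack the patch\n");
    #endif
        return 0;
    }

A privileged launcher would presumably call setrlimit() once, before
dropping privileges and starting the audio session, and every thread
the process then creates would be stuck with that single value.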

So apparently, this clumsy rlimits mechanism is really only going to
store two or three different values for every process in the system,
for example: (1) 0% for ordinary processes, (2) 80% for desktop audio
and video users, and (3) 98% for root and perhaps a few other
privileged processes. I probably still don't understand exactly how
these work, but does anyone seriously expect there to be more?

> yes, even though the way it increases is not linear, you can expect the
> average to increase if you run for 'too long' - where 'too long' is
> roughly (percentage * 300msecs). I.e. if you have it at 80% then you
> should expect the limit to kick in after running for 240 msecs. This
> shouldn't be a practical issue, unless an RT application has a very
> 'choppy' workload (processing for 1 second, then sleeping for 1 second,
> etc.) - in which case it needs an increased limit. There obviously must
> be some sort of boundary where the scheduler says 'this is enough'.

This seems reasonable to me. AFAIK, very few RT applications are
*that* intermittent. Generally, there is some basic cycle in which
things get done. Sometimes there are cycles within cycles.

I think I see a way to do graceful degradation with arbitrary, shorter
RT cycles. JACK currently watches its clients to detect when they
exceed the available cycle duration. JACK could query its own
RLIMIT_RT_CPU value and use that percentage of the actual cycle time
as the time limit. If this is exceeded, jackd would start shutting
down clients before the scheduler shuts down realtime operation
completely, thus (perhaps) avoiding a catastrophic failure of the
whole subsystem (like we saw in my tests). Since it makes no sense
for the clients to run with a different RLIMIT_RT_CPU value, this
ought to work in practice. Libjack could complain if a client with a
lower limit tried to connect. Each JACK server only handles one user
at a time, anyway.
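
Concretely, I imagine something like this inside jackd (pure sketch,
invented names, again assuming the patched headers; jackd would
compare each cycle's measured DSP time against the budget and start
shedding clients when it is exceeded):

    #include <stdio.h>
    #include <sys/resource.h>

    /* Budget in microseconds for one JACK cycle, or 0 if no limit is
     * visible.  Assumes the rlimit value is a plain percentage. */
    static unsigned long rt_cycle_budget(unsigned long period_usecs)
    {
    #ifdef RLIMIT_RT_CPU_RATIO
        struct rlimit rl;

        if (getrlimit(RLIMIT_RT_CPU_RATIO, &rl) == 0 && rl.rlim_cur > 0)
            return period_usecs * rl.rlim_cur / 100;
    #endif
        return 0;
    }

    int main(void)
    {
        /* e.g. 64 frames at 48 kHz is roughly a 1333 usec cycle */
        unsigned long period = 1333;
        unsigned long budget = rt_cycle_budget(period);

        if (budget)
            printf("cycle budget: %lu of %lu usecs\n", budget, period);
        else
            printf("no RT CPU limit visible; keep current behaviour\n");
        return 0;
    }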

Will the scheduler support that approach?
--
joq