[PATCH 2.6.13-rc6-rt9] PI aware dynamic priority adjustment
From: Thomas Gleixner
Date: Fri Aug 19 2005 - 18:48:15 EST
Hi all,
On Thu, 2005-08-18 at 08:01 +0200, Ingo Molnar wrote:
> i have released the 2.6.13-rc6-rt9 tree, which can be downloaded from
> the usual place:
I reworked the code for dynamically setting the priority of the hrtimer
softirq to be aware of PI.
The current function "mutex_chprio()" in -rt9 is equivalent to a static
priority assignment for the hrtimer softirq. The new PI-aware
function resolves this nicely and the latency improvements are
impressive (a factor of 2-3).
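The core difference, boiled down: a static change simply overwrites the
priority, while the PI-aware variant never lowers the effective priority
below that of the highest-priority waiter currently boosting the task
(lower number = higher priority). A minimal user-space sketch of just that
decision - pi_effective_prio() and its parameters are made up for
illustration, the real logic lives in pi_changeprio() in the patch below:

#include <stdio.h>

/*
 * Illustration only: compute the new effective priority of a task whose
 * normal priority changes to new_normal, while it may still be boosted
 * by a waiter with priority top_waiter. Lower value = higher priority.
 */
static int pi_effective_prio(int cur_prio, int new_normal,
                             int has_waiters, int top_waiter)
{
        int prio = new_normal;

        /* Lowering is only allowed down to the top boosting waiter */
        if (prio > cur_prio && has_waiters && top_waiter < prio)
                prio = top_waiter;

        return prio;
}

int main(void)
{
        /* Boosted to 10 by a waiter, normal prio moved to 40:
           the effective priority must stay at 10. */
        printf("%d\n", pi_effective_prio(10, 40, 1, 10));
        return 0;
}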
I stumbled over "//#define MUTEX_IRQS_OFF" in the first attempt. My
assumption that all code protected by pi_lock (which is a raw lock) is
irq safe turned out to be wrong. I missed that commented-out define :(
I guess it was introduced during the "IRQ latency contest" to squeeze
out the last nsecs :)
Switching it back on does not measurably influence system performance,
but it allows the PI-aware boosting function to be used in irq context.
Quite the contrary: it makes the system more snappy and the overall test
latencies go down.
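The failure mode the irqs-off variant avoids is the classic self-deadlock;
roughly (schematic only, assuming pi_lock is also taken on the hard irq
path that raises the hrtimer softirq):

  task context                        hard irq on the same CPU
  ------------                        -------------------------
  preempt_disable()
  acquire pi_lock (raw)
        ... interrupt ...
                                      raise_softirq_prio()
                                        pi_changeprio()
                                          tries to acquire pi_lock
                                          -> spins forever

Disabling irqs instead of only bumping the preempt counter keeps the
interrupt out until pi_lock is dropped, so the boost can safely be done
from irq context.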
Ingo ????
I assume you run with tracing enabled all the time, which conceals this
regression.
This is not only a demand for HRT; I'm aware of a _real world_
application which requires such functionality depending on the status of
the device. It runs on a low-performance machine, so avoiding task
switches is really appreciated.
--------------------------------------------
Find attached my current config and a simple test program for cyclic
scheduling. Read the comment at the top of the source for what you need
and where to get it.
Please use the -n option for now until we have sorted out the following
problems:
1. tasklist_lock contention
AFAICT tasklist_lock is the longest-held lock in the kernel aside from
the BKL, but that problem seems to be solved.
send_sigqueue() is called from posix_timer_fn() and acquires
tasklist_lock, which makes no sense to me.
send_sigqueue()'s (l)on(e)ly user is the posix timer function
(posix_timer_fn(), via posix_timer_event()).
Each posix timer keeps the task from vanishing via get_task_struct(),
a call which itself happens under the held tasklist_lock.
Neither can the task go away nor can the signal handler change until
put_task_struct() is called inside release_posix_timer(). That removes
any chance of an invalid access to either the task or sighand, because
the relevant timer is deleted before the call to put_task_struct().
Also, that call is protected by tasklist_lock.
If I understand the code correctly, then the tasklist_lock acquisition in
send_sigqueue() is superfluous. If not, then the "(un)register action" of
get/put_task_struct() is just hot air. AFAICS from kernel history,
this interface was introduced to provide lock-free access to certain
parts of the signaling code where the caller holds a reference to the
task structure instead of running through the PITA of find_task_by_pid()
- of course with tasklist_lock held - for every signal to deliver.
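To make the lifetime argument concrete, here is the call path as I read
it (condensed sketch, not copied verbatim from the tree):

  timer setup (sys_timer_create())
      get_task_struct(target)        <- task pinned, done under tasklist_lock

  expiry
      posix_timer_fn()
        posix_timer_event()
          send_sigqueue()            <- target is pinned by the reference,
                                        so taking tasklist_lock here buys
                                        nothing

  teardown (release_posix_timer())
      ... timer deleted ...
      put_task_struct(target)        <- only now can the task go away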
I already ran extensive tests on a kernel with the lock acquisition
removed, without any problems - as expected by my short-sighted self :)
Can the experts please shed some light on this?
Oleg, Paul, Roland ??? (Alphabetical order :)
If I'm not completely wrong, then this mechanism should be generalized
to allow users other than posix timers a simple way of un/registering
this protection. I know of a couple of places where this would be
useful. I'd be happy to write up the relevant patch for the general
interface.
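Something as small as this would probably do (rough sketch only;
task_ref_get()/task_ref_put() are made-up names wrapping the existing
get_task_struct()/put_task_struct() pair):

#include <linux/sched.h>

/*
 * Pin a task so that it can be accessed (e.g. signalled) later on
 * without holding tasklist_lock. The caller must hold tasklist_lock
 * or already own a reference when calling this.
 */
static inline struct task_struct *task_ref_get(struct task_struct *p)
{
        get_task_struct(p);
        return p;
}

/*
 * Drop the reference once the registered user (a posix timer today,
 * possibly others later) is torn down and can no longer touch the task.
 */
static inline void task_ref_put(struct task_struct *p)
{
        put_task_struct(p);
}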
If I missed something important, feel free to call me an (over-worked)
idiot :)
2. Drift of cyclic timers (armed by set_timer()):
Due to rounding errors and the drift adjustment code - the fixed
increment is precalculated when the timer is set up and added on each
rearm - I see a creeping deviation from the timeline.
I have a patch lined up to base the rearm on human (nsec) units, so this
effect will go away. But this is a waste of time until (1.) is solved.
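The arithmetic behind the fix, as a stand-alone illustration (all numbers
made up; the point is only that a once-rounded increment accumulates
error, while expiries computed from the original timeline do not):

#include <stdio.h>
#include <stdint.h>

static int64_t round_up(int64_t t, int64_t res)
{
        return ((t + res - 1) / res) * res;
}

int main(void)
{
        const int64_t interval = 1100000;       /* requested period in ns      */
        const int64_t res = 1536;               /* hypothetical timer resolution */
        /* increment rounded once at setup time, then reused on every rearm */
        const int64_t increment = round_up(interval, res);
        int64_t t_inc = 0, t_abs = 0;

        for (int64_t n = 1; n <= 1000000; n++) {
                t_inc += increment;                     /* error accumulates     */
                t_abs = round_up(n * interval, res);    /* error stays below res */
        }
        printf("drift vs. timeline: incremental %lld ns, absolute %lld ns\n",
               (long long)(t_inc - 1000000 * interval),
               (long long)(t_abs - 1000000 * interval));
        return 0;
}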
George ???
Cheers,
tglx
-----------------------------------
Some numbers:
Test scenario:
PIII 666MHz 128MB
Kernel command line: root=/dev/hda1 ro console=ttyS0,115200 lapic nmi_watchdog=2
Using lapic gives definitely better results than the solely PIT-based
version. Make sure to add it to your command line if you have an
x86 box. PPC is fine without it.
Load:
ssh1# while true; do hackbench 20; done
ssh2# while true; do find -type f | xargs grep kernel; done
(Over a large tree so caching is not relevant. The hd LED is almost permanently on.)
ping -f to the test box
All I(nterval)/Min/Act/Max values in usecs.
test1:/home/tglx/cyclictest# ./cyclictest -p 80 -n -i 1100
293.12 276.22 198.22 88/202 12270 <---- output from /proc/loadavg
T: 0 ( 5407) P:80 I: 1100 C: 2047066 Min: 7 Act: 15 Max: 59
test1:/home/tglx/cyclictest# ./cyclictest -p 80 -n -t 2 -d 10
296.17 275.21 193.82 52/202 22270
T: 0 ( 5407) P:80 I: 1000 C: 1047066 Min: 7 Act: 15 Max: 60
T: 1 ( 5408) P:79 I: 1010 C: 1036659 Min: 7 Act: 11 Max: 79
test1:/home/tglx/cyclictest# ./cyclictest -p 80 -n -t 5
130.31 124.17 148.68 3/525 19070
T: 0 ( 5473) P:80 I: 1000 C: 2631537 Min: 7 Act: 18 Max: 73
T: 1 ( 5474) P:79 I: 1500 C: 1754358 Min: 7 Act: 15 Max: 70
T: 2 ( 5475) P:78 I: 2000 C: 1315769 Min: 6 Act: 13 Max: 70
T: 3 ( 5476) P:77 I: 2500 C: 1052615 Min: 7 Act: 12 Max: 68
T: 4 ( 5477) P:76 I: 3000 C: 877179 Min: 6 Act: 10 Max: 66
test1:/home/tglx/cyclictest# ./cyclictest -p 80 -n -t 5 -i 800 -d 240
177.44 130.70 129.67 15/176 7217
T: 0 (21198) P:80 I: 800 C: 1578221 Min: 7 Act: 13 Max: 82
T: 1 (21199) P:79 I: 1040 C: 1214016 Min: 7 Act: 12 Max: 84
T: 2 (21200) P:78 I: 1280 C: 986388 Min: 7 Act: 13 Max: 60
T: 3 (21201) P:77 I: 1520 C: 830643 Min: 7 Act: 14 Max: 71
T: 4 (21202) P:76 I: 1760 C: 717373 Min: 7 Act: 13 Max: 66
test1:/home/tglx/cyclictest# ./cyclictest -p 80 -n -t 5 -i 666 -d 33
79.75 106.97 106.85 43/869 30355
T: 0 (25687) P:80 I: 666 C: 9329552 Min: 6 Act: 9 Max: 79
T: 1 (25688) P:79 I: 699 C: 8889101 Min: 6 Act: 12 Max: 80
T: 2 (25689) P:78 I: 732 C: 8488363 Min: 6 Act: 10 Max: 87
T: 3 (25690) P:77 I: 765 C: 8122198 Min: 6 Act: 12 Max: 83
T: 4 (25692) P:76 I: 798 C: 7786318 Min: 6 Act: 10 Max: 94
-----------------------------------
diff -uprN --exclude-from=/usr/local/bin/diffit.exclude linux-2.6.13-rc6-rt9/include/linux/sched.h linux-2.6.13-rc6-rt9.work/include/linux/sched.h
--- linux-2.6.13-rc6-rt9/include/linux/sched.h 2005-08-18 17:37:42.000000000 +0200
+++ linux-2.6.13-rc6-rt9.work/include/linux/sched.h 2005-08-19 18:45:31.000000000 +0200
@@ -1134,7 +1134,7 @@ extern int idle_cpu(int cpu);
extern int sched_setscheduler(struct task_struct *, int, struct sched_param *);
extern task_t *idle_task(int cpu);
extern void mutex_setprio(task_t *p, int prio);
-extern void mutex_chprio(task_t *p, int prio);
+extern void pi_changeprio(task_t *p, int prio);
extern int normal_prio(task_t *p);
void yield(void);
diff -uprN --exclude-from=/usr/local/bin/diffit.exclude linux-2.6.13-rc6-rt9/kernel/rt.c linux-2.6.13-rc6-rt9.work/kernel/rt.c
--- linux-2.6.13-rc6-rt9/kernel/rt.c 2005-08-18 17:37:42.000000000 +0200
+++ linux-2.6.13-rc6-rt9.work/kernel/rt.c 2005-08-20 00:49:15.000000000 +0200
@@ -201,15 +201,9 @@ do { \
# define trace_local_irq_disable(ti) raw_local_irq_disable()
# define trace_local_irq_enable(ti) raw_local_irq_enable()
# define trace_local_irq_restore(flags, ti) raw_local_irq_restore(flags)
-#else /* !CONFIG_RT_DEADLOCK_DETECT */
-/*
- * Two variants: irqs off and preempt-counter based.
- * preempt-counter based variant seems to be a bit faster.
- */
-//#define MUTEX_IRQS_OFF
+#else /* !CONFIG_RT_DEADLOCK_DETECT */
-#ifdef MUTEX_IRQS_OFF
# define trace_lock_irq(lock, ti) do { raw_local_irq_disable(); (void)(ti); } while (0)
# define trace_lock_irqsave(lock, flags, ti) do { raw_local_irq_save(flags); (void)(ti); } while (0)
# define trace_unlock_irq(lock, ti) raw_local_irq_enable()
@@ -217,15 +211,6 @@ do { \
# define trace_local_irq_disable(ti) raw_local_irq_disable()
# define trace_local_irq_enable(ti) raw_local_irq_enable()
# define trace_local_irq_restore(flags, ti) raw_local_irq_restore(flags)
-#else
-# define trace_lock_irq(lock, ti) preempt_disable_ti(ti)
-# define trace_lock_irqsave(lock, flags, ti) do { (void)flags; preempt_disable_ti(ti); } while (0)
-# define trace_unlock_irq(lock, ti) preempt_enable_ti(ti)
-# define trace_unlock_irqrestore(lock, flags, ti) do { (void)flags; preempt_enable_ti(ti); } while (0)
-# define trace_local_irq_disable(ti) preempt_disable_ti(ti)
-# define trace_local_irq_enable(ti) preempt_enable_ti(ti)
-# define trace_local_irq_restore(flags, ti) do { (void)flags; preempt_enable_ti(ti); } while (0)
-#endif
# define trace_unlock(lock, ti) do { } while (0)
@@ -235,6 +220,7 @@ do { \
# define TRACE_WARN_ON_LOCKED(c) do { } while (0)
# define TRACE_OFF() do { } while (0)
# define TRACE_BUG_ON_LOCKED(c) do { } while (0)
+
#endif /* CONFIG_RT_DEADLOCK_DETECT */
/*
@@ -829,6 +815,50 @@ static void pi_setprio(struct rt_mutex *
}
}
+/*
+ * Change the priority of a task in a PI-aware way
+ *
+ * There are several aspects to consider:
+ * - task is priority boosted
+ * - task is blocked on a mutex
+ *
+ */
+void pi_changeprio(struct task_struct *p, int prio)
+{
+ unsigned long flags;
+ int oldprio;
+
+ spin_lock_irqsave(&pi_lock, flags);
+
+ oldprio = p->normal_prio;
+ if (oldprio == prio)
+ goto out;
+
+ /* Set normal prio in any case */
+ p->normal_prio = prio;
+
+ /* Check, if we can safely lower the priority */
+ if (prio > p->prio && !plist_empty(&p->pi_waiters)) {
+ struct rt_mutex_waiter *w;
+ w = plist_first_entry(&p->pi_waiters,
+ struct rt_mutex_waiter, pi_list);
+ if (w->ti->task->prio < prio)
+ prio = w->ti->task->prio;
+ }
+
+ if (prio == p->prio)
+ goto out;
+
+ /* Is task blocked on a mutex ? */
+ if (p->blocked_on)
+ pi_setprio(p->blocked_on->lock, p, prio);
+ else
+ mutex_setprio(p, prio);
+ out:
+ spin_unlock_irqrestore(&pi_lock, flags);
+
+}
+
static void
task_blocks_on_lock(struct rt_mutex_waiter *waiter, struct thread_info *ti,
struct rt_mutex *lock __EIP_DECL__)
diff -uprN --exclude-from=/usr/local/bin/diffit.exclude linux-2.6.13-rc6-rt9/kernel/sched.c linux-2.6.13-rc6-rt9.work/kernel/sched.c
--- linux-2.6.13-rc6-rt9/kernel/sched.c 2005-08-18 17:37:42.000000000 +0200
+++ linux-2.6.13-rc6-rt9.work/kernel/sched.c 2005-08-19 18:44:31.000000000 +0200
@@ -3954,58 +3954,6 @@ void mutex_setprio(task_t *p, int prio)
task_rq_unlock(rq, &flags);
}
-/*
- * Used by the PREEMPT_RT code to implement
- * priority inheritance logic for timers:
- *
- * (differs in that the priority change is permanent)
- */
-void mutex_chprio(task_t *p, int prio)
-{
- unsigned long flags;
- prio_array_t *array;
- runqueue_t *rq;
- int oldprio, prev_resched;
-
- BUG_ON(prio < 0 || prio > MAX_PRIO);
- WARN_ON(p->policy == SCHED_NORMAL);
-
- rq = task_rq_lock(p, &flags);
-
- oldprio = p->prio;
- array = p->array;
- if (array)
- dequeue_task(p, array);
- p->normal_prio = prio;
- if (prio < p->prio)
- p->prio = prio;
-
- trace_special_pid(p->pid, oldprio, prio);
- prev_resched = _need_resched();
- if (array) {
- /*
- * If changing to an RT priority then queue it
- * in the active array!
- */
- if (rt_task(p))
- array = rq->active;
- enqueue_task(p, array);
- /*
- * Reschedule if we are currently running on this runqueue and
- * our priority decreased, or if we are not currently running on
- * this runqueue and our priority is higher than the current's
- */
- if (task_running(rq, p)) {
- if (p->prio > oldprio)
- resched_task(rq->curr);
- } else if (TASK_PREEMPTS_CURR(p, rq))
- resched_task(rq->curr);
- }
- trace_special(prev_resched, _need_resched(), 0);
-
- task_rq_unlock(rq, &flags);
-}
-
#ifdef __ARCH_WANT_SYS_NICE
/*
diff -uprN --exclude-from=/usr/local/bin/diffit.exclude linux-2.6.13-rc6-rt9/kernel/softirq.c linux-2.6.13-rc6-rt9.work/kernel/softirq.c
--- linux-2.6.13-rc6-rt9/kernel/softirq.c 2005-08-18 17:37:42.000000000 +0200
+++ linux-2.6.13-rc6-rt9.work/kernel/softirq.c 2005-08-19 18:44:31.000000000 +0200
@@ -314,7 +314,7 @@ void raise_softirq_prio(unsigned long nr
if (prio > (MAX_RT_PRIO - 1))
prio = MAX_RT_PRIO - 1;
- mutex_chprio(tsk, prio);
+ pi_changeprio(tsk, prio);
wake_up_process(tsk);
}
#endif
diff -uprN --exclude-from=/usr/local/bin/diffit.exclude linux-2.6.13-rc6-rt9/kernel/timer.c linux-2.6.13-rc6-rt9.work/kernel/timer.c
--- linux-2.6.13-rc6-rt9/kernel/timer.c 2005-08-18 17:37:43.000000000 +0200
+++ linux-2.6.13-rc6-rt9.work/kernel/timer.c 2005-08-19 18:44:31.000000000 +0200
@@ -1731,7 +1731,7 @@ static inline void recalc_priority(tvec_
if (prio == hrbase->hr_prio)
return;
hrbase->hr_prio = prio;
- mutex_chprio(current, prio);
+ pi_changeprio(current, prio);
spin_unlock_irq(&hrbase->t_base.lock);
cond_resched();
spin_lock_irq(&hrbase->t_base.lock);
Attachment:
cyclictest-0.1.tar.bz2
Description: application/bzip-compressed-tar
Attachment:
test.config.bz2
Description: application/bzip