Re: OSS sound emulation broken between 2.6.0-test2 and test3

From: Martin Schlemmer
Date: Sat Dec 27 2003 - 12:29:57 EST


On Sat, 2003-12-27 at 18:55, Edward Tandi wrote:
> On Sat, 2003-12-27 at 14:33, Martin Schlemmer wrote:
> > On Sat, 2003-12-27 at 15:08, Edward Tandi wrote:
> > > On Sat, 2003-12-27 at 12:24, Martin Schlemmer wrote:
> > > > On Sat, 2003-12-27 at 13:44, Edward Tandi wrote:
> > > > > On Sat, 2003-12-27 at 11:11, Martin Schlemmer wrote:
> > > > > > On Sat, 2003-12-27 at 09:50, Martin J. Bligh wrote:
> > > > > > > Something appears to have broken OSS sound emulation between
> > > > > > > test2 and test3. Best I can tell (despite the appearance of the BK logs),
> > > > > > > that included ALSA updates 0.9.5 and 0.9.6. Hopefully someone who
> > > > > > > understands the sound architecture better than I can fix this?
> > > > > > >
> > > > > >
> > > > > > I won't say I understand it, but from a quick look the major change
> > > > > > seems to be the addition of the 'whole-frag' and 'no-silence' options.
> > > > > > You might try the following to revert at least what the 'no-silence'
> > > > > > change does:
> > > > > >
> > > > > > --
> > > > > > # echo 'xmms 0 0 no-silence' > /proc/asound/card0/pcm0p/oss
> > > > > > # echo 'xmms 0 0 whole-frag' > /proc/asound/card0/pcm0p/oss
> > > > > > --
> > > > >
> > > > > Thanks, that fixes it for me. I too have been seeing terrible problems
> > > > > with XMMS since the early 2.6-pre kernels.
> > > > >
> > > > > Because it only happens in XMMS, I thought it was one of those
> > > > > application bugs brought out by scheduler changes. I now use Zinf BTW
> > > > > - it's better for large music collections (although not as stable or
> > > > > flash).
> > > > >
> > > >
> > > > Can you check which one actually fixes it?
> > >
> > > Yes, it's the 'whole-frag' line.
> > >
> >
> > Well, I can't say I can see why. In snd_pcm_oss_write1, where the change
> > for whole-frag was made, I cannot see a race or anything similar. The only
> > possible causes I can see are that:
> > 1) xmms's scheduling gets screwed up by the short writes
> > 2) some drivers may have issues with getting short writes all the time.
> >
> > The only problem with 2) is that Zinf works fine for you, so I guess we
> > have to assume that either Zinf does not use OSS but the ALSA interface,
> > or that 1) is indeed the correct answer. What version of XMMS do you use,
> > btw?
>
> I was originally running 1.2.7 when I first encountered the problem. I
> then built 1.2.8 - but no difference.
>
> Zinf does indeed use the ALSA interface (as most apps do nowadays).
>

Ok, so it still can go either way.

> I would say the symptoms are that the music starts playing OK, but after
> a short period (18-19 seconds) the music changes overall speed (by a
> semi-tone or so). When it does this, the sound also starts to break up.
> This is why I associated the problem with the process scheduling changes
> being made at the time.
>

I do not use the default scheduler - I have attached Nick's one, plus a
context-switch-accounting-fix.patch which should be applied before
sched-rollup-v19a.patch ... I cannot reboot into vanilla to test whether
this will break things for me again until the weekend is past, but you
may try it and see if the scheduler theory is correct.

> It could be a driver issue. FYI, I am using a VIA KT400 chipset. Any one
> know of any low-level timing issues with the KT400?
>

I do not have a box with this chipset, so I do not follow related
threads, sorry.


--
Martin Schlemmer

From: Nick Piggin <piggin@xxxxxxxxxxxxxxx>

Make sure to count kernel preemption as a context switch. A shortcut in
schedule() has been preventing that.
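
For illustration, here is a minimal user-space sketch of the accounting rule
the patch moves into the switch_tasks path below: a task that is still
runnable when it loses the CPU, or that is descheduled by kernel preemption
(PREEMPT_ACTIVE), is charged an involuntary switch (nivcsw), while a task
that blocked of its own accord is charged a voluntary one (nvcsw). The
struct and helper are illustrative assumptions; only the field names mirror
the 2.6 scheduler.

--
/* Minimal user-space model of the accounting rule this patch restores;
 * field names mirror the 2.6 scheduler, but this is not kernel code. */
#include <stdio.h>

struct task {
	int running;            /* still runnable when it lost the CPU? */
	int preempt_active;     /* descheduled by kernel preemption? */
	unsigned long nvcsw;    /* voluntary context switches */
	unsigned long nivcsw;   /* involuntary context switches */
};

static void account_switch(struct task *prev)
{
	if (prev->running || prev->preempt_active)
		prev->nivcsw++;  /* preempted: involuntary switch */
	else
		prev->nvcsw++;   /* blocked on its own: voluntary switch */
}

int main(void)
{
	struct task t = { .running = 0, .preempt_active = 1 };

	account_switch(&t);      /* kernel preemption is now counted */
	printf("nvcsw=%lu nivcsw=%lu\n", t.nvcsw, t.nivcsw);
	return 0;
}
--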



kernel/sched.c | 37 +++++++++++++++----------------------
1 files changed, 15 insertions(+), 22 deletions(-)

diff -puN kernel/sched.c~context-switch-accounting-fix kernel/sched.c
--- 25/kernel/sched.c~context-switch-accounting-fix 2003-11-11 19:29:37.000000000 -0800
+++ 25-akpm/kernel/sched.c 2003-11-11 19:29:37.000000000 -0800
@@ -1513,33 +1513,20 @@ need_resched:

spin_lock_irq(&rq->lock);

- /*
- * if entering off of a kernel preemption go straight
- * to picking the next task.
- */
- if (unlikely(preempt_count() & PREEMPT_ACTIVE))
- goto pick_next_task;
-
- switch (prev->state) {
- case TASK_INTERRUPTIBLE:
- if (unlikely(signal_pending(prev))) {
+ if (prev->state != TASK_RUNNING &&
+ likely(!(preempt_count() & PREEMPT_ACTIVE)) ) {
+ if (unlikely(signal_pending(prev)) &&
+ prev->state == TASK_INTERRUPTIBLE)
prev->state = TASK_RUNNING;
- break;
- }
- default:
- deactivate_task(prev, rq);
- prev->nvcsw++;
- break;
- case TASK_RUNNING:
- prev->nivcsw++;
+ else
+ deactivate_task(prev, rq);
}
-pick_next_task:
- if (unlikely(!rq->nr_running)) {
+
#ifdef CONFIG_SMP
+ if (unlikely(!rq->nr_running))
load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
- if (rq->nr_running)
- goto pick_next_task;
#endif
+ if (unlikely(!rq->nr_running)) {
next = rq->idle;
rq->expired_timestamp = 0;
goto switch_tasks;
@@ -1586,6 +1573,12 @@ switch_tasks:
prev->timestamp = now;

if (likely(prev != next)) {
+ if (prev->state == TASK_RUNNING ||
+ unlikely(preempt_count() & PREEMPT_ACTIVE))
+ prev->nivcsw++;
+ else
+ prev->nvcsw++;
+
next->timestamp = now;
rq->nr_switches++;
rq->curr = next;

_
linux-2.6-npiggin/arch/i386/kernel/smpboot.c | 4
linux-2.6-npiggin/fs/proc/array.c | 8
linux-2.6-npiggin/include/linux/init_task.h | 4
linux-2.6-npiggin/include/linux/sched.h | 12
linux-2.6-npiggin/include/linux/topology.h | 8
linux-2.6-npiggin/init/main.c | 1
linux-2.6-npiggin/kernel/fork.c | 30
linux-2.6-npiggin/kernel/sched.c | 1237 ++++++++++++++-------------
8 files changed, 688 insertions(+), 616 deletions(-)

diff -puN arch/i386/kernel/smpboot.c~rollup arch/i386/kernel/smpboot.c
--- linux-2.6/arch/i386/kernel/smpboot.c~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/arch/i386/kernel/smpboot.c 2003-11-15 23:46:34.000000000 +1100
@@ -915,13 +915,13 @@ static void smp_tune_scheduling (void)
cacheflush_time = (cpu_khz>>10) * (cachesize<<10) / bandwidth;
}

- cache_decay_ticks = (long)cacheflush_time/cpu_khz + 1;
+ cache_decay_ticks = (long)cacheflush_time/cpu_khz * HZ / 1000;

printk("per-CPU timeslice cutoff: %ld.%02ld usecs.\n",
(long)cacheflush_time/(cpu_khz/1000),
((long)cacheflush_time*100/(cpu_khz/1000)) % 100);
printk("task migration cache decay timeout: %ld msecs.\n",
- cache_decay_ticks);
+ (cache_decay_ticks + 1) * 1000 / HZ);
}

/*
diff -puN fs/proc/array.c~rollup fs/proc/array.c
--- linux-2.6/fs/proc/array.c~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/fs/proc/array.c 2003-11-15 23:46:35.000000000 +1100
@@ -154,7 +154,9 @@ static inline char * task_state(struct t
read_lock(&tasklist_lock);
buffer += sprintf(buffer,
"State:\t%s\n"
- "SleepAVG:\t%lu%%\n"
+ "sleep_avg:\t%lu\n"
+ "sleep_time:\t%lu\n"
+ "total_time:\t%lu\n"
"Tgid:\t%d\n"
"Pid:\t%d\n"
"PPid:\t%d\n"
@@ -162,8 +164,8 @@ static inline char * task_state(struct t
"Uid:\t%d\t%d\t%d\t%d\n"
"Gid:\t%d\t%d\t%d\t%d\n",
get_task_state(p),
- (p->sleep_avg/1024)*100/(1000000000/1024),
- p->tgid,
+ p->sleep_avg, p->sleep_time, p->total_time,
+ p->tgid,
p->pid, p->pid ? p->real_parent->pid : 0,
p->pid && p->ptrace ? p->parent->pid : 0,
p->uid, p->euid, p->suid, p->fsuid,
diff -puN include/linux/init_task.h~rollup include/linux/init_task.h
--- linux-2.6/include/linux/init_task.h~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/include/linux/init_task.h 2003-11-15 23:46:35.000000000 +1100
@@ -67,8 +67,8 @@
.usage = ATOMIC_INIT(2), \
.flags = 0, \
.lock_depth = -1, \
- .prio = MAX_PRIO-20, \
- .static_prio = MAX_PRIO-20, \
+ .prio = MAX_PRIO-30, \
+ .static_prio = MAX_PRIO-30, \
.policy = SCHED_NORMAL, \
.cpus_allowed = CPU_MASK_ALL, \
.mm = NULL, \
diff -puN include/linux/sched.h~rollup include/linux/sched.h
--- linux-2.6/include/linux/sched.h~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/include/linux/sched.h 2003-11-15 23:46:35.000000000 +1100
@@ -283,7 +283,7 @@ struct signal_struct {
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO

-#define MAX_PRIO (MAX_RT_PRIO + 40)
+#define MAX_PRIO (MAX_RT_PRIO + 59)

#define rt_task(p) ((p)->prio < MAX_RT_PRIO)

@@ -344,14 +344,17 @@ struct task_struct {
struct list_head run_list;
prio_array_t *array;

+ /* Scheduler variables follow. kernel/sched.c */
+ unsigned long array_sequence;
+ unsigned long timestamp;
+
+ unsigned long total_time, sleep_time;
unsigned long sleep_avg;
- long interactive_credit;
- unsigned long long timestamp;
- int activated;

unsigned long policy;
cpumask_t cpus_allowed;
unsigned int time_slice, first_time_slice;
+ unsigned int used_slice;

struct list_head tasks;
struct list_head ptrace_children;
@@ -588,6 +591,7 @@ extern int FASTCALL(wake_up_process(stru
static inline void kick_process(struct task_struct *tsk) { }
#endif
extern void FASTCALL(wake_up_forked_process(struct task_struct * tsk));
+extern void FASTCALL(sched_fork(task_t * p));
extern void FASTCALL(sched_exit(task_t * p));

asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);
diff -puN include/linux/topology.h~rollup include/linux/topology.h
--- linux-2.6/include/linux/topology.h~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/include/linux/topology.h 2003-11-15 23:46:35.000000000 +1100
@@ -54,4 +54,12 @@ static inline int __next_node_with_cpus(
#define for_each_node_with_cpus(node) \
for (node = 0; node < numnodes; node = __next_node_with_cpus(node))

+#ifndef NUMA_FACTOR_BONUS
+/*
+ * High NUMA_FACTOR_BONUS means rare cross-node load balancing. The default
+ * value of 20 means node rebalance after 10 failed local balances,
+ * Should be tuned for each platform in asm/topology.h.
+ */
+#define NUMA_FACTOR_BONUS (HZ/50 ?: 1)
+#endif
#endif /* _LINUX_TOPOLOGY_H */
diff -puN init/main.c~rollup init/main.c
--- linux-2.6/init/main.c~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/init/main.c 2003-11-15 23:46:35.000000000 +1100
@@ -549,7 +549,6 @@ static void do_pre_smp_initcalls(void)

migration_init();
#endif
- node_nr_running_init();
spawn_ksoftirqd();
}

diff -puN kernel/fork.c~rollup kernel/fork.c
--- linux-2.6/kernel/fork.c~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/kernel/fork.c 2003-11-15 23:46:34.000000000 +1100
@@ -958,33 +958,9 @@ struct task_struct *copy_process(unsigne
p->exit_signal = (clone_flags & CLONE_THREAD) ? -1 : (clone_flags & CSIGNAL);
p->pdeath_signal = 0;

- /*
- * Share the timeslice between parent and child, thus the
- * total amount of pending timeslices in the system doesn't change,
- * resulting in more scheduling fairness.
- */
- local_irq_disable();
- p->time_slice = (current->time_slice + 1) >> 1;
- /*
- * The remainder of the first timeslice might be recovered by
- * the parent if the child exits early enough.
- */
- p->first_time_slice = 1;
- current->time_slice >>= 1;
- p->timestamp = sched_clock();
- if (!current->time_slice) {
- /*
- * This case is rare, it happens when the parent has only
- * a single jiffy left from its timeslice. Taking the
- * runqueue lock is not a problem.
- */
- current->time_slice = 1;
- preempt_disable();
- scheduler_tick(0, 0);
- local_irq_enable();
- preempt_enable();
- } else
- local_irq_enable();
+ /* Perform scheduler related accounting */
+ sched_fork(p);
+
/*
* Ok, add it to the run-queues and make it
* visible to the rest of the system.
diff -puN kernel/sched.c~rollup kernel/sched.c
--- linux-2.6/kernel/sched.c~rollup 2003-11-15 23:46:34.000000000 +1100
+++ linux-2.6-npiggin/kernel/sched.c 2003-11-15 23:46:35.000000000 +1100
@@ -14,7 +14,6 @@
* an array-switch method of distributing timeslices
* and per-CPU runqueues. Cleanups and useful suggestions
* by Davide Libenzi, preemptible kernel bits by Robert Love.
- * 2003-09-03 Interactivity tuning by Con Kolivas.
*/

#include <linux/mm.h>
@@ -49,8 +48,8 @@
* to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
* and back.
*/
-#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 20)
-#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 20)
+#define NICE_TO_PRIO(nice) (MAX_RT_PRIO + (nice) + 30)
+#define PRIO_TO_NICE(prio) ((prio) - MAX_RT_PRIO - 30)
#define TASK_NICE(p) PRIO_TO_NICE((p)->static_prio)

/*
@@ -61,134 +60,71 @@
#define USER_PRIO(p) ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p) USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO (USER_PRIO(MAX_PRIO))
-#define AVG_TIMESLICE (MIN_TIMESLICE + ((MAX_TIMESLICE - MIN_TIMESLICE) *\
- (MAX_PRIO-1-NICE_TO_PRIO(0))/(MAX_USER_PRIO - 1)))

/*
- * Some helpers for converting nanosecond timing to jiffy resolution
+ * MIN_TIMESLICE is the timeslice that a minimum priority process gets if there
+ * is a maximum priority process runnable. MAX_TIMESLICE is derived from the
+ * formula in task_timeslice. It cannot be changed here. It is the timeslice
+ * that the maximum priority process will get. Larger timeslices are attainable
+ * by low priority processes however.
*/
-#define NS_TO_JIFFIES(TIME) ((TIME) / (1000000000 / HZ))
-#define JIFFIES_TO_NS(TIME) ((TIME) * (1000000000 / HZ))
+#define MIN_TIMESLICE (1000000 / 1000)
+#define MAX_TIMESLICE (60 * MIN_TIMESLICE) /* do not change this */
+
+/* Maximum amount of history that will be used to calculate priority */
+#define MAX_SLEEP (1000000 / 2)

/*
- * These are the 'tuning knobs' of the scheduler:
- *
- * Minimum timeslice is 10 msecs, default timeslice is 100 msecs,
- * maximum timeslice is 200 msecs. Timeslices get refilled after
- * they expire.
+ * Maximum effect that 1 block of activity (run/sleep/etc) can have. This
+ * will moderate and discard freak events (eg. SIGSTOP).
*/
-#define MIN_TIMESLICE ( 10 * HZ / 1000)
-#define MAX_TIMESLICE (200 * HZ / 1000)
-#define ON_RUNQUEUE_WEIGHT 30
-#define CHILD_PENALTY 95
-#define PARENT_PENALTY 100
-#define EXIT_WEIGHT 3
-#define PRIO_BONUS_RATIO 25
-#define MAX_BONUS (MAX_USER_PRIO * PRIO_BONUS_RATIO / 100)
-#define INTERACTIVE_DELTA 2
-#define MAX_SLEEP_AVG (AVG_TIMESLICE * MAX_BONUS)
-#define STARVATION_LIMIT (MAX_SLEEP_AVG)
-#define NS_MAX_SLEEP_AVG (JIFFIES_TO_NS(MAX_SLEEP_AVG))
-#define NODE_THRESHOLD 125
-#define CREDIT_LIMIT 100
+#define MAX_SLEEP_AFFECT (MAX_SLEEP/4)
+#define MAX_RUN_AFFECT (MAX_SLEEP/4)
+#define MAX_WAIT_AFFECT (MAX_RUN_AFFECT/2)

/*
- * If a task is 'interactive' then we reinsert it in the active
- * array after it has expired its current timeslice. (it will not
- * continue to run immediately, it will still roundrobin with
- * other interactive tasks.)
- *
- * This part scales the interactivity limit depending on niceness.
- *
- * We scale it linearly, offset by the INTERACTIVE_DELTA delta.
- * Here are a few examples of different nice levels:
- *
- * TASK_INTERACTIVE(-20): [1,1,1,1,1,1,1,1,1,0,0]
- * TASK_INTERACTIVE(-10): [1,1,1,1,1,1,1,0,0,0,0]
- * TASK_INTERACTIVE( 0): [1,1,1,1,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 10): [1,1,0,0,0,0,0,0,0,0,0]
- * TASK_INTERACTIVE( 19): [0,0,0,0,0,0,0,0,0,0,0]
- *
- * (the X axis represents the possible -5 ... 0 ... +5 dynamic
- * priority range a task can explore, a value of '1' means the
- * task is rated interactive.)
- *
- * Ie. nice +19 tasks can never get 'interactive' enough to be
- * reinserted into the active array. And only heavily CPU-hog nice -20
- * tasks will be expired. Default nice 0 tasks are somewhere between,
- * it takes some effort for them to get interactive, but it's not
- * too hard.
+ * The amount of history can be decreased (on fork for example). This puts a
+ * lower bound on it.
*/
+#define MIN_HISTORY (MAX_SLEEP/2)

-#define CURRENT_BONUS(p) \
- (NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
- MAX_SLEEP_AVG)
-
-#ifdef CONFIG_SMP
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)) * \
- num_online_cpus())
-#else
-#define TIMESLICE_GRANULARITY(p) (MIN_TIMESLICE * \
- (1 << (((MAX_BONUS - CURRENT_BONUS(p)) ? : 1) - 1)))
-#endif
-
-#define SCALE(v1,v1_max,v2_max) \
- (v1) * (v2_max) / (v1_max)
-
-#define DELTA(p) \
- (SCALE(TASK_NICE(p), 40, MAX_USER_PRIO*PRIO_BONUS_RATIO/100) + \
- INTERACTIVE_DELTA)
-
-#define TASK_INTERACTIVE(p) \
- ((p)->prio <= (p)->static_prio - DELTA(p))
-
-#define JUST_INTERACTIVE_SLEEP(p) \
- (JIFFIES_TO_NS(MAX_SLEEP_AVG * \
- (MAX_BONUS / 2 + DELTA((p)) + 1) / MAX_BONUS - 1))
-
-#define HIGH_CREDIT(p) \
- ((p)->interactive_credit > CREDIT_LIMIT)
+/*
+ * SLEEP_FACTOR is a fixed point factor used to scale history tracking things.
+ * In particular: total_time, sleep_time, sleep_avg.
+ */

-#define LOW_CREDIT(p) \
- ((p)->interactive_credit < -CREDIT_LIMIT)
+#define SLEEP_FACTOR 1024

-#define TASK_PREEMPTS_CURR(p, rq) \
- ((p)->prio < (rq)->curr->prio)
+#define CPU_BALANCE_THRESHOLD 125
+#define NODE_BALANCE_THRESHOLD 125

/*
- * BASE_TIMESLICE scales user-nice values [ -20 ... 19 ]
- * to time slice values.
- *
- * The higher a thread's priority, the bigger timeslices
- * it gets during one round of execution. But even the lowest
- * priority thread gets MIN_TIMESLICE worth of execution time.
- *
- * task_timeslice() is the interface that is used by the scheduler.
+ * The scheduler classifies a process as performing one of the following
+ * activities
*/
+#define STIME_SLEEP 1 /* Sleeping */
+#define STIME_RUN 2 /* Using CPU */
+#define STIME_WAIT 3 /* Waiting for CPU */

-#define BASE_TIMESLICE(p) (MIN_TIMESLICE + \
- ((MAX_TIMESLICE - MIN_TIMESLICE) * (MAX_PRIO-1-(p)->static_prio)/(MAX_USER_PRIO - 1)))
-
-static inline unsigned int task_timeslice(task_t *p)
-{
- return BASE_TIMESLICE(p);
-}
+#define TASK_PREEMPTS_CURR(p, rq) \
+ ( (p)->prio < (rq)->curr->prio )

/*
* These are the runqueue data structures:
*/

-#define BITMAP_SIZE ((((MAX_PRIO+1+7)/8)+sizeof(long)-1)/sizeof(long))
+#define BITMAP_SIZE ((((MAX_PRIO+7)/8)+sizeof(long)-1)/sizeof(long))

typedef struct runqueue runqueue_t;

struct prio_array {
- int nr_active;
+ unsigned int nr_active;
unsigned long bitmap[BITMAP_SIZE];
struct list_head queue[MAX_PRIO];
};

+#define FPT 128 /* fixed point factor */
+
/*
* This is the main, per-CPU runqueue data structure.
*
@@ -198,20 +134,24 @@ struct prio_array {
*/
struct runqueue {
spinlock_t lock;
- unsigned long nr_running, nr_switches, expired_timestamp,
- nr_uninterruptible;
+ unsigned long array_sequence;
+ unsigned long nr_running, nr_switches, nr_uninterruptible;
task_t *curr, *idle;
struct mm_struct *prev_mm;
prio_array_t *active, *expired, arrays[2];
- int prev_cpu_load[NR_CPUS];
-#ifdef CONFIG_NUMA
- atomic_t *node_nr_running;
- int prev_node_load[MAX_NUMNODES];
-#endif
+
task_t *migration_thread;
struct list_head migration_queue;

atomic_t nr_iowait;
+
+#ifdef CONFIG_SMP
+ unsigned long nr_lb_failed;
+ unsigned long cpu_load[NR_CPUS];
+#endif
+#ifdef CONFIG_NUMA
+ unsigned long nr_exec;
+#endif
};

static DEFINE_PER_CPU(struct runqueue, runqueues);
@@ -230,51 +170,32 @@ static DEFINE_PER_CPU(struct runqueue, r
# define task_running(rq, p) ((rq)->curr == (p))
#endif

-#ifdef CONFIG_NUMA
-
/*
* Keep track of running tasks.
*/

-static atomic_t node_nr_running[MAX_NUMNODES] ____cacheline_maxaligned_in_smp =
- {[0 ...MAX_NUMNODES-1] = ATOMIC_INIT(0)};
-
-static inline void nr_running_init(struct runqueue *rq)
+static inline void nr_running_init(int cpu)
{
- rq->node_nr_running = &node_nr_running[0];
}

static inline void nr_running_inc(runqueue_t *rq)
{
- atomic_inc(rq->node_nr_running);
rq->nr_running++;
}

static inline void nr_running_dec(runqueue_t *rq)
{
- atomic_dec(rq->node_nr_running);
rq->nr_running--;
}

-__init void node_nr_running_init(void)
+#define US_TO_JIFFIES(x) (x * HZ / 1000000)
+static inline unsigned long clock_us(void)
{
- int i;
-
- for (i = 0; i < NR_CPUS; i++) {
- if (cpu_possible(i))
- cpu_rq(i)->node_nr_running =
- &node_nr_running[cpu_to_node(i)];
- }
+ unsigned long long ns = sched_clock();
+ do_div(ns, 1000UL);
+ return ns;
}

-#else /* !CONFIG_NUMA */
-
-# define nr_running_init(rq) do { } while (0)
-# define nr_running_inc(rq) do { (rq)->nr_running++; } while (0)
-# define nr_running_dec(rq) do { (rq)->nr_running--; } while (0)
-
-#endif /* CONFIG_NUMA */
-
/*
* task_rq_lock - lock the runqueue a given task resides on and disable
* interrupts. Note the ordering: we can safely lookup the task_rq without
@@ -339,36 +260,121 @@ static inline void enqueue_task(struct t
}

/*
- * effective_prio - return the priority that is based on the static
- * priority but is modified by bonuses/penalties.
- *
- * We scale the actual sleep average [0 .... MAX_SLEEP_AVG]
- * into the -5 ... 0 ... +5 bonus/penalty range.
- *
- * We use 25% of the full 0...39 priority range so that:
- *
- * 1) nice +19 interactive tasks do not preempt nice 0 CPU hogs.
- * 2) nice -20 CPU hogs do not get preempted by nice 0 tasks.
+ * add_task_time updates a task @p after @time of doing the specified @type
+ * of activity. See STIME_*. This is used for priority calculation.
+ */
+
+static inline void add_task_time(task_t *p, unsigned long time, unsigned long type)
+{
+ unsigned long ratio;
+ unsigned long max_affect;
+ unsigned long long tmp;
+
+ if (time == 0)
+ return;
+
+ if (type == STIME_SLEEP)
+ max_affect = MAX_SLEEP_AFFECT;
+ else if (type == STIME_RUN)
+ max_affect = MAX_RUN_AFFECT;
+ else
+ max_affect = MAX_WAIT_AFFECT;
+
+ if (time > max_affect)
+ time = max_affect;
+
+ ratio = MAX_SLEEP - time;
+ tmp = (unsigned long long)ratio*p->total_time + MAX_SLEEP/2;
+ do_div(tmp, MAX_SLEEP);
+ p->total_time = tmp;
+
+ tmp = (unsigned long long)ratio*p->sleep_time + MAX_SLEEP/2;
+ do_div(tmp, MAX_SLEEP);
+ p->sleep_time = tmp;
+
+ if (type != STIME_WAIT) {
+ p->total_time += time;
+ if (type == STIME_SLEEP)
+ p->sleep_time += time;
+
+ p->sleep_avg = (SLEEP_FACTOR * p->sleep_time) / p->total_time;
+ }
+
+ if (p->total_time < MIN_HISTORY) {
+ p->total_time = MIN_HISTORY;
+ p->sleep_time = p->total_time * p->sleep_avg / SLEEP_FACTOR;
+ }
+}
+
+/*
+ * The higher a thread's priority, the bigger timeslices
+ * it gets during one round of execution. But even the lowest
+ * priority thread gets MIN_TIMESLICE worth of execution time.
*
- * Both properties are important to certain workloads.
+ * Timeslices are scaled, so if only low priority processes are running,
+ * they will all get long timeslices.
*/
-static int effective_prio(task_t *p)
+static unsigned int task_timeslice(task_t *p, runqueue_t *rq)
{
- int bonus, prio;
+ int idx, delta;
+ unsigned int base, timeslice;
+
+ if (unlikely(rt_task(p)))
+ return MAX_TIMESLICE;
+
+ idx = min(find_next_bit(rq->active->bitmap, MAX_PRIO, MAX_RT_PRIO),
+ find_next_bit(rq->expired->bitmap, MAX_PRIO, MAX_RT_PRIO));
+ idx = min(idx, p->prio);
+ delta = p->prio - idx;
+
+ /*
+ * This is a bit subtle. The first line establishes a timeslice based
+ * on how far this task is from being the highest priority runnable.
+ * The second line scales this result so low priority tasks will get
+ * big timeslices if higher priority ones are not running.
+ */
+ base = MIN_TIMESLICE * (MAX_USER_PRIO + 1) / (delta + 2);
+ timeslice = base * (USER_PRIO(idx) + 8) / 24;
+
+ if (timeslice <= MIN_TIMESLICE)
+ timeslice = MIN_TIMESLICE;
+
+ return timeslice;
+}
+
+/*
+ * task_priority: calculates a task's priority based on previous running
+ * history (see add_task_time). The priority is just a simple linear function
+ * based on sleep_avg and static_prio.
+ */
+static unsigned long task_priority(task_t *p)
+{
+ unsigned int bonus, prio;

if (rt_task(p))
return p->prio;

- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ bonus = ((MAX_USER_PRIO / 3) * p->sleep_avg + (SLEEP_FACTOR / 2)) / SLEEP_FACTOR;
+ prio = USER_PRIO(p->static_prio) + 10;
+
+ prio = MAX_RT_PRIO + prio - bonus;

- prio = p->static_prio - bonus;
if (prio < MAX_RT_PRIO)
prio = MAX_RT_PRIO;
if (prio > MAX_PRIO-1)
prio = MAX_PRIO-1;
+
return prio;
}

+static inline int task_expired(task_t *p, runqueue_t *rq)
+{
+ unsigned long used = p->used_slice + (clock_us() - p->timestamp);
+ if (used >= task_timeslice(p, rq))
+ return 1;
+
+ return 0;
+}
/*
* __activate_task - move a task to the runqueue.
*/
@@ -378,82 +384,6 @@ static inline void __activate_task(task_
nr_running_inc(rq);
}

-static void recalc_task_prio(task_t *p, unsigned long long now)
-{
- unsigned long long __sleep_time = now - p->timestamp;
- unsigned long sleep_time;
-
- if (__sleep_time > NS_MAX_SLEEP_AVG)
- sleep_time = NS_MAX_SLEEP_AVG;
- else
- sleep_time = (unsigned long)__sleep_time;
-
- if (likely(sleep_time > 0)) {
- /*
- * User tasks that sleep a long time are categorised as
- * idle and will get just interactive status to stay active &
- * prevent them suddenly becoming cpu hogs and starving
- * other processes.
- */
- if (p->mm && p->activated != -1 &&
- sleep_time > JUST_INTERACTIVE_SLEEP(p)){
- p->sleep_avg = JIFFIES_TO_NS(MAX_SLEEP_AVG -
- AVG_TIMESLICE);
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- } else {
- /*
- * The lower the sleep avg a task has the more
- * rapidly it will rise with sleep time.
- */
- sleep_time *= (MAX_BONUS - CURRENT_BONUS(p)) ? : 1;
-
- /*
- * Tasks with low interactive_credit are limited to
- * one timeslice worth of sleep avg bonus.
- */
- if (LOW_CREDIT(p) &&
- sleep_time > JIFFIES_TO_NS(task_timeslice(p)))
- sleep_time =
- JIFFIES_TO_NS(task_timeslice(p));
-
- /*
- * Non high_credit tasks waking from uninterruptible
- * sleep are limited in their sleep_avg rise as they
- * are likely to be cpu hogs waiting on I/O
- */
- if (p->activated == -1 && !HIGH_CREDIT(p) && p->mm){
- if (p->sleep_avg >= JUST_INTERACTIVE_SLEEP(p))
- sleep_time = 0;
- else if (p->sleep_avg + sleep_time >=
- JUST_INTERACTIVE_SLEEP(p)){
- p->sleep_avg =
- JUST_INTERACTIVE_SLEEP(p);
- sleep_time = 0;
- }
- }
-
- /*
- * This code gives a bonus to interactive tasks.
- *
- * The boost works by updating the 'average sleep time'
- * value here, based on ->timestamp. The more time a task
- * spends sleeping, the higher the average gets - and the
- * higher the priority boost gets as well.
- */
- p->sleep_avg += sleep_time;
-
- if (p->sleep_avg > NS_MAX_SLEEP_AVG){
- p->sleep_avg = NS_MAX_SLEEP_AVG;
- if (!HIGH_CREDIT(p))
- p->interactive_credit++;
- }
- }
- }
-
- p->prio = effective_prio(p);
-}
-
/*
* activate_task - move a task to the runqueue and do priority recalculation
*
@@ -462,32 +392,26 @@ static void recalc_task_prio(task_t *p,
*/
static inline void activate_task(task_t *p, runqueue_t *rq)
{
- unsigned long long now = sched_clock();
+ unsigned long now = clock_us();
+ unsigned long sleep = now - p->timestamp;
+ p->timestamp = now;
+
+ add_task_time(p, sleep, STIME_SLEEP);

- recalc_task_prio(p, now);
+ p->prio = task_priority(p);

/*
- * This checks to make sure it's not an uninterruptible task
- * that is now waking up.
+ * If we have slept through an active/expired array switch, restart
+ * our timeslice too.
*/
- if (!p->activated){
- /*
- * Tasks which were woken up by interrupts (ie. hw events)
- * are most likely of interactive nature. So we give them
- * the credit of extending their sleep time to the period
- * of time they spend on the runqueue, waiting for execution
- * on a CPU, first time around:
- */
- if (in_interrupt())
- p->activated = 2;
- else
- /*
- * Normal first-time wakeups get a credit too for on-runqueue
- * time, but it will be weighted down:
- */
- p->activated = 1;
- }
- p->timestamp = now;
+ if (rq->array_sequence != p->array_sequence) {
+ p->first_time_slice = 0;
+ p->used_slice = 0;
+ } else if (p->used_slice >= task_timeslice(p, rq)) {
+ enqueue_task(p, rq->expired);
+ nr_running_inc(rq);
+ return;
+ }

__activate_task(p, rq);
}
@@ -497,6 +421,7 @@ static inline void activate_task(task_t
*/
static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
{
+ p->array_sequence = rq->array_sequence;
nr_running_dec(rq);
if (p->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
@@ -638,18 +563,10 @@ repeat_lock_task:
task_rq_unlock(rq, &flags);
goto repeat_lock_task;
}
- if (old_state == TASK_UNINTERRUPTIBLE){
+ if (old_state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible--;
- /*
- * Tasks on involuntary sleep don't earn
- * sleep_avg beyond just interactive state.
- */
- p->activated = -1;
- }
- if (sync && (task_cpu(p) == smp_processor_id()))
- __activate_task(p, rq);
- else {
- activate_task(p, rq);
+ activate_task(p, rq);
+ if (!sync) {
if (TASK_PREEMPTS_CURR(p, rq))
resched_task(rq->curr);
}
@@ -674,42 +591,94 @@ int wake_up_state(task_t *p, unsigned in
}

/*
+ * Perform scheduler related accounting for a newly forked process @p.
+ * @p is forked by current.
+ */
+void sched_fork(task_t *p)
+{
+ unsigned long ts, left;
+ unsigned long flags;
+ runqueue_t *rq;
+
+ /*
+ * Share the timeslice between parent and child, thus the
+ * total amount of pending timeslices in the system doesn't change,
+ * resulting in more scheduling fairness.
+ */
+ local_irq_disable();
+ p->timestamp = clock_us();
+ rq = task_rq_lock(current, &flags);
+ ts = task_timeslice(current, rq);
+ task_rq_unlock(rq, &flags);
+
+ /*
+ * Share half our timeslice with the child.
+ */
+ left = (current->used_slice + (clock_us() - current->timestamp));
+ if (left > ts)
+ left = 0;
+ else
+ left = ts - left;
+ p->used_slice = left / 2;
+ current->used_slice += (left + 1) / 2;
+
+ /*
+ * The remainder of the first timeslice might be recovered by
+ * the parent if the child exits early enough.
+ */
+ p->first_time_slice = 1;
+ if (unlikely(current->used_slice >= ts)) {
+ /*
+ * This case is rare, it happens when the parent has only
+ * a single jiffy left from its timeslice. Taking the
+ * runqueue lock is not a problem.
+ */
+ preempt_disable();
+ scheduler_tick(0, 0);
+ local_irq_enable();
+ preempt_enable();
+ } else
+ local_irq_enable();
+}
+
+/*
* wake_up_forked_process - wake up a freshly forked process.
*
* This function will do some initial scheduler statistics housekeeping
* that must be done for every newly created process.
*/
-void wake_up_forked_process(task_t * p)
+void wake_up_forked_process(task_t *p)
{
unsigned long flags;
runqueue_t *rq = task_rq_lock(current, &flags);

p->state = TASK_RUNNING;
+
+ set_task_cpu(p, smp_processor_id());
+
/*
- * We decrease the sleep average of forking parents
- * and children as well, to keep max-interactive tasks
- * from forking tasks that are max-interactive.
+ * Get only a quarter of the parent's history. Limited by MIN_HISTORY.
*/
- current->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(current) *
- PARENT_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+ p->total_time = current->total_time / 4;
+ p->sleep_time = current->sleep_time / 4;
+ p->sleep_avg = current->sleep_avg;

- p->sleep_avg = JIFFIES_TO_NS(CURRENT_BONUS(p) *
- CHILD_PENALTY / 100 * MAX_SLEEP_AVG / MAX_BONUS);
+ if (p->total_time < MIN_HISTORY) {
+ p->total_time = MIN_HISTORY;
+ p->sleep_time = p->total_time * p->sleep_avg / SLEEP_FACTOR;
+ }

- p->interactive_credit = 0;
+ /*
+ * Lose 1/4 sleep_time for forking.
+ */
+ current->sleep_time = 3 * current->sleep_time / 4;
+ if (current->total_time != 0)
+ current->sleep_avg = (SLEEP_FACTOR * current->sleep_time)
+ / current->total_time;

- p->prio = effective_prio(p);
- set_task_cpu(p, smp_processor_id());
+ p->prio = task_priority(p);
+ __activate_task(p, rq);

- if (unlikely(!current->array))
- __activate_task(p, rq);
- else {
- p->prio = current->prio;
- list_add_tail(&p->run_list, &current->run_list);
- p->array = current->array;
- p->array->nr_active++;
- nr_running_inc(rq);
- }
task_rq_unlock(rq, &flags);
}

@@ -727,20 +696,28 @@ void sched_exit(task_t * p)
unsigned long flags;

local_irq_save(flags);
+
+ /* Regain the unused timeslice given to @p by its parent */
if (p->first_time_slice) {
- p->parent->time_slice += p->time_slice;
- if (unlikely(p->parent->time_slice > MAX_TIMESLICE))
- p->parent->time_slice = MAX_TIMESLICE;
+ unsigned long ts;
+ unsigned long flags;
+ runqueue_t *rq;
+ rq = task_rq_lock(p, &flags);
+ ts = task_timeslice(p, rq);
+ if (ts > p->used_slice)
+ p->parent->used_slice -= ts - p->used_slice;
+ task_rq_unlock(rq, &flags);
}
+
+ /* Apply some penalty to @p's parent if @p used a lot of CPU */
+ if (p->sleep_avg < p->parent->sleep_avg) {
+ add_task_time(p->parent,
+ MAX_SLEEP * (p->parent->sleep_avg - p->sleep_avg)
+ / SLEEP_FACTOR / 2,
+ STIME_RUN);
+ }
+
local_irq_restore(flags);
- /*
- * If the child was a (relative-) CPU hog then decrease
- * the sleep_avg of the parent as well.
- */
- if (p->sleep_avg < p->parent->sleep_avg)
- p->parent->sleep_avg = p->parent->sleep_avg /
- (EXIT_WEIGHT + 1) * EXIT_WEIGHT + p->sleep_avg /
- (EXIT_WEIGHT + 1);
}

/**
@@ -910,8 +887,158 @@ static inline void double_rq_unlock(runq
spin_unlock(&rq2->lock);
}

+#ifdef CONFIG_SMP
+static inline unsigned long get_cpu_load(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ runqueue_t *this_rq = this_rq();
+ unsigned long nr = FPT * rq->nr_running, load = this_rq->cpu_load[cpu];
+ unsigned long ret = (nr + load) / 2;
+
+ this_rq->cpu_load[cpu] = ret;
+
+ return ret;
+}
+
+static inline unsigned long __get_low_cpu_load(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ runqueue_t *this_rq = this_rq();
+ unsigned long nr = FPT * rq->nr_running, load = this_rq->cpu_load[cpu];
+ return min(nr, load);
+}
+
+static inline unsigned long __get_high_cpu_load(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ runqueue_t *this_rq = this_rq();
+ unsigned long nr = FPT * rq->nr_running, load = this_rq->cpu_load[cpu];
+ return max(nr, load);
+}
+
+static inline unsigned long get_low_cpu_load(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ runqueue_t *this_rq = this_rq();
+ unsigned long nr = FPT * rq->nr_running, load = this_rq->cpu_load[cpu];
+ unsigned long ret = min(nr, load);
+
+ this_rq->cpu_load[cpu] = (nr + load) / 2;
+
+ return ret;
+}
+
+static inline unsigned long get_high_cpu_load(int cpu)
+{
+ runqueue_t *rq = cpu_rq(cpu);
+ runqueue_t *this_rq = this_rq();
+ unsigned long nr = FPT * rq->nr_running, load = this_rq->cpu_load[cpu];
+ unsigned long ret = max(nr, load);
+
+ this_rq->cpu_load[cpu] = (nr + load) / 2;
+
+ return ret;
+}
+#endif
+
#ifdef CONFIG_NUMA
/*
+ * Find the busiest node.
+ */
+static int find_busiest_node(int this_node, unsigned long *imbalance)
+{
+ unsigned long node_loads[MAX_NUMNODES];
+ unsigned long nr_nodes = 0, avg_load = 0, max_load = 0;
+ int i, node = -1;
+
+ if (!nr_cpus_node(this_node))
+ return node;
+
+ for_each_node_with_cpus(i)
+ node_loads[i] = 0;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ int n;
+ if (!cpu_online(i))
+ continue;
+
+ n = cpu_to_node(i);
+ if (n == this_node)
+ node_loads[n] += get_cpu_load(i);
+ else
+ node_loads[n] += get_low_cpu_load(i);
+ }
+
+ for_each_node_with_cpus(i) {
+ node_loads[i] /= nr_cpus_node(i);
+ nr_nodes++;
+ avg_load += node_loads[i];
+
+ if (i == this_node)
+ continue;
+
+ if (max_load < node_loads[i]) {
+ max_load = node_loads[i];
+ node = i;
+ }
+ }
+ avg_load /= nr_nodes;
+
+ if (node_loads[this_node] >= avg_load ||
+ 100*max_load <= NODE_BALANCE_THRESHOLD*node_loads[this_node])
+ return -1;
+
+ *imbalance = min(max_load - avg_load, avg_load - node_loads[this_node]);
+ if (*imbalance < 1*FPT && (max_load - node_loads[this_node]) > 1*FPT)
+ *imbalance = 1*FPT * min(nr_cpus_node(node),
+ nr_cpus_node(this_node));
+ else
+ *imbalance = min((max_load - avg_load) * nr_cpus_node(node),
+ (avg_load - node_loads[this_node])
+ * nr_cpus_node(this_node) );
+ *imbalance = (*imbalance + FPT/2) / FPT;
+
+ return node;
+}
+
+/*
+ * Find the least busy node.
+ */
+static int find_best_node(int this_node)
+{
+ unsigned long node_loads[MAX_NUMNODES];
+ unsigned long min_load = INT_MAX;
+ int i, node = this_node;
+
+ for_each_node_with_cpus(i)
+ node_loads[i] = 0;
+
+ for (i = 0; i < NR_CPUS; i++) {
+ int n;
+ if (!cpu_online(i))
+ continue;
+
+ n = cpu_to_node(i);
+ if (n == this_node)
+ node_loads[n] += get_cpu_load(i);
+ else
+ node_loads[n] += get_low_cpu_load(i);
+ }
+
+ for_each_node_with_cpus(i) {
+ node_loads[i] /= nr_cpus_node(i);
+
+ if (min_load > node_loads[i] + NODE_BALANCE_THRESHOLD*FPT/100 ||
+ (min_load > node_loads[i] && i == this_node)) {
+ min_load = node_loads[i];
+ node = i;
+ }
+ }
+
+ return node;
+}
+
+/*
* If dest_cpu is allowed for this process, migrate the task to it.
* This is accomplished by forcing the cpu_allowed mask to only
* allow dest_cpu, which will force the cpu onto dest_cpu. Then
@@ -937,88 +1064,47 @@ static void sched_migrate_task(task_t *p
*/
static int sched_best_cpu(struct task_struct *p)
{
- int i, minload, load, best_cpu, node = 0;
+ int i, min_load, best_cpu = task_cpu(p), node;
cpumask_t cpumask;

- best_cpu = task_cpu(p);
- if (cpu_rq(best_cpu)->nr_running <= 2)
- return best_cpu;
-
- minload = 10000000;
- for_each_node_with_cpus(i) {
- /*
- * Node load is always divided by nr_cpus_node to normalise
- * load values in case cpu count differs from node to node.
- * We first multiply node_nr_running by 10 to get a little
- * better resolution.
- */
- load = 10 * atomic_read(&node_nr_running[i]) / nr_cpus_node(i);
- if (load < minload) {
- minload = load;
- node = i;
- }
- }
+ node = find_best_node(cpu_to_node(task_cpu(p)));

- minload = 10000000;
+ min_load = INT_MAX;
cpumask = node_to_cpumask(node);
- for (i = 0; i < NR_CPUS; ++i) {
+ for (i = 0; i < NR_CPUS; i++) {
+ unsigned long load;
if (!cpu_isset(i, cpumask))
continue;
- if (cpu_rq(i)->nr_running < minload) {
+ if (i == task_cpu(p))
+ load = get_low_cpu_load(i);
+ else
+ load = get_high_cpu_load(i) + FPT;
+ if (min_load > load) {
best_cpu = i;
- minload = cpu_rq(i)->nr_running;
+ min_load = load;
}
}
return best_cpu;
}

+#define EXEC_BALANCE_INTERVAL 8
void sched_balance_exec(void)
{
int new_cpu;

if (numnodes > 1) {
- new_cpu = sched_best_cpu(current);
- if (new_cpu != smp_processor_id())
- sched_migrate_task(current, new_cpu);
- }
-}
-
-/*
- * Find the busiest node. All previous node loads contribute with a
- * geometrically deccaying weight to the load measure:
- * load_{t} = load_{t-1}/2 + nr_node_running_{t}
- * This way sudden load peaks are flattened out a bit.
- * Node load is divided by nr_cpus_node() in order to compare nodes
- * of different cpu count but also [first] multiplied by 10 to
- * provide better resolution.
- */
-static int find_busiest_node(int this_node)
-{
- int i, node = -1, load, this_load, maxload;
-
- if (!nr_cpus_node(this_node))
- return node;
- this_load = maxload = (this_rq()->prev_node_load[this_node] >> 1)
- + (10 * atomic_read(&node_nr_running[this_node])
- / nr_cpus_node(this_node));
- this_rq()->prev_node_load[this_node] = this_load;
- for_each_node_with_cpus(i) {
- if (i == this_node)
- continue;
- load = (this_rq()->prev_node_load[i] >> 1)
- + (10 * atomic_read(&node_nr_running[i])
- / nr_cpus_node(i));
- this_rq()->prev_node_load[i] = load;
- if (load > maxload && (100*load > NODE_THRESHOLD*this_load)) {
- maxload = load;
- node = i;
+ int this_cpu = smp_processor_id();
+ runqueue_t *this_rq = cpu_rq(this_cpu);
+ this_rq->nr_exec++;
+ if (unlikely(!(this_rq->nr_exec % EXEC_BALANCE_INTERVAL))) {
+ new_cpu = sched_best_cpu(current);
+ if (new_cpu != this_cpu)
+ sched_migrate_task(current, new_cpu);
}
}
- return node;
}

-#endif /* CONFIG_NUMA */
-
+#endif
#ifdef CONFIG_SMP

/*
@@ -1027,31 +1113,28 @@ static int find_busiest_node(int this_no
* this_rq is locked already. Recalculate nr_running if we have to
* drop the runqueue lock.
*/
-static inline unsigned int double_lock_balance(runqueue_t *this_rq,
- runqueue_t *busiest, int this_cpu, int idle, unsigned int nr_running)
+static inline void double_lock_balance(runqueue_t *this_rq,
+ runqueue_t *busiest, int this_cpu)
{
if (unlikely(!spin_trylock(&busiest->lock))) {
if (busiest < this_rq) {
spin_unlock(&this_rq->lock);
spin_lock(&busiest->lock);
spin_lock(&this_rq->lock);
- /* Need to recalculate nr_running */
- if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
- nr_running = this_rq->nr_running;
- else
- nr_running = this_rq->prev_cpu_load[this_cpu];
} else
spin_lock(&busiest->lock);
}
- return nr_running;
}

/*
* find_busiest_queue - find the busiest runqueue among the cpus in cpumask.
*/
-static inline runqueue_t *find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle, int *imbalance, cpumask_t cpumask)
+static inline runqueue_t *
+find_busiest_queue(runqueue_t *this_rq, int this_cpu, int idle,
+ unsigned long *imbalance, cpumask_t cpumask, int local)
{
- int nr_running, load, max_load, i;
+ unsigned long this_load, load, max_load, avg_load;
+ int nr_cpus, i;
runqueue_t *busiest, *rq_src;

/*
@@ -1077,49 +1160,80 @@ static inline runqueue_t *find_busiest_q
* that case we are less picky about moving a task across CPUs and
* take what can be taken.
*/
- if (idle || (this_rq->nr_running > this_rq->prev_cpu_load[this_cpu]))
- nr_running = this_rq->nr_running;
+
+ if (idle == 2)
+ this_load = __get_high_cpu_load(this_cpu);
else
- nr_running = this_rq->prev_cpu_load[this_cpu];
+ this_load = get_high_cpu_load(this_cpu);

busiest = NULL;
- max_load = 1;
+ max_load = this_load;
+ avg_load = this_load;
+ nr_cpus = 1;
+
+ if (idle)
+ max_load = 0;
+
for (i = 0; i < NR_CPUS; i++) {
if (!cpu_isset(i, cpumask))
continue;

+ if (i == this_cpu)
+ continue;
+
rq_src = cpu_rq(i);
- if (idle || (rq_src->nr_running < this_rq->prev_cpu_load[i]))
- load = rq_src->nr_running;
+ if (idle == 2)
+ load = __get_low_cpu_load(i);
else
- load = this_rq->prev_cpu_load[i];
- this_rq->prev_cpu_load[i] = rq_src->nr_running;
+ load = get_low_cpu_load(i);
+
+ nr_cpus++;
+ avg_load += load;

- if ((load > max_load) && (rq_src != this_rq)) {
+ if (load > max_load && rq_src != this_rq) {
busiest = rq_src;
max_load = load;
}
}

if (likely(!busiest))
- goto out;
+ goto out_balance;

- *imbalance = max_load - nr_running;
+ avg_load /= nr_cpus;
+ if (!idle && this_load >= avg_load) {
+ busiest = NULL;
+ goto out_balance;
+ }

- /* It needs an at least ~25% imbalance to trigger balancing. */
- if (!idle && ((*imbalance)*4 < max_load)) {
+ if (!idle && 100*max_load <= CPU_BALANCE_THRESHOLD*this_load) {
busiest = NULL;
- goto out;
+ goto out_balance;
}

- nr_running = double_lock_balance(this_rq, busiest, this_cpu, idle, nr_running);
- /*
- * Make sure nothing changed since we checked the
- * runqueue length.
- */
- if (busiest->nr_running <= nr_running) {
+ double_lock_balance(this_rq, busiest, this_cpu);
+
+ if (busiest->nr_running <= 1) {
spin_unlock(&busiest->lock);
busiest = NULL;
+ if (local)
+ this_rq->nr_lb_failed++;
+ goto out;
+ }
+
+ *imbalance = min(max_load - avg_load, avg_load - this_load);
+ if ( (*imbalance < 1*FPT) && (max_load - this_load) > 1*FPT )
+ *imbalance = 1*FPT;
+ *imbalance = (*imbalance + FPT - 1) / FPT;
+
+ if (*imbalance == 0)
+ *imbalance = 1;
+
+out_balance:
+ if (local) {
+ if (idle)
+ this_rq->nr_lb_failed++;
+ else
+ this_rq->nr_lb_failed = 0;
}
out:
return busiest;
@@ -1131,11 +1245,17 @@ out:
*/
static inline void pull_task(runqueue_t *src_rq, prio_array_t *src_array, task_t *p, runqueue_t *this_rq, int this_cpu)
{
+ unsigned long now = clock_us();
+
dequeue_task(p, src_array);
nr_running_dec(src_rq);
set_task_cpu(p, this_cpu);
nr_running_inc(this_rq);
enqueue_task(p, this_rq->active);
+
+ add_task_time(p, now - p->timestamp, STIME_WAIT);
+ p->timestamp = now;
+
/*
* Note that idle threads have a prio of MAX_PRIO, for this test
* to be always true for them.
@@ -1144,26 +1264,35 @@ static inline void pull_task(runqueue_t
set_need_resched();
}

+#define CACHE_DECAY_US 5000
/*
- * Previously:
- *
- * #define CAN_MIGRATE_TASK(p,rq,this_cpu) \
- * ((!idle || (NS_TO_JIFFIES(now - (p)->timestamp) > \
- * cache_decay_ticks)) && !task_running(rq, p) && \
- * cpu_isset(this_cpu, (p)->cpus_allowed))
+ * can_migrate_task
+ * May task @p from runqueue @rq be migrated to @this_cpu?
+ * Returns: 1 if @p may be migrated, 0 otherwise.
*/
-
static inline int
-can_migrate_task(task_t *tsk, runqueue_t *rq, int this_cpu, int idle)
+can_migrate_task(task_t *p, runqueue_t *rq, int this_cpu, int aggressive)
{
- unsigned long delta = sched_clock() - tsk->timestamp;
+ unsigned long delta;

- if (!idle && (delta <= JIFFIES_TO_NS(cache_decay_ticks)))
- return 0;
- if (task_running(rq, tsk))
+ /*
+ * We do not migrate tasks that are:
+ * 1) running (obviously), or
+ * 2) cannot be migrated to this CPU due to cpus_allowed, or
+ * 3) are cache-hot on their current CPU.
+ */
+
+ if (task_running(rq, p))
return 0;
- if (!cpu_isset(this_cpu, tsk->cpus_allowed))
+
+ if (!cpu_isset(this_cpu, p->cpus_allowed))
return 0;
+
+ /* Aggressive migration if we've failed a balance */
+ delta = clock_us() - p->timestamp;
+ if (!aggressive && delta <= CACHE_DECAY_US)
+ return 0;
+
return 1;
}

@@ -1175,23 +1304,19 @@ can_migrate_task(task_t *tsk, runqueue_t
* We call this with the current runqueue locked,
* irqs disabled.
*/
-static void load_balance(runqueue_t *this_rq, int idle, cpumask_t cpumask)
+static void load_balance(runqueue_t *this_rq, runqueue_t *busiest, unsigned long max_nr_move, int local)
{
- int imbalance, idx, this_cpu = smp_processor_id();
- runqueue_t *busiest;
+ int aggressive = 0;
+ int idx, this_cpu = smp_processor_id();
+ int pulled = 0;
prio_array_t *array;
struct list_head *head, *curr;
task_t *tmp;

- busiest = find_busiest_queue(this_rq, this_cpu, idle, &imbalance, cpumask);
- if (!busiest)
- goto out;
-
- /*
- * We only want to steal a number of tasks equal to 1/2 the imbalance,
- * otherwise we'll just shift the imbalance to the new queue:
- */
- imbalance /= 2;
+ if (max_nr_move <= 0) {
+ spin_unlock(&busiest->lock);
+ return;
+ }

/*
* We first consider expired tasks. Those will likely not be
@@ -1199,6 +1324,7 @@ static void load_balance(runqueue_t *thi
* be cache-cold, thus switching CPUs has the least effect
* on them.
*/
+again:
if (busiest->expired->nr_active)
array = busiest->expired;
else
@@ -1217,6 +1343,10 @@ skip_bitmap:
array = busiest->active;
goto new_array;
}
+ if (!aggressive) {
+ aggressive = 1;
+ goto again;
+ }
goto out_unlock;
}

@@ -1225,23 +1355,23 @@ skip_bitmap:
skip_queue:
tmp = list_entry(curr, task_t, run_list);

- /*
- * We do not migrate tasks that are:
- * 1) running (obviously), or
- * 2) cannot be migrated to this CPU due to cpus_allowed, or
- * 3) are cache-hot on their current CPU.
- */
-
curr = curr->prev;

- if (!can_migrate_task(tmp, busiest, this_cpu, idle)) {
+ if (!can_migrate_task(tmp, busiest, this_cpu, aggressive)) {
if (curr != head)
goto skip_queue;
idx++;
goto skip_bitmap;
}
pull_task(busiest, array, tmp, this_rq, this_cpu);
- if (!idle && --imbalance) {
+ pulled++;
+
+ /*
+ * We only want to steal a number of tasks equal to 1/2 the imbalance,
+ * otherwise we'll just shift the imbalance to the new queue.
+ * Only migrate 1 task if we're idle.
+ */
+ if (pulled < max_nr_move) {
if (curr != head)
goto skip_queue;
idx++;
@@ -1249,10 +1379,40 @@ skip_queue:
}
out_unlock:
spin_unlock(&busiest->lock);
-out:
- ;
+
+ if (local) {
+ if(pulled == 0)
+ this_rq->nr_lb_failed++;
+ else
+ this_rq->nr_lb_failed = 0;
+ }
}

+#ifdef CONFIG_NUMA
+static void node_balance(int this_cpu, runqueue_t *this_rq, unsigned long max_nr_move)
+{
+ unsigned long nr_move;
+ runqueue_t *busiest;
+ cpumask_t cpumask;
+ unsigned long imbalance;
+ int node = find_busiest_node(cpu_to_node(this_cpu), &imbalance);
+ nr_move = min(imbalance, max_nr_move);
+
+ if (node >= 0 && nr_move > 0) {
+ cpumask = node_to_cpumask(node);
+ spin_lock(&this_rq->lock);
+ busiest = find_busiest_queue(this_rq, this_cpu, 0,
+ &imbalance, cpumask, 0);
+ if (busiest) {
+ nr_move = min(nr_move, imbalance);
+ load_balance(this_rq, busiest, nr_move, 0);
+ }
+
+ spin_unlock(&this_rq->lock);
+ }
+}
+#endif
+
/*
* One of the idle_cpu_tick() and busy_cpu_tick() functions will
* get called every timer tick, on every CPU. Our balancing action
@@ -1264,31 +1424,17 @@ out:
*
* On NUMA, do a node-rebalance every 400 msecs.
*/
-#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
-#define BUSY_REBALANCE_TICK (HZ/5 ?: 1)
-#define IDLE_NODE_REBALANCE_TICK (IDLE_REBALANCE_TICK * 5)
-#define BUSY_NODE_REBALANCE_TICK (BUSY_REBALANCE_TICK * 2)
-
-#ifdef CONFIG_NUMA
-static void balance_node(runqueue_t *this_rq, int idle, int this_cpu)
-{
- int node = find_busiest_node(cpu_to_node(this_cpu));
-
- if (node >= 0) {
- cpumask_t cpumask = node_to_cpumask(node);
- cpu_set(this_cpu, cpumask);
- spin_lock(&this_rq->lock);
- load_balance(this_rq, idle, cpumask);
- spin_unlock(&this_rq->lock);
- }
-}
-#endif
+#define IDLE_REBALANCE_TICK (HZ/1000 ?: 1)
+#define BUSY_REBALANCE_TICK (HZ/4 ?: 1)
+#define NUMA_REBALANCE_TICK (HZ/2 ?: 1)
+
+/* Don't have all balancing operations going off at once */
+#define BUSY_CPU_REBALANCE(cpu) (cpu * BUSY_REBALANCE_TICK / NR_CPUS)
+#define NUMA_CPU_REBALANCE(cpu) (cpu * NUMA_REBALANCE_TICK / NR_CPUS)

static void rebalance_tick(runqueue_t *this_rq, int idle)
{
-#ifdef CONFIG_NUMA
int this_cpu = smp_processor_id();
-#endif
unsigned long j = jiffies;

/*
@@ -1299,26 +1445,28 @@ static void rebalance_tick(runqueue_t *t
* node with the current CPU. (ie. other CPUs in the local node
* are not balanced.)
*/
- if (idle) {
-#ifdef CONFIG_NUMA
- if (!(j % IDLE_NODE_REBALANCE_TICK))
- balance_node(this_rq, idle, this_cpu);
-#endif
- if (!(j % IDLE_REBALANCE_TICK)) {
- spin_lock(&this_rq->lock);
- load_balance(this_rq, idle, cpu_to_node_mask(this_cpu));
- spin_unlock(&this_rq->lock);
- }
- return;
- }
+
#ifdef CONFIG_NUMA
- if (!(j % BUSY_NODE_REBALANCE_TICK))
- balance_node(this_rq, idle, this_cpu);
+ if ((j % NUMA_REBALANCE_TICK) == NUMA_CPU_REBALANCE(this_cpu))
+ node_balance(this_cpu, this_rq, INT_MAX);
#endif
- if (!(j % BUSY_REBALANCE_TICK)) {
+
+ if ((idle && !(j % IDLE_REBALANCE_TICK))
+ || (j % BUSY_REBALANCE_TICK) == BUSY_CPU_REBALANCE(this_cpu)) {
+ runqueue_t *busiest;
+ unsigned long imbalance;
+ cpumask_t cpumask = cpu_to_node_mask(this_cpu);
spin_lock(&this_rq->lock);
- load_balance(this_rq, idle, cpu_to_node_mask(this_cpu));
+ busiest = find_busiest_queue(this_rq, this_cpu, idle,
+ &imbalance, cpumask, 1);
+ if (busiest)
+ load_balance(this_rq, busiest, imbalance, 1);
spin_unlock(&this_rq->lock);
+
+#ifdef CONFIG_NUMA
+ if (unlikely(this_rq->nr_lb_failed >= NUMA_FACTOR_BONUS))
+ node_balance(this_cpu, this_rq, INT_MAX);
+#endif
}
}
#else
@@ -1335,20 +1483,6 @@ DEFINE_PER_CPU(struct kernel_stat, kstat
EXPORT_PER_CPU_SYMBOL(kstat);

/*
- * We place interactive tasks back into the active array, if possible.
- *
- * To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks:
- */
-#define EXPIRED_STARVING(rq) \
- (STARVATION_LIMIT && ((rq)->expired_timestamp && \
- (jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1)))
-
-/*
* This function gets called by the timer code, with HZ frequency.
* We call it with interrupts disabled.
*
@@ -1365,17 +1499,11 @@ void scheduler_tick(int user_ticks, int
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_ticks);

- /* note: this timer irq context must be accounted for as well */
- if (hardirq_count() - HARDIRQ_OFFSET) {
- cpustat->irq += sys_ticks;
- sys_ticks = 0;
- } else if (softirq_count()) {
- cpustat->softirq += sys_ticks;
- sys_ticks = 0;
- }
-
if (p == rq->idle) {
- if (atomic_read(&rq->nr_iowait) > 0)
+ /* note: this timer irq context must be accounted for as well */
+ if (irq_count() - HARDIRQ_OFFSET >= SOFTIRQ_OFFSET)
+ cpustat->system += sys_ticks;
+ else if (atomic_read(&rq->nr_iowait) > 0)
cpustat->iowait += sys_ticks;
else
cpustat->idle += sys_ticks;
@@ -1398,65 +1526,22 @@ void scheduler_tick(int user_ticks, int
* The task was running during this tick - update the
* time slice counter. Note: we do not update a thread's
* priority until it either goes to sleep or uses up its
- * timeslice. This makes it possible for interactive tasks
- * to use up their timeslices at their highest priority levels.
+ * timeslice.
*/
if (unlikely(rt_task(p))) {
/*
* RR tasks need a special form of timeslice management.
* FIFO tasks have no timeslices.
*/
- if ((p->policy == SCHED_RR) && !--p->time_slice) {
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;
- set_tsk_need_resched(p);
-
- /* put it at the end of the queue: */
- dequeue_task(p, rq->active);
- enqueue_task(p, rq->active);
+ if (p->policy == SCHED_RR) {
+ if (task_expired(p, rq))
+ set_tsk_need_resched(p);
}
goto out_unlock;
}
- if (!--p->time_slice) {
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- p->time_slice = task_timeslice(p);
- p->first_time_slice = 0;

- if (!rq->expired_timestamp)
- rq->expired_timestamp = jiffies;
- if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
- enqueue_task(p, rq->expired);
- } else
- enqueue_task(p, rq->active);
- } else {
- /*
- * Prevent a too long timeslice allowing a task to monopolize
- * the CPU. We do this by splitting up the timeslice into
- * smaller pieces.
- *
- * Note: this does not mean the task's timeslices expire or
- * get lost in any way, they just might be preempted by
- * another task of equal priority. (one with higher
- * priority would have preempted this task already.) We
- * requeue this task to the end of the list on this priority
- * level, which is in essence a round-robin of tasks with
- * equal priority.
- *
- * This only applies to tasks in the interactive
- * delta range with at least TIMESLICE_GRANULARITY to requeue.
- */
- if (TASK_INTERACTIVE(p) && !((task_timeslice(p) -
- p->time_slice) % TIMESLICE_GRANULARITY(p)) &&
- (p->time_slice >= TIMESLICE_GRANULARITY(p)) &&
- (p->array == rq->active)) {
-
- dequeue_task(p, rq->active);
- set_tsk_need_resched(p);
- p->prio = effective_prio(p);
- enqueue_task(p, rq->active);
- }
+ if (task_expired(p, rq)) {
+ set_tsk_need_resched(p);
}
out_unlock:
spin_unlock(&rq->lock);
@@ -1475,7 +1560,7 @@ asmlinkage void schedule(void)
runqueue_t *rq;
prio_array_t *array;
struct list_head *queue;
- unsigned long long now;
+ unsigned long now;
unsigned long run_time;
int idx;

@@ -1484,11 +1569,10 @@ asmlinkage void schedule(void)
* schedule() atomically, we ignore that path for now.
* Otherwise, whine if we are scheduling when we should not be.
*/
- if (likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) {
- if (unlikely(in_atomic())) {
- printk(KERN_ERR "bad: scheduling while atomic!\n");
- dump_stack();
- }
+ if (unlikely(in_atomic()) &&
+ likely(!(current->state & (TASK_DEAD | TASK_ZOMBIE)))) {
+ printk(KERN_ERR "bad: scheduling while atomic!\n");
+ dump_stack();
}

need_resched:
@@ -1497,19 +1581,11 @@ need_resched:
rq = this_rq();

release_kernel_lock(prev);
- now = sched_clock();
- if (likely(now - prev->timestamp < NS_MAX_SLEEP_AVG))
- run_time = now - prev->timestamp;
- else
- run_time = NS_MAX_SLEEP_AVG;
-
- /*
- * Tasks with interactive credits get charged less run_time
- * at high sleep_avg to delay them losing their interactive
- * status
- */
- if (HIGH_CREDIT(prev))
- run_time /= (CURRENT_BONUS(prev) ? : 1);
+ now = clock_us();
+ run_time = now - prev->timestamp;
+ prev->timestamp = now;
+ add_task_time(prev, run_time, STIME_RUN);
+ prev->used_slice += run_time;

spin_lock_irq(&rq->lock);

@@ -1521,14 +1597,37 @@ need_resched:
else
deactivate_task(prev, rq);
}
+ if (unlikely(prev->used_slice >= task_timeslice(prev, rq))) {
+ if (prev->array) {
+ prev->used_slice = 0;
+ prev->first_time_slice = 0;
+ if (unlikely(rt_task(prev)) &&
+ prev->policy == SCHED_RR) {
+ /* put it at the end of the queue: */
+ dequeue_task(prev, prev->array);
+ enqueue_task(prev, rq->active);
+ } else {
+ dequeue_task(prev, prev->array);
+ prev->prio = task_priority(prev);
+ enqueue_task(prev, rq->expired);
+ }
+ }
+ }

#ifdef CONFIG_SMP
- if (unlikely(!rq->nr_running))
- load_balance(rq, 1, cpu_to_node_mask(smp_processor_id()));
+ if (unlikely(!rq->nr_running)) {
+ unsigned long imbalance;
+ runqueue_t *busiest;
+ int cpu = smp_processor_id();
+ busiest = find_busiest_queue(rq, cpu, 2, &imbalance,
+ cpu_to_node_mask(cpu), 1);
+ if (busiest)
+ load_balance(rq, busiest, imbalance, 1);
+ }
#endif
if (unlikely(!rq->nr_running)) {
+ rq->array_sequence++;
next = rq->idle;
- rq->expired_timestamp = 0;
goto switch_tasks;
}

@@ -1537,49 +1636,30 @@ need_resched:
/*
* Switch the active and expired arrays.
*/
+ rq->array_sequence++;
rq->active = rq->expired;
rq->expired = array;
array = rq->active;
- rq->expired_timestamp = 0;
}

idx = sched_find_first_bit(array->bitmap);
queue = array->queue + idx;
next = list_entry(queue->next, task_t, run_list);

- if (next->activated > 0) {
- unsigned long long delta = now - next->timestamp;
-
- if (next->activated == 1)
- delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
-
- array = next->array;
- dequeue_task(next, array);
- recalc_task_prio(next, next->timestamp + delta);
- enqueue_task(next, array);
- }
- next->activated = 0;
switch_tasks:
prefetch(next);
clear_tsk_need_resched(prev);
RCU_qsctr(task_cpu(prev))++;

- prev->sleep_avg -= run_time;
- if ((long)prev->sleep_avg <= 0){
- prev->sleep_avg = 0;
- if (!(HIGH_CREDIT(prev) || LOW_CREDIT(prev)))
- prev->interactive_credit--;
- }
- prev->timestamp = now;
-
if (likely(prev != next)) {
+ add_task_time(next, now - next->timestamp, STIME_WAIT);
+ next->timestamp = now;
if (prev->state == TASK_RUNNING ||
- unlikely(preempt_count() & PREEMPT_ACTIVE))
+ unlikely(preempt_count() & PREEMPT_ACTIVE)) {
prev->nivcsw++;
- else
+ } else
prev->nvcsw++;

- next->timestamp = now;
rq->nr_switches++;
rq->curr = next;

@@ -1593,7 +1673,7 @@ switch_tasks:

reacquire_kernel_lock(current);
preempt_enable_no_resched();
- if (test_thread_flag(TIF_NEED_RESCHED))
+ if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
goto need_resched;
}

@@ -2401,6 +2481,8 @@ asmlinkage long sys_sched_rr_get_interva
int retval = -EINVAL;
struct timespec t;
task_t *p;
+ unsigned long flags;
+ runqueue_t *rq;

if (pid < 0)
goto out_nounlock;
@@ -2415,8 +2497,10 @@ asmlinkage long sys_sched_rr_get_interva
if (retval)
goto out_unlock;

+ rq = task_rq_lock(p, &flags);
jiffies_to_timespec(p->policy & SCHED_FIFO ?
- 0 : task_timeslice(p), &t);
+ 0 : US_TO_JIFFIES(task_timeslice(p, rq)), &t);
+ task_rq_unlock(rq, &flags);
read_unlock(&tasklist_lock);
retval = copy_to_user(interval, &t, sizeof(t)) ? -EFAULT : 0;
out_nounlock:
@@ -2706,12 +2790,11 @@ static int migration_call(struct notifie
unsigned long action,
void *hcpu)
{
- long cpu = (long) hcpu;
+ long cpu = (long)hcpu;
migration_startup_t startup;

switch (action) {
case CPU_ONLINE:
-
printk("Starting migration thread for cpu %li\n", cpu);

startup.cpu = cpu;
@@ -2811,7 +2894,7 @@ void __init sched_init(void)
spin_lock_init(&rq->lock);
INIT_LIST_HEAD(&rq->migration_queue);
atomic_set(&rq->nr_iowait, 0);
- nr_running_init(rq);
+ nr_running_init(i);

for (j = 0; j < 2; j++) {
array = rq->arrays + j;

_
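
As a companion to the rollup above, here is a small, self-contained
user-space model of its sleep-history bookkeeping and priority calculation
(add_task_time() restricted to the STIME_SLEEP case, plus task_priority()).
The constants are copied from the patch; the struct, helper names and main()
driver are illustrative assumptions, not kernel code.

--
#include <stdio.h>

#define MAX_RT_PRIO      100
#define MAX_PRIO         (MAX_RT_PRIO + 59)
#define MAX_USER_PRIO    (MAX_PRIO - MAX_RT_PRIO)     /* 59 */
#define MAX_SLEEP        (1000000 / 2)                /* usecs of history */
#define MAX_SLEEP_AFFECT (MAX_SLEEP / 4)
#define MIN_HISTORY      (MAX_SLEEP / 2)
#define SLEEP_FACTOR     1024                         /* fixed point */

struct task {
	unsigned long total_time, sleep_time, sleep_avg;
	unsigned long static_prio;
};

/* Decay the existing history, then credit the new sleep period
 * (the STIME_SLEEP branch of add_task_time() in the patch). */
static void add_sleep_time(struct task *p, unsigned long time)
{
	unsigned long ratio;

	if (time > MAX_SLEEP_AFFECT)
		time = MAX_SLEEP_AFFECT;
	ratio = MAX_SLEEP - time;
	p->total_time = ((unsigned long long)ratio * p->total_time
			 + MAX_SLEEP / 2) / MAX_SLEEP;
	p->sleep_time = ((unsigned long long)ratio * p->sleep_time
			 + MAX_SLEEP / 2) / MAX_SLEEP;
	p->total_time += time;
	p->sleep_time += time;
	p->sleep_avg = SLEEP_FACTOR * p->sleep_time / p->total_time;
	if (p->total_time < MIN_HISTORY) {
		p->total_time = MIN_HISTORY;
		p->sleep_time = p->total_time * p->sleep_avg / SLEEP_FACTOR;
	}
}

/* Linear priority: static priority plus 10, minus a bonus of up to
 * MAX_USER_PRIO/3 levels for tasks with a high sleep_avg. */
static unsigned long task_priority(const struct task *p)
{
	long bonus = ((MAX_USER_PRIO / 3) * p->sleep_avg
		      + SLEEP_FACTOR / 2) / SLEEP_FACTOR;
	long prio = MAX_RT_PRIO + (p->static_prio - MAX_RT_PRIO) + 10 - bonus;

	if (prio < MAX_RT_PRIO)
		prio = MAX_RT_PRIO;
	if (prio > MAX_PRIO - 1)
		prio = MAX_PRIO - 1;
	return prio;
}

int main(void)
{
	struct task p = {
		.total_time  = MIN_HISTORY,
		.sleep_time  = MIN_HISTORY / 2,
		.sleep_avg   = SLEEP_FACTOR / 2,
		.static_prio = MAX_RT_PRIO + 30,   /* nice 0 under the patch */
	};

	add_sleep_time(&p, 200000);   /* a 200ms sleep, clamped to 125ms */
	printf("sleep_avg=%lu prio=%lu\n", p.sleep_avg, task_priority(&p));
	return 0;
}
--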
