Re: Scheduler Bug (set_cpus_allowed)

From: Ingo Molnar (mingo@elte.hu)
Date: Mon Jun 10 2002 - 17:35:38 EST

Next message: Ian Stirling: "2.4.18 stv680.o crashes on unplugging while capturing video."
Previous message: Russell King: "Re: [PATCH] 2.5.21 kill warnings 5/19"
In reply to: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Next in thread: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Reply: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Mon, 10 Jun 2002, Mike Kravetz wrote:

> Ingo, I saw your patch to remove the frozen lock from the scheduler and
> agree this is the best way to go. Once this change is made, I think it
> is then safe to add a fast path for migration (to set_cpus_allowed) as:
>
> /*
> * If the task is not on a runqueue (and not running), then
> * it is sufficient to simply update the task's cpu field.
> */
> if (!p->array && (p != rq->curr)) {
> p->thread_info->cpu = __ffs(p->cpus_allowed);
> task_rq_unlock(rq, &flags);
> goto out;
> }

yes, if the task is absolutely not running and not even in any runqueue
then indeed we can do this - the scheduler doesnt even know that this task
exists.

The test for 'is this task running or runnable' is interesting. The
!p->array condition covers 99.9% of the cases. But the (p != rq->curr)
test is needed to shield against the following rare path: freshly
deactivated tasks that call into load_balance() might unlock the current
runqueue to aquire runqueue locks in order. We could get rid of the (p !=
rq->curr) test if we deactivated tasks *after* calling load_balance(), but
this is problematic because load_balance() needs to know whether the
current task is going to stay runnable or not.

> Would you agree that this is now safe? My concern is not so much with
> the performance of set_cpus_allowed, but rather in using the same
> concept to 'move' tasks in this state.

this code should be safe, we safely lock the 'potential' runqueue of the
task first, via task_rq_lock(), then we set p->thread_info->cpu to a
different CPU. The only requirement i can see is that this setting of
p->thread_info->cpu must be SMP-atomic - which it currently is.

> Consider the '__wake_up_sync' functionality that existed in the old
> scheduler for pipes. One result of __wake_up_sync is that the reader
> and writer of the pipe were scheduled on the same CPU. This seemed to
> help with pipe bandwidth. Perhaps we could add code something like the
> above to wakeup a task on a specific CPU. This could be used in VERY
> VERY specific cases (such as blocking reader/writer on pipe) to increase
> performance.

agreed. I removed the _sync code mainly because there was no
idle_resched() to migrate a task actively, and the migration bits i tried
were incomplete. But with your above conditional it should cover all the
practical cases we care about, in an elegant way.

i ported your sync wakeup resurrection patch to 2.5.21 (attached). I did
some modifications:

- wake_up() needs to check (rq->curr != p) as well, not only !p->array.

- make __wake_up_sync dependent on CONFIG_SMP

- export __wake_up_sync().

(the attached patch includes both the ->frozen change plus the sync wakeup
resurrection, it's against vanilla 2.5.21.)

appears to work for me just fine (compiles, boots and works under SMP & UP
alike), and does the trick for bw_pipe and lat_pipe. Comments?

Ingo

# This is a BitKeeper generated patch for the following project:
# Project Name: Linux kernel tree
# This patch format is intended for GNU patch command version 2.5 or higher.
# This patch includes the following deltas:
# ChangeSet 1.493 ->
# fs/pipe.c 1.12 -> 1.13
# kernel/ksyms.c 1.96 -> 1.97
# include/linux/sched.h 1.65 -> 1.66
# kernel/sched.c 1.83 -> 1.85
#
# The following is the BitKeeper ChangeSet Log
# --------------------------------------------
# 02/06/10 mingo@elte.hu 1.494
# - get rid of rq->frozen, fix context switch races.
# --------------------------------------------
# 02/06/11 mingo@elte.hu 1.495
# - put the sync wakeup feature back in, based on Mike Kravetz's patch.
# --------------------------------------------
#
diff -Nru a/fs/pipe.c b/fs/pipe.c
--- a/fs/pipe.c Tue Jun 11 00:34:25 2002
+++ b/fs/pipe.c Tue Jun 11 00:34:25 2002
@@ -119,7 +119,7 @@
                  * writers synchronously that there is more
                  * room.
                  */
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
                  kill_fasync(PIPE_FASYNC_WRITERS(*inode), SIGIO, POLL_OUT);
                 if (!PIPE_EMPTY(*inode))
                         BUG();
@@ -219,7 +219,7 @@
                          * is going to give up this CPU, so it doesnt have
                          * to do idle reschedules.
                          */
- wake_up_interruptible(PIPE_WAIT(*inode));
+ wake_up_interruptible_sync(PIPE_WAIT(*inode));
                          kill_fasync(PIPE_FASYNC_READERS(*inode), SIGIO, POLL_IN);
                         PIPE_WAITING_WRITERS(*inode)++;
                         pipe_wait(inode);
diff -Nru a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h Tue Jun 11 00:34:25 2002
+++ b/include/linux/sched.h Tue Jun 11 00:34:25 2002
@@ -491,6 +491,7 @@
extern unsigned long prof_shift;

extern void FASTCALL(__wake_up(wait_queue_head_t *q, unsigned int mode, int nr));
+extern void FASTCALL(__wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr));
extern void FASTCALL(sleep_on(wait_queue_head_t *q));
extern long FASTCALL(sleep_on_timeout(wait_queue_head_t *q,
                                       signed long timeout));
@@ -507,6 +508,11 @@
#define wake_up_interruptible(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
#define wake_up_interruptible_nr(x, nr) __wake_up((x),TASK_INTERRUPTIBLE, nr)
#define wake_up_interruptible_all(x) __wake_up((x),TASK_INTERRUPTIBLE, 0)
+#ifdef CONFIG_SMP
+#define wake_up_interruptible_sync(x) __wake_up_sync((x),TASK_INTERRUPTIBLE, 1)
+#else
+#define wake_up_interruptible_sync(x) __wake_up((x),TASK_INTERRUPTIBLE, 1)
+#endif
asmlinkage long sys_wait4(pid_t pid,unsigned int * stat_addr, int options, struct rusage * ru);

extern int in_group_p(gid_t);
diff -Nru a/kernel/ksyms.c b/kernel/ksyms.c
--- a/kernel/ksyms.c Tue Jun 11 00:34:25 2002
+++ b/kernel/ksyms.c Tue Jun 11 00:34:25 2002
@@ -457,6 +457,9 @@
/* process management */
EXPORT_SYMBOL(complete_and_exit);
EXPORT_SYMBOL(__wake_up);
+#if CONFIG_SMP
+EXPORT_SYMBOL_GPL(__wake_up_sync); /* internal use only */
+#endif
EXPORT_SYMBOL(wake_up_process);
EXPORT_SYMBOL(sleep_on);
EXPORT_SYMBOL(sleep_on_timeout);
diff -Nru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c Tue Jun 11 00:34:25 2002
+++ b/kernel/sched.c Tue Jun 11 00:34:25 2002
@@ -135,7 +135,6 @@
  */
struct runqueue {
         spinlock_t lock;
- spinlock_t frozen;
         unsigned long nr_running, nr_switches, expired_timestamp;
         signed long nr_uninterruptible;
         task_t *curr, *idle;
@@ -322,31 +321,43 @@
  * "current->state = TASK_RUNNING" to mark yourself runnable
  * without the overhead of this.
  */
-static int try_to_wake_up(task_t * p)
+static int try_to_wake_up(task_t * p, int sync)
{
         unsigned long flags;
         int success = 0;
         long old_state;
         runqueue_t *rq;

+repeat_lock_task:
         rq = task_rq_lock(p, &flags);
         old_state = p->state;
- p->state = TASK_RUNNING;
         if (!p->array) {
+ if (unlikely(sync && (rq->curr != p))) {
+ if (p->thread_info->cpu != smp_processor_id()) {
+ p->thread_info->cpu = smp_processor_id();
+ task_rq_unlock(rq, &flags);
+ goto repeat_lock_task;
+ }
+ }
                 if (old_state == TASK_UNINTERRUPTIBLE)
                         rq->nr_uninterruptible--;
                 activate_task(p, rq);
+ /*
+ * If sync is set, a resched_task() is a NOOP
+ */
                 if (p->prio < rq->curr->prio)
                         resched_task(rq->curr);
                 success = 1;
         }
+ p->state = TASK_RUNNING;
         task_rq_unlock(rq, &flags);
+
         return success;
}

int wake_up_process(task_t * p)
{
- return try_to_wake_up(p);
+ return try_to_wake_up(p, 0);
}

void wake_up_forked_process(task_t * p)
@@ -403,7 +414,7 @@
#if CONFIG_SMP || CONFIG_PREEMPT
asmlinkage void schedule_tail(void)
{
- spin_unlock_irq(&this_rq()->frozen);
+ spin_unlock_irq(&this_rq()->lock);
}
#endif

@@ -828,9 +839,6 @@
         if (likely(prev != next)) {
                 rq->nr_switches++;
                 rq->curr = next;
- spin_lock(&rq->frozen);
- spin_unlock(&rq->lock);
-
                 context_switch(prev, next);

                 /*
@@ -840,10 +848,8 @@
                  */
                 mb();
                 rq = this_rq();
- spin_unlock_irq(&rq->frozen);
- } else {
- spin_unlock_irq(&rq->lock);
         }
+ spin_unlock_irq(&rq->lock);

         reacquire_kernel_lock(current);
         preempt_enable_no_resched();
@@ -880,7 +886,7 @@
  * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
  * zero in this (rare) case, and we handle it by continuing to scan the queue.
  */
-static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+static inline void __wake_up_common(wait_queue_head_t *q, unsigned int mode, int nr_exclusive, int sync)
{
         struct list_head *tmp;
         unsigned int state;
@@ -891,7 +897,7 @@
                 curr = list_entry(tmp, wait_queue_t, task_list);
                 p = curr->task;
                 state = p->state;
- if ((state & mode) && try_to_wake_up(p) &&
+ if ((state & mode) && try_to_wake_up(p, sync) &&
                         ((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive))
                                 break;
         }
@@ -905,17 +911,36 @@
                 return;

         spin_lock_irqsave(&q->lock, flags);
- __wake_up_common(q, mode, nr_exclusive);
+ __wake_up_common(q, mode, nr_exclusive, 0);
+ spin_unlock_irqrestore(&q->lock, flags);
+}
+
+#if CONFIG_SMP
+
+void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr_exclusive)
+{
+ unsigned long flags;
+
+ if (unlikely(!q))
+ return;
+
+ spin_lock_irqsave(&q->lock, flags);
+ if (likely(nr_exclusive))
+ __wake_up_common(q, mode, nr_exclusive, 1);
+ else
+ __wake_up_common(q, mode, nr_exclusive, 0);
         spin_unlock_irqrestore(&q->lock, flags);
}

+#endif
+
void complete(struct completion *x)
{
         unsigned long flags;

         spin_lock_irqsave(&x->wait.lock, flags);
         x->done++;
- __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
+ __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1, 0);
         spin_unlock_irqrestore(&x->wait.lock, flags);
}

@@ -1599,7 +1624,6 @@
                 rq->active = rq->arrays;
                 rq->expired = rq->arrays + 1;
                 spin_lock_init(&rq->lock);
- spin_lock_init(&rq->frozen);
                 INIT_LIST_HEAD(&rq->migration_queue);

                 for (j = 0; j < 2; j++) {

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Ian Stirling: "2.4.18 stv680.o crashes on unplugging while capturing video."
Previous message: Russell King: "Re: [PATCH] 2.5.21 kill warnings 5/19"
In reply to: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Next in thread: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Reply: Mike Kravetz: "Re: Scheduler Bug (set_cpus_allowed)"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

This archive was generated by hypermail 2b29 : Sat Jun 15 2002 - 22:00:19 EST