Re: [PATCH] updated O(1) scheduler for 2.4

From: Andrea Arcangeli (andrea@suse.de)
Date: Wed May 29 2002 - 14:13:36 EST


On Wed, May 29, 2002 at 11:03:14AM -0700, Robert Love wrote:
> On Wed, 2002-05-29 at 08:15, Andrea Arcangeli wrote:
>
> > I merged it and I've almost finished moving everything on top of it but
> > I've a few issues.
> >
> > can you elaborate why you __save_flags in do_fork? do_fork is even a
> > blocking operation. fork_by_hand runs as well with irq enabled.
> > I don't like to save flags in a fast path if it's not required.
>
> s/you/Ingo/ ;)
>
> We save flags for two reasons: First, we touch task->time_slice which is
> touched from an interrupt handler (timer/idle/scheduler_tick) and second
> because we may call scheduler_tick which must be called with interrupts
> off. Note we restore them just a few lines later...
>
> Or, hm, are you asking why not just cli/sti? I don't know the answer to

Of course, I'm asking why not cli/sti. See below my current status of
incremental fixes on top of the O(1) scheduler: some are old fixes from -aa,
some are from -ac (I dropped one optimization of yours; I assume it's fine,
but frankly I don't care about the performance of setting the CPU affinity
and I preferred to stay safely in sync with -ac and 2.5), and some are new
issues noticed while merging (where I'm still uncertain about the open
questions). Also, the force_cpu_reschedule from the 2.5 rcu-poll was buggy,
and that's now fixed in my tree too.
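
To make the distinction concrete, here is a minimal sketch of the two idioms
(illustrative only -- the helper names are made up, this is not the real
do_fork code): __save_flags/__restore_flags preserves whatever interrupt
state the caller had, while __cli/__sti unconditionally re-enables interrupts
on exit, which is only correct when the path is known to run with interrupts
enabled, as do_fork is since it's a blocking operation.

/* variant A: preserve the caller's irq state, needed if the function
 * can be entered with interrupts already disabled */
static void touch_slice_save_flags(struct task_struct *p)
{
        unsigned long flags;

        __save_flags(flags);            /* remember whether irqs were on */
        __cli();                        /* keep the timer irq away */
        p->time_slice++;                /* time_slice is also touched from irq context */
        __restore_flags(flags);         /* back to the previous irq state */
}

/* variant B: unconditionally re-enable interrupts on exit, correct only
 * when the caller is guaranteed to run with interrupts enabled (true for
 * do_fork, which can block) */
static void touch_slice_cli_sti(struct task_struct *p)
{
        __cli();
        p->time_slice++;
        __sti();
}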

Beware, I cannot test anything yet: the tree still doesn't compile. It seems
the last thing needed to make UML link is to update the init_task; then I
will also need to do the same at least for x86-64, and possibly for other
archs if Robert hasn't taken care of them. After x86-64 and UML work, I will
check sparc64, alpha, and finally x86 SMP + UP.

diff -urNp o1-sched-ref/include/linux/wait.h o1-sched/include/linux/wait.h
--- o1-sched-ref/include/linux/wait.h Wed May 29 19:58:06 2002
+++ o1-sched/include/linux/wait.h Wed May 29 20:17:12 2002
@@ -59,6 +59,7 @@ typedef struct __wait_queue wait_queue_t
 # define wq_write_lock_irq write_lock_irq
 # define wq_write_lock_irqsave write_lock_irqsave
 # define wq_write_unlock_irqrestore write_unlock_irqrestore
+# define wq_write_unlock_irq write_unlock_irq
 # define wq_write_unlock write_unlock
 #else
 # define wq_lock_t spinlock_t
@@ -71,6 +72,7 @@ typedef struct __wait_queue wait_queue_t
 # define wq_write_lock_irq spin_lock_irq
 # define wq_write_lock_irqsave spin_lock_irqsave
 # define wq_write_unlock_irqrestore spin_unlock_irqrestore
+# define wq_write_unlock_irq spin_unlock_irq
 # define wq_write_unlock spin_unlock
 #endif
 
diff -urNp o1-sched-ref/kernel/sched.c o1-sched/kernel/sched.c
--- o1-sched-ref/kernel/sched.c Wed May 29 20:16:59 2002
+++ o1-sched/kernel/sched.c Wed May 29 20:17:07 2002
@@ -850,15 +850,15 @@ void complete(struct completion *x)
 {
         unsigned long flags;
 
- spin_lock_irqsave(&x->wait.lock, flags);
+ wq_write_lock_irqsave(&x->wait.lock, flags);
         x->done++;
         __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, 1);
- spin_unlock_irqrestore(&x->wait.lock, flags);
+ wq_write_unlock_irqrestore(&x->wait.lock, flags);
 }
 
 void wait_for_completion(struct completion *x)
 {
- spin_lock_irq(&x->wait.lock);
+ wq_write_lock_irq(&x->wait.lock);
         if (!x->done) {
                 DECLARE_WAITQUEUE(wait, current);
 
@@ -866,14 +866,14 @@ void wait_for_completion(struct completi
                 __add_wait_queue_tail(&x->wait, &wait);
                 do {
                         __set_current_state(TASK_UNINTERRUPTIBLE);
- spin_unlock_irq(&x->wait.lock);
+ wq_write_unlock_irq(&x->wait.lock);
                         schedule();
- spin_lock_irq(&x->wait.lock);
+ wq_write_lock_irq(&x->wait.lock);
                 } while (!x->done);
                 __remove_wait_queue(&x->wait, &wait);
         }
         x->done--;
- spin_unlock_irq(&x->wait.lock);
+ wq_write_unlock_irq(&x->wait.lock);
 }
 
 #define SLEEP_ON_VAR \
@@ -1499,8 +1499,8 @@ typedef struct {
  * is removed from the allowed bitmask.
  *
  * NOTE: the caller must have a valid reference to the task, the
- * task must not exit() & deallocate itself prematurely. The
- * call is not atomic; no spinlocks may be held.
+ * task must not exit() & deallocate itself prematurely. No
+ * spinlocks can be held.
  */
 void set_cpus_allowed(task_t *p, unsigned long new_mask)
 {
@@ -1523,16 +1523,6 @@ void set_cpus_allowed(task_t *p, unsigne
                 return;
         }
 
- /*
- * If the task is not on a runqueue, then it is safe to
- * simply update the task's cpu field.
- */
- if (!p->array) {
- p->cpu = __ffs(p->cpus_allowed);
- task_rq_unlock(rq, &flags);
- return;
- }
-
         init_MUTEX_LOCKED(&req.sem);
         req.task = p;
         list_add(&req.list, &rq->migration_queue);

--- ./kernel/sched.c.~1~ Wed May 29 04:50:30 2002
+++ ./kernel/sched.c Wed May 29 05:22:04 2002
@@ -569,7 +569,7 @@ skip_queue:
 #define CAN_MIGRATE_TASK(p,rq,this_cpu) \
         ((jiffies - (p)->sleep_timestamp > cache_decay_ticks) && \
                 ((p) != (rq)->curr) && \
- ((p)->cpus_allowed & (1 << (this_cpu))))
+ ((p)->cpus_allowed & (1UL << (this_cpu))))
 
         if (!CAN_MIGRATE_TASK(tmp, busiest, this_cpu)) {
                 curr = curr->next;

--- sched2/kernel/sched.c.~1~ Wed May 29 17:24:52 2002
+++ sched2/kernel/sched.c Wed May 29 17:34:22 2002
@@ -248,7 +248,6 @@ static inline void resched_task(task_t *
         int need_resched;
 
         need_resched = p->need_resched;
- wmb();
         set_tsk_need_resched(p);
         if (!need_resched && (p->cpu != smp_processor_id()))
                 smp_send_reschedule(p->cpu);
@@ -794,7 +793,7 @@ switch_tasks:
                  * if the new task was last running on a different
                  * CPU - thus re-load it.
                  */
- mb();
+ smp_mb();
                 rq = this_rq();
                 spin_unlock_irq(&rq->frozen);
         } else {
--- sched2/kernel/fork.c.~1~ Wed May 29 17:24:52 2002
+++ sched2/kernel/fork.c Wed May 29 17:34:39 2002
@@ -712,7 +712,6 @@ int do_fork(unsigned long clone_flags, u
          * total amount of pending timeslices in the system doesnt change,
          * resulting in more scheduling fairness.
          */
- __save_flags(flags);
         __cli();
         if (!current->time_slice)
                 BUG();
@@ -728,7 +727,7 @@ int do_fork(unsigned long clone_flags, u
                 scheduler_tick(0,0);
         }
         p->sleep_timestamp = jiffies;
- __restore_flags(flags);
+ __sti();
 
         /*
          * Ok, add it to the run-queues and make it

diff -urNp o1-sched-ref/include/linux/sched.h o1-sched/include/linux/sched.h
--- o1-sched-ref/include/linux/sched.h Wed May 29 17:36:11 2002
+++ o1-sched/include/linux/sched.h Wed May 29 17:36:32 2002
@@ -371,6 +371,7 @@ struct task_struct {
         unsigned long policy;
         unsigned long cpus_allowed;
         unsigned int time_slice;
+ int get_child_timeslice;
 
         task_t *next_task, *prev_task;
 
diff -urNp o1-sched-ref/kernel/exit.c o1-sched/kernel/exit.c
--- o1-sched-ref/kernel/exit.c Wed May 29 17:36:10 2002
+++ o1-sched/kernel/exit.c Wed May 29 17:36:32 2002
@@ -231,6 +231,7 @@ static inline void forget_original_paren
                         else
                                 p->p_opptr = reaper;
 
+ p->get_child_timeslice = 0;
                         if (p->pdeath_signal) send_sig(p->pdeath_signal, p, 0);
                 }
         }
diff -urNp o1-sched-ref/kernel/fork.c o1-sched/kernel/fork.c
--- o1-sched-ref/kernel/fork.c Wed May 29 17:36:21 2002
+++ o1-sched/kernel/fork.c Wed May 29 17:36:49 2002
@@ -729,6 +729,8 @@ int do_fork(unsigned long clone_flags, u
         }
         p->sleep_timestamp = jiffies;
         __sti();
+ /* Tell the parent if it can get back its timeslice when child exits */
+ p->get_child_timeslice = 1;
 
         /*
          * Ok, add it to the run-queues and make it
diff -urNp o1-sched-ref/kernel/sched.c o1-sched/kernel/sched.c
--- o1-sched-ref/kernel/sched.c Wed May 29 17:36:21 2002
+++ o1-sched/kernel/sched.c Wed May 29 17:36:32 2002
@@ -360,9 +360,11 @@ void wake_up_forked_process(task_t * p)
 void sched_exit(task_t * p)
 {
         __cli();
- current->time_slice += p->time_slice;
- if (unlikely(current->time_slice > MAX_TIMESLICE))
- current->time_slice = MAX_TIMESLICE;
+ if (p->get_child_timeslice) {
+ current->time_slice += p->time_slice;
+ if (unlikely(current->time_slice > MAX_TIMESLICE))
+ current->time_slice = MAX_TIMESLICE;
+ }
         __sti();
         /*
          * If the child was a (relative-) CPU hog then decrease
@@ -673,6 +675,7 @@ void scheduler_tick(int user_tick, int s
                  */
                 if ((p->policy == SCHED_RR) && !--p->time_slice) {
                         p->time_slice = TASK_TIMESLICE(p);
+ p->get_child_timeslice = 0;
                         set_tsk_need_resched(p);
 
                         /* put it at the end of the queue: */
@@ -696,6 +699,7 @@ void scheduler_tick(int user_tick, int s
                 set_tsk_need_resched(p);
                 p->prio = effective_prio(p);
                 p->time_slice = TASK_TIMESLICE(p);
+ p->get_child_timeslice = 0;
 
                 if (!TASK_INTERACTIVE(p) || EXPIRED_STARVING(rq)) {
                         if (!rq->expired_timestamp)

diff -urNp o1-sched/include/linux/sched.h o1-sched-child-first/include/linux/sched.h
--- o1-sched/include/linux/sched.h Wed May 29 17:36:32 2002
+++ o1-sched-child-first/include/linux/sched.h Wed May 29 17:40:12 2002
@@ -509,6 +509,7 @@ extern void set_user_nice(task_t *p, lon
 extern int task_prio(task_t *p);
 extern int task_nice(task_t *p);
 
+extern void __sched_yield(void);
 asmlinkage long sys_sched_yield(void);
 #define yield() sys_sched_yield()
 
diff -urNp o1-sched/kernel/fork.c o1-sched-child-first/kernel/fork.c
--- o1-sched/kernel/fork.c Wed May 29 17:42:11 2002
+++ o1-sched-child-first/kernel/fork.c Wed May 29 17:40:47 2002
@@ -770,12 +770,14 @@ int do_fork(unsigned long clone_flags, u
         ++total_forks;
         if (clone_flags & CLONE_VFORK)
                 wait_for_completion(&vfork);
- else
+ else {
                 /*
                  * Let the child process run first, to avoid most of the
                  * COW overhead when the child exec()s afterwards.
                  */
+ __sched_yield();
                 current->need_resched = 1;
+ }
 
 fork_out:
         return retval;
diff -urNp o1-sched/kernel/sched.c o1-sched-child-first/kernel/sched.c
--- o1-sched/kernel/sched.c Wed May 29 17:36:32 2002
+++ o1-sched-child-first/kernel/sched.c Wed May 29 17:39:57 2002
@@ -1182,7 +1182,7 @@ out_unlock:
         return retval;
 }
 
-asmlinkage long sys_sched_yield(void)
+void __sched_yield(void)
 {
         runqueue_t *rq;
         prio_array_t *array;
@@ -1215,6 +1215,11 @@ asmlinkage long sys_sched_yield(void)
                 __set_bit(current->prio, array->bitmap);
         }
         spin_unlock(&rq->lock);
+}
+
+asmlinkage long sys_sched_yield(void)
+{
+ __sched_yield();
 
         schedule();
 

> that... I would think we always enter do_fork with interrupts enabled,
> but maybe Ingo thought otherwise.
>
> > Then there are longstanding bugs that aren't yet fixed and I ported the
> > fixes on top of it (see the parent-timeslice patch in -aa).
> >
> > the child-run-first approach in o1 is suspect, it seems the parent will
> > keep running just after a wasteful reschedule, a sched yield instead
> > should be invoked like in -aa in the share-timeslice patch in order to
> > roll the current->run_list before the schedule is invoked while
> > returning to userspace after fork.
>
> I do not see this...

See the above series of patches; again, it may be broken, as it's still untested.

> > another suspect thing I noticed is the wmb() in resched_task. Can you
> > elaborate on what it is trying to serialize (I hope not the read of
> > p->need_resched with the stuff below)? Also if something it should be a
> > smp_wmb(), same smp_ prefix goes for the other mb() in schedule.
>
> I suspect you may be right here. I believe the wmb() is to serialize
> the reading of need_resched vs the writing of need_resched below it vs
> whatever may happen to need_resched elsewhere.

I actually removed it entirely: need_resched is guaranteed to be read before
it is overwritten, because the read and the write go to the same memory
address.

Still, I wonder whether the other CPU will see need_resched set when it goes
to read it; I can imagine a wmb() being needed on the writer CPU and an rmb()
on the reader, though hopefully it serializes via the spinlocks. I didn't
touch this area, but if the wmb() was meant to sit after the need_resched = 1,
it would have had to be one line lower, so it would still be a bug where it
was; a wmb() in that place looks superfluous, so I dropped it until somebody
comments.
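
For reference, this is how I read the ordering in the (wmb-less) resched_task
from the patch above; the comments are mine, and the barrier pairing they
mention is only a guess about what the original wmb() might have been after:

static inline void resched_task(task_t *p)
{
        int need_resched;

        /* the read and the write hit the same address, so the read below
         * is guaranteed to happen before the overwrite -- no barrier is
         * needed in between */
        need_resched = p->need_resched;
        set_tsk_need_resched(p);

        /* if any barrier were needed it would go here, after the store,
         * so the other cpu sees need_resched set by the time it handles
         * the IPI -- e.g. an smp_wmb() here paired with an smp_rmb() on
         * the reader side; hopefully the runqueue locks already
         * serialize this */
        if (!need_resched && (p->cpu != smp_processor_id()))
                smp_send_reschedule(p->cpu);
}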

>
> But resched_task is the only bit from 2.5 I have not fully back
> ported...take a look at resched_task in 2.5: I need to bring that to
> 2.4. I suspect idle polling is broken in 2.4, too.

Your version looks OK at first glance.

Andrea