Re: [RFC] RCU and CONFIG_PREEMPT_RT progress

From: Esben Nielsen
Date: Wed May 11 2005 - 13:20:04 EST


On Tue, 10 May 2005, Paul E. McKenney wrote:

> On Tue, May 10, 2005 at 12:55:31PM +0200, Esben Nielsen wrote:
> [...]
> > >
> > > The current implementation in Ingo's CONFIG_PREEMPT_RT patch uses a
> > > counter-based approach, which seems to work, but which can result in
> > > indefinite-duration grace periods.
> >
> > Is that really a problem in practise? For RCU updates should happen
> > rarely.
>
> Yes. Several people have run into serious problems on small-memory
> machines under denial-of-service workloads. Not pretty.
>
That is what I tell people at work: dynamic memory allocation is _not_ a
good thing in real-time embedded systems. They call me old-fashioned when
I say I want anything important pre-allocated in separate pools. But that
is exactly what avoids this kind of surprise, makes the system more
deterministic and makes it easier to find leaks.
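
(Just to illustrate what I mean by a pool - a rough sketch, nothing to do
with the attached patch, all names invented:)

#include <linux/spinlock.h>

/* A trivial fixed-size, pre-allocated pool.  Allocation either succeeds
 * immediately or fails - no surprises from the allocator at run time. */
#define POOL_SIZE 128

struct my_obj {
        struct my_obj *next;
        /* payload goes here */
};

static struct my_obj pool_mem[POOL_SIZE];
static struct my_obj *pool_free_list;
static spinlock_t pool_lock = SPIN_LOCK_UNLOCKED;

static void pool_init(void)
{
        int i;

        for (i = 0; i < POOL_SIZE - 1; i++)
                pool_mem[i].next = &pool_mem[i + 1];
        pool_mem[POOL_SIZE - 1].next = NULL;
        pool_free_list = &pool_mem[0];
}

static struct my_obj *pool_alloc(void)
{
        struct my_obj *obj;
        unsigned long flags;

        spin_lock_irqsave(&pool_lock, flags);
        obj = pool_free_list;
        if (obj)
                pool_free_list = obj->next;
        spin_unlock_irqrestore(&pool_lock, flags);
        return obj;     /* NULL means the pool is empty - a deterministic failure */
}

static void pool_put(struct my_obj *obj)
{
        unsigned long flags;

        spin_lock_irqsave(&pool_lock, flags);
        obj->next = pool_free_list;
        pool_free_list = obj;
        spin_unlock_irqrestore(&pool_lock, flags);
}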

> [...]
> >
> > Why bother? What is the overhead of checking relative to just doing the
> > increment/decrement?
>
> The increment of the per-CPU counter pair is likely to incur a cache miss,
> and thus be orders of magnitude more expensive than the increment of
> the variable in the task structure. I did not propose this optimization
> lightly. ;-)
>
> > I had another idea: do it in the scheduler:
> >
> >     per_cpu_rcu_count += prev->rcu_read_count;
> >     if (per_cpu_rcu_count == 0) {
> >             /* grace period encountered */
> >     }
> >     (do the task switch)
> >
> >     per_cpu_rcu_count -= next->rcu_read_count;
> >
> > Then per_cpu_rcu_count is the rcu-read-count for all tasks except the
> > current one. During schedule there is no current and therefore it is the
> > total count.
>
> One downside of this approach is that it would maximally expand the
> RCU read-side critical sections -- one of the things we need is to
> avoid holding onto memory longer than necessary, as noted above.
>
> Also, what prevents a grace period from ending while a task is
> preempted in an RCU read-side critical section? Or are you intending
> that synchronize_rcu() scan the task list as well as the per-CPU
> counters?
Eh, if a task is preempted in an RCU read-side CS there IS no grace period!
The trick is to avoid having it hang there for too long (see my boosting
mechanism below).
>
> > > [...]
> > > 2. #1 above, but use a seqlock to guard the counter selection in
> > > rcu_read_lock(). One potential disadvantage of this approach
> > > is that an extremely unlucky instance of rcu_read_lock() might
> > > be indefinitely delayed by a series of counter flips. I am
> > > concerned that this might actually happen under low-memory
> > > conditions. Also requires memory barriers on the read side,
> > > which we might be stuck with, but still hope to be able to
> > > get rid of. And the per-CPU counter manipulation must use
> > > atomic instructions.
> >
> > Why not just use
> >     preempt_disable();
> >     manipulate counters;
> >     preempt_enable();
> > ??
>
> Because we can race with the counter switch, which can happen on some
> other CPU. Therefore, preempt_disable() does not help with this race.

Well, the per-cpu counters should be truly per-cpu, i.e. all tasks must
_only_ touch the counter of the CPU they are currently running on.
preempt_disable() prevents the migration of the current task, right?
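
(Roughly what I have in mind - a sketch only; the attached patch uses
local_irq_save() instead, but with truly per-cpu counters preempt_disable()
around the update should be enough:)

static inline void my_rcu_read_lock(void)
{
        preempt_disable();      /* no migration while we touch the counter */
        __get_cpu_var(rcu_data).active_readers++;
        current->rcu_read_lock_nesting++;
        preempt_enable();       /* preemption inside the read-side CS is the
                                 * whole point; 3) below keeps the task from
                                 * migrating while the nesting count is > 0 */
}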

>
> > > 3. #1 above, but use per-CPU locks to guard the counter selection.
> > > I don't like this any better than #2, worse, in fact, since it
> > > requires expensive atomic instructions as well.
> >
> > local_irq_disable() and/or preempt_disable() should work as the counters
> > are per-cpu, right?
>
> Again, neither of these help with the race against the counter-switch,
> which would likely be happening on some other CPU.
As above :-)

>
> > Or see the alternative above: the counter is per task but latched to a
> > per-cpu counter on schedule().
>
> This could work (assuming that synchronize_rcu() scanned the task list
> as well as the per-CPU counters), but again would extend grace periods
> too long for small-memory machines subject to denial-of-service attacks.
>
No need to traverse the list. Again, synchronize_rcu() waits for all
CPUs to check their numbers _locally_. The task doing that can simply check
its latched number, as it knows it is not in a read-side CS.
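
(That is basically what the modified quiescent-state check in the attached
patch amounts to - sketched here outside the patch:)

/* A CPU has passed a quiescent state only when no reader is currently
 * accounted to its local counter.  Each CPU only ever looks at its own
 * counter, so neither a task-list traversal nor any cross-CPU access is
 * needed. */
static inline void check_local_quiescent_state(int cpu)
{
        struct rcu_data *rdp = &per_cpu(rcu_data, cpu);

        if (rdp->active_readers == 0)
                rdp->passed_quiesc = 1;
}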

> > [...]
> > I was playing around with it a little while back. I didn't send a patch
> > though as I didn't have time. It works on a UP machine but I don't have an
> > SMP to test on :-(
>
> There are some available from OSDL, right? Or am I out of date here?
Honestly, I can't remember if I mailed it to anyone. I only have the code
on my home PC, which broke down; I am right now running off a backup. I
have attached the patch to this mail. I don't have time to clean it up :-(
I see it also includes code for a test device with which I try to measure
the grace period.
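
(For reference, the userspace side is something along these lines - written
from memory, so take it as a sketch only:)

/* Exercise /dev/rcu_tester: the READ ioctl argument is the number of
 * busy-loop iterations spent inside the read-side critical section, and a
 * failure return means the protected data changed under the reader; the
 * WRITE ioctl swaps the protected pointer and waits for a grace period. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define RCU_TESTER_WRITE 4247
#define RCU_TESTER_READ  4248

int main(void)
{
        int fd = open("/dev/rcu_tester", O_RDWR);
        int i;

        if (fd < 0) {
                perror("open /dev/rcu_tester");
                return 1;
        }
        for (i = 0; i < 1000; i++) {
                if (ioctl(fd, RCU_TESTER_READ, 100000) < 0)
                        perror("read side saw the data change");
                if (ioctl(fd, RCU_TESTER_WRITE, 0) < 0)
                        perror("update failed");
        }
        close(fd);
        return 0;
}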

>
> > The ingredients are:
> > 1) A per-cpu read-side counter, protected locally by preempt_disable().
>
> In order to avoid too-long grace periods, you need a pair of counters.
> If you only have a single counter, you can end up with a series of
> RCU read-side critical-section preemptions resulting in the counter
> never reaching zero, and thus the grace period never terminating.
> Not good, especially in small-memory machines subject to denial-of-service
> attacks.
See the boosting mechanism below. That will do the trick.

>
> > 2) A task read-side counter (doesn't need protection at all).
>
> Yep. Also a field in the task struct indicating which per-CPU counter
> applies (unless the scheduler is managing this).
No, it is _always_ the local CPU which applies, as I have tried to arrange
with 3) below (which isn't tested at all).
>
> > 3) Tasks having a non-zero read-side counter can't be migrated to other
> > CPUs. That is to make the per-cpu counter truly per-cpu.
>
> If CPU hotplug can be disallowed, then this can work. Do we really want
> to disallow CPU hotplug? Especially given that some implementations
> might want to use CPU hotplug to reduce power consumption? There also
> might be issues with tickless operation.
>
Yeah, I noticed the hotplug code. What are the RT requirements for
hotplugging? With 4) below it shouldn't be a problem to wait until all
tasks are out of their read-side critical sections.

A better alternative is to make migration work like this (see the sketch
below):
1) Deactivate the task.
2) Make the target cpu increment its rcu-counter.
3) Migrate the task.
4) Make the source cpu decrement its rcu-counter.
Then you can safely migrate tasks preempted in a read-side CS as well.
Harder to code than my small hack, though.
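
(Roughly like this - a pure sketch, not in the attached patch, the helper
name is invented.  In reality steps 2) and 4) would have to run on the CPU
that owns the respective counter, e.g. via an IPI or from schedule(), so
that the counters stay strictly locally accessed:)

static void migrate_rcu_reader(task_t *p, int src_cpu, int dst_cpu)
{
        int nesting = p->rcu_read_lock_nesting;

        deactivate_task(p, cpu_rq(src_cpu));                        /* 1) */
        per_cpu(rcu_data, dst_cpu).active_readers += nesting;       /* 2) */
        set_task_cpu(p, dst_cpu);                                   /* 3) */
        activate_task(p, cpu_rq(dst_cpu), 0);
        per_cpu(rcu_data, src_cpu).active_readers -= nesting;       /* 4) */
}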

> > 4) When the scheduler sees a SCHED_OTHER task with an rcu-read-counter > 0
> > it boosts the priority to the maximum non-RT priority so that it will not
> > be preempted by other SCHED_OTHER tasks. This makes the delay in
> > synchronize_kernel() somewhat deterministic (it depends on how
> > "deterministic" the RT tasks in the system are.)
>
> But one would only want to do this boost if there was a synchronize_rcu()
> waiting -and- if there is some reason to accelerate grace periods (e.g.,
> low on memory), right? If there is no reason to accelerate grace periods,
> then extending them increases efficiency by amortizing grace-period
> overhead over many updates.
Well, I only do the boost if the task is actually preempted in the
read-side CS - it should be relatively rare. I could add a check and only
do the boost if a request for a grace period is pending (see the sketch
below). BUT the problem is that when somebody requests a grace period later
on, it would have to traverse the runqueue and boost the tasks in read-side
CS. That sounds more expensive, but one can only know by testing....
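
(The lazier variant would just grow an extra condition in the scheduler
check from the attached patch - a sketch; rcu_grace_period_pending() is
invented and does not exist in the patch today:)

/* Boost a preempted reader to the top non-RT priority, but only when
 * somebody is actually waiting for a grace period.  Would be called from
 * schedule() where the attached patch currently boosts unconditionally. */
static inline void boost_preempted_reader(task_t *prev)
{
        prio_array_t *prev_array = prev->array;

        if (!prev->rcu_read_lock_nesting || prev->prio <= MAX_RT_PRIO)
                return;
        if (!rcu_grace_period_pending())
                return;

        if (prev_array)
                dequeue_task(prev, prev_array);
        prev->prio = MAX_RT_PRIO;
        if (prev_array)
                enqueue_task_head(prev, prev_array);
}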

>
> > I'll try to send you what I got when I get my computer at home back up.
> > I have only tested it on my UP laptop, where it seems to run fine.
>
> I look forward to seeing it!
>
Attached here :-)
> Thanx, Paul
>

Esben
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/Makefile linux-2.6.12-rc1-RCU/Makefile
--- Orig/linux-2.6.12-rc1/Makefile Sun Mar 20 20:30:55 2005
+++ linux-2.6.12-rc1-RCU/Makefile Sun Mar 27 21:05:58 2005
@@ -1,7 +1,7 @@
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 12
-EXTRAVERSION =-rc1
+EXTRAVERSION =-rc1-RCU-SMP
NAME=Woozy Numbat

# *DOCUMENTATION*
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Kconfig linux-2.6.12-rc1-RCU/drivers/char/Kconfig
--- Orig/linux-2.6.12-rc1/drivers/char/Kconfig Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Kconfig Fri Mar 25 00:19:07 2005
@@ -978,6 +978,13 @@
The mmtimer device allows direct userspace access to the
Altix system timer.

+
+config RCU_TESTER
+ tristate "Test device for testing RCU code."
+ default n
+ help
+ To help debug the RCU code. Don't include this.
+
source "drivers/char/tpm/Kconfig"

endmenu
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/Makefile linux-2.6.12-rc1-RCU/drivers/char/Makefile
--- Orig/linux-2.6.12-rc1/drivers/char/Makefile Sun Mar 20 20:31:03 2005
+++ linux-2.6.12-rc1-RCU/drivers/char/Makefile Fri Mar 25 00:19:11 2005
@@ -48,6 +48,8 @@
obj-$(CONFIG_VIOTAPE) += viotape.o
obj-$(CONFIG_HVCS) += hvcs.o

+obj-$(CONFIG_RCU_TESTER) += rcu_tester.o
+
obj-$(CONFIG_PRINTER) += lp.o
obj-$(CONFIG_TIPAR) += tipar.o

diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c
--- Orig/linux-2.6.12-rc1/drivers/char/rcu_tester.c Thu Jan 1 01:00:00 1970
+++ linux-2.6.12-rc1-RCU/drivers/char/rcu_tester.c Wed Mar 30 00:15:52 2005
@@ -0,0 +1,120 @@
+/*
+ * test device for testing RCU code
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/fs.h>
+#include <linux/slab.h>
+#include <linux/spinlock.h>
+#include <linux/rcupdate.h>
+#include <linux/miscdevice.h>
+
+#define RCU_TESTER_MINOR 222
+
+#define RCU_TESTER_WRITE 4247
+#define RCU_TESTER_READ 4248
+
+#define ERCU_FAILED 35
+
+#ifndef notrace
+#define notrace
+#endif
+
+u64 notrace get_cpu_tick(void)
+{
+ u64 tsc;
+#ifdef ARCHARM
+ tsc = *oscr;
+#else
+ __asm__ __volatile__("rdtsc" : "=A" (tsc));
+#endif
+ return tsc;
+}
+
+void notrace loop(int loops)
+{
+ int i;
+
+ for (i = 0; i < loops; i++)
+ get_cpu_tick();
+}
+
+static spinlock_t write_lock;
+static int *protected_data = NULL;
+
+static int rcu_tester_open(struct inode *in, struct file *file)
+{
+ return 0;
+}
+
+static long rcu_tester_ioctl(struct file *file,
+ unsigned int cmd, unsigned long args)
+{
+ switch(cmd) {
+ case RCU_TESTER_READ: {
+ int *dataptr, data1,data2;
+ rcu_read_lock();
+ dataptr = protected_data;
+ data1 = *dataptr;
+ loop(args);
+ data2 = *dataptr;
+ rcu_read_unlock();
+ if(data1!=data2)
+ return -ERCU_FAILED;
+ else
+ return 0; /* ok */
+ }
+ case RCU_TESTER_WRITE: {
+ int *newdata, *olddata;
+ newdata = kmalloc(sizeof(int), GFP_KERNEL);
+ if(!newdata)
+ return -ENOMEM;
+
+ spin_lock(&write_lock);
+ olddata = rcu_dereference(protected_data);
+ *newdata = *olddata + 1;
+ rcu_assign_pointer(protected_data, newdata);
+ spin_unlock(&write_lock);
+ synchronize_kernel();
+ kfree(olddata);
+ return 0;
+ }
+ default:
+ return -EINVAL;
+ }
+}
+
+static struct file_operations rcu_tester_fops = {
+ .owner = THIS_MODULE,
+ .llseek = no_llseek,
+ .unlocked_ioctl = rcu_tester_ioctl,
+ .open = rcu_tester_open,
+};
+
+static struct miscdevice rcu_tester_dev =
+{
+ RCU_TESTER_MINOR,
+ "rcu_tester",
+ &rcu_tester_fops
+};
+
+static int __init rcu_tester_init(void)
+{
+ if (misc_register(&rcu_tester_dev))
+ return -ENODEV;
+
+ protected_data = kmalloc(sizeof(int), GFP_KERNEL);
+ *protected_data = 0;
+
+ spin_lock_init(&write_lock);
+
+ return 0;
+}
+
+void __exit rcu_tester_exit(void)
+{
+ printk(KERN_INFO "rcu-tester device uninstalled\n");
+ misc_deregister(&rcu_tester_dev);
+}
+
+module_init(rcu_tester_init);
+module_exit(rcu_tester_exit);
+
+MODULE_LICENSE("GPL");
+
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/init_task.h linux-2.6.12-rc1-RCU/include/linux/init_task.h
--- Orig/linux-2.6.12-rc1/include/linux/init_task.h Sun Mar 20 20:31:28 2005
+++ linux-2.6.12-rc1-RCU/include/linux/init_task.h Fri Mar 25 00:22:00 2005
@@ -74,6 +74,7 @@
.usage = ATOMIC_INIT(2), \
.flags = 0, \
.lock_depth = -1, \
+ .rcu_read_lock_nesting = 0, \
.prio = MAX_PRIO-20, \
.static_prio = MAX_PRIO-20, \
.policy = SCHED_NORMAL, \
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/rcupdate.h linux-2.6.12-rc1-RCU/include/linux/rcupdate.h
--- Orig/linux-2.6.12-rc1/include/linux/rcupdate.h Wed Mar 2 08:37:50 2005
+++ linux-2.6.12-rc1-RCU/include/linux/rcupdate.h Sun Mar 27 18:12:45 2005
@@ -41,6 +41,8 @@
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/seqlock.h>
+#include <linux/sched.h>
+#include <linux/interrupt.h>

/**
* struct rcu_head - callback structure for use with RCU
@@ -85,6 +87,7 @@
* curlist - current batch for which quiescent cycle started if any
*/
struct rcu_data {
+ int active_readers;
/* 1) quiescent state handling : */
long quiescbatch; /* Batch # for grace period */
int passed_quiesc; /* User-mode/idle loop etc. */
@@ -115,12 +118,14 @@
static inline void rcu_qsctr_inc(int cpu)
{
struct rcu_data *rdp = &per_cpu(rcu_data, cpu);
- rdp->passed_quiesc = 1;
+ if(rdp->active_readers==0)
+ rdp->passed_quiesc = 1;
}
static inline void rcu_bh_qsctr_inc(int cpu)
{
struct rcu_data *rdp = &per_cpu(rcu_bh_data, cpu);
- rdp->passed_quiesc = 1;
+ if(rdp->active_readers==0)
+ rdp->passed_quiesc = 1;
}

static inline int __rcu_pending(struct rcu_ctrlblk *rcp,
@@ -183,14 +188,34 @@
*
* It is illegal to block while in an RCU read-side critical section.
*/
-#define rcu_read_lock() preempt_disable()
+static inline void rcu_read_lock(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __get_cpu_var(rcu_data).active_readers++;
+ current->rcu_read_lock_nesting++;
+ local_irq_restore(flags);
+}

/**
* rcu_read_unlock - marks the end of an RCU read-side critical section.
*
* See rcu_read_lock() for more information.
*/
-#define rcu_read_unlock() preempt_enable()
+static inline void rcu_read_unlock(void)
+{
+ unsigned long flags;
+ int cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+
+ per_cpu(rcu_data,cpu).active_readers--;
+ current->rcu_read_lock_nesting--;
+ rcu_qsctr_inc(cpu);
+ local_irq_restore(flags);
+}

/*
* So where is rcu_write_lock()? It does not exist, as there is no
@@ -213,14 +238,34 @@
* can use just rcu_read_lock().
*
*/
-#define rcu_read_lock_bh() local_bh_disable()
+static inline void rcu_read_lock_bh(void)
+{
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __get_cpu_var(rcu_bh_data).active_readers++;
+ current->rcu_read_lock_nesting++;
+ local_irq_restore(flags);
+}

/*
* rcu_read_unlock_bh - marks the end of a softirq-only RCU critical section
*
* See rcu_read_lock_bh() for more information.
*/
-#define rcu_read_unlock_bh() local_bh_enable()
+static inline void rcu_read_unlock_bh(void)
+{
+ unsigned long flags;
+ int cpu;
+
+ local_irq_save(flags);
+ cpu = smp_processor_id();
+
+ per_cpu(rcu_bh_data,cpu).active_readers--;
+ current->rcu_read_lock_nesting--;
+ rcu_bh_qsctr_inc(cpu);
+ local_irq_restore(flags);
+}

/**
* rcu_dereference - fetch an RCU-protected pointer in an
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/include/linux/sched.h linux-2.6.12-rc1-RCU/include/linux/sched.h
--- Orig/linux-2.6.12-rc1/include/linux/sched.h Sun Mar 20 20:31:29 2005
+++ linux-2.6.12-rc1-RCU/include/linux/sched.h Fri Mar 25 00:22:00 2005
@@ -569,6 +569,9 @@
unsigned long ptrace;

int lock_depth; /* Lock depth */
+
+ int rcu_read_lock_nesting; /* How many RCU read regions the thread is
+ in */

int prio, static_prio;
struct list_head run_list;
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/init/main.c linux-2.6.12-rc1-RCU/init/main.c
--- Orig/linux-2.6.12-rc1/init/main.c Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/init/main.c Fri Mar 25 13:36:26 2005
@@ -688,8 +688,11 @@
system_state = SYSTEM_RUNNING;
numa_default_policy();

- if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
- printk(KERN_WARNING "Warning: unable to open an initial console.\n");
+ {
+ int res = sys_open((const char __user *) "/dev/console", O_RDWR, 0);
+ if(res<0)
+ printk(KERN_WARNING "Warning: unable to open an initial console.: retval=%d\n",res);
+ }

(void) sys_dup(0);
(void) sys_dup(0);
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/rcupdate.c linux-2.6.12-rc1-RCU/kernel/rcupdate.c
--- Orig/linux-2.6.12-rc1/kernel/rcupdate.c Wed Mar 2 08:37:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/rcupdate.c Fri Mar 25 13:08:38 2005
@@ -66,7 +66,10 @@
{.lock = SPIN_LOCK_UNLOCKED, .cpumask = CPU_MASK_NONE };

DEFINE_PER_CPU(struct rcu_data, rcu_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_data);
DEFINE_PER_CPU(struct rcu_data, rcu_bh_data) = { 0L };
+EXPORT_PER_CPU_SYMBOL(rcu_bh_data);
+

/* Fake initialization required by compiler */
static DEFINE_PER_CPU(struct tasklet_struct, rcu_tasklet) = {NULL};
diff -Naur --exclude-from=diff_exclude Orig/linux-2.6.12-rc1/kernel/sched.c linux-2.6.12-rc1-RCU/kernel/sched.c
--- Orig/linux-2.6.12-rc1/kernel/sched.c Sun Mar 20 20:31:30 2005
+++ linux-2.6.12-rc1-RCU/kernel/sched.c Wed Mar 30 00:06:39 2005
@@ -594,6 +594,9 @@
if (rt_task(p))
return p->prio;

+ if (p->rcu_read_lock_nesting)
+ return MAX_RT_PRIO;
+
bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

prio = p->static_prio - bonus;
@@ -986,6 +989,9 @@
if (unlikely(task_running(rq, p)))
goto out_activate;

+ if (unlikely(p->rcu_read_lock_nesting))
+ goto out_activate;
+
#ifdef CONFIG_SCHEDSTATS
schedstat_inc(rq, ttwu_cnt);
if (cpu == this_cpu) {
@@ -1644,6 +1650,8 @@
return 0;
if (!cpu_isset(this_cpu, p->cpus_allowed))
return 0;
+ if (p->rcu_read_lock_nesting)
+ return 0;

/*
* Aggressive migration if:
@@ -2633,6 +2641,7 @@
need_resched:
preempt_disable();
prev = current;
+
release_kernel_lock(prev);
need_resched_nonpreemptible:
rq = this_rq();
@@ -2675,6 +2684,7 @@
else {
if (prev->state == TASK_UNINTERRUPTIBLE)
rq->nr_uninterruptible++;
+
deactivate_task(prev, rq);
}
}
@@ -2710,6 +2720,17 @@
}

array = rq->active;
+
+ if( unlikely(prev->rcu_read_lock_nesting) &&
+ prev->prio > MAX_RT_PRIO ) {
+ prio_array_t *prev_array = prev->array;
+ if(prev_array)
+ dequeue_task(prev, prev_array);
+ prev->prio = MAX_RT_PRIO;
+ if(prev_array)
+ enqueue_task_head(prev, prev_array);
+ }
+
if (unlikely(!array->nr_active)) {
/*
* Switch the active and expired arrays.