No 100 HZ timer !

From: schwidefsky@de.ibm.com
Date: Mon Apr 09 2001 - 10:54:37 EST


Hi,
seems like my first try with the complete patch hasn't made it through
to the mailing list. This is the second try with only the common part of
the patch. Here we go (again):

---

I have a suggestion that might seem unusual at first but it is important for Linux on S/390. We are facing the problem that we want to start many (> 1000) Linux images on a big S/390 machine. Every image has its own 100 HZ timer on every processor the image uses (normally 1). On a single image system the processor use of the 100 HZ timer is not a big deal, but with > 1000 images you need a lot of processing power just to execute the 100 HZ timers. You quickly end up with 100% CPU only for the timer interrupts of otherwise idle images. Therefore I had a go at the timer stuff and now I have a system running without the 100 HZ timer. Unfortunately I need to make changes to common code and I want your opinion on it.

The first problem was how to get rid of the jiffies. The solution is simple. I simply defined a macro that calculates the jiffies value from the TOD clock:

    #define jiffies ({ \
            uint64_t __ticks; \
            asm ("STCK %0" : "=m" (__ticks) );          /* store TOD clock */ \
            __ticks = (__ticks - init_timer_cc) >> 12;  /* TOD units -> usecs */ \
            do_div(__ticks, (1000000/HZ));              /* usecs -> jiffies */ \
            ((unsigned long) __ticks); \
    })

With this define you are independent of the jiffies variable, which is no longer needed, so I ifdef'ed the definition. There are some places where a local variable is named jiffies. The macro must not be expanded there, so I renamed them to _jiffies. A kernel compiled with only this change works as always.
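To make the arithmetic in the macro concrete, here is a minimal portable sketch of the same conversion without the STCK instruction. It assumes the TOD clock format in which bit 51 ticks once per microsecond (so 4096 TOD units make one microsecond). The helper name tod_to_jiffies and its parameters are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one microsecond equals 1 << 12 TOD clock units. */
#define TOD_SHIFT 12

/*
 * Hypothetical helper, not part of the patch: number of jiffies that
 * have elapsed between tod_boot and tod_now for a given tick rate hz.
 */
static unsigned long tod_to_jiffies(uint64_t tod_now, uint64_t tod_boot,
                                    unsigned int hz)
{
    uint64_t usecs;

    assert(hz > 0 && hz <= 1000000);

    /* TOD units since boot -> microseconds since boot */
    usecs = (tod_now - tod_boot) >> TOD_SHIFT;

    /* microseconds -> jiffies (1000000/hz microseconds per jiffy) */
    return (unsigned long)(usecs / (1000000u / hz));
}
```

With hz = 100, one second of TOD time (1000000 << 12 units) maps to 100 jiffies, and anything shorter than 10000 microseconds maps to 0.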

The second problem is that you need to be able to find out when the next timer event is due to happen. You'll find a new function "next_timer_event" in the patch which traverses tv1-tv5 and returns the timer_list of the next timer event. It is used in timer_bh to indicate to the backend when the next interrupt should happen.

This leads us to the notifier functions. Each time a new timer is added, a timer is modified, or a timer expires the architecture backend needs to reset its timeout value. That is what the "timer_notify" callback is used for. The implementation on S/390 uses the clock comparator and looks like this:

    static void s390_timer_notify(unsigned long expires)
    {
            /* absolute TOD value that corresponds to jiffy "expires" */
            S390_lowcore.timer_event =
                    ((__u64) expires*CLK_TICKS_PER_JIFFY) + init_timer_cc;
            /* load the clock comparator of this cpu */
            asm volatile ("SCKC %0" : : "m" (S390_lowcore.timer_event));
    }

This causes an interrupt on the cpu which executed s390_timer_notify after "expires" has passed. That means that timer events are spread over the cpus in the system. Modified or deleted timer events do not cause a deletion notification. A cpu might be erroneously interrupted too early because of a timer event that has been modified or deleted. But that doesn't do any harm, it is just unnecessary work.
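The two steps described here, picking the earliest pending timer and turning its expiry into a clock comparator value, can be sketched in portable C. This is not the code from the patch: a flat array stands in for the tv1-tv5 timer vectors, jiffies wraparound is ignored, and struct sketch_timer and both function names are hypothetical. CLK_TICKS_PER_JIFFY is assumed to be the inverse of the conversion in the jiffies macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HZ 100
/* Assumed value: TOD clock units per jiffy (microseconds per jiffy << 12). */
#define CLK_TICKS_PER_JIFFY ((uint64_t)(1000000 / HZ) << 12)

/* Hypothetical stand-in for the kernel's struct timer_list. */
struct sketch_timer {
    unsigned long expires;      /* expiry time in jiffies */
};

/*
 * Return the pending timer with the smallest expiry, or NULL if there
 * is none. Simplification: linear scan, no wraparound handling.
 */
static const struct sketch_timer *
sketch_next_timer_event(const struct sketch_timer *timers, size_t n)
{
    const struct sketch_timer *next = NULL;
    size_t i;

    assert(timers != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (next == NULL || timers[i].expires < next->expires)
            next = &timers[i];
    }
    return next;
}

/* Absolute TOD value that a backend would load into the comparator. */
static uint64_t sketch_comparator_value(unsigned long expires,
                                        uint64_t init_timer_cc)
{
    return (uint64_t)expires * CLK_TICKS_PER_JIFFY + init_timer_cc;
}
```

A backend in the style of s390_timer_notify would pass the expires field of the returned timer to sketch_comparator_value and program the hardware with the result.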

There is a second callback "itimer_notify" that is used to get the per process timers right. We use the cpu timer for this purpose:

    void set_cpu_timer(void)
    {
            unsigned long min_ticks;
            __u64 time_slice;

            if (current->pid != 0 && current->need_resched == 0) {
                    /* smallest of: remaining time slice, profiling
                       itimer, virtual itimer (0 means not armed) */
                    min_ticks = current->counter;
                    if (current->it_prof_value != 0 &&
                        current->it_prof_value < min_ticks)
                            min_ticks = current->it_prof_value;
                    if (current->it_virt_value != 0 &&
                        current->it_virt_value < min_ticks)
                            min_ticks = current->it_virt_value;
                    time_slice = (__u64) min_ticks*CLK_TICKS_PER_JIFFY;
                    /* set the cpu timer */
                    asm volatile ("spt %0" : : "m" (time_slice));
            }
    }

The cpu timer is a one shot timer that interrupts after the specified amount of time has passed. It is not 100% accurate because VM can schedule the virtual processor before the "spt" has been done, but it is good enough for per process timers.

The remaining changes to common code parts deal with the problem that many ticks may be accounted at once. For example without the 100 HZ timer it is possible that a process runs for half a second in user space. With the next interrupt all the ticks between the last update and the interrupt have to be added to the tick counters. This is why update_wall_time and do_it_prof have changed and update_process_times2 has been introduced.

That leaves three problems:

1) you need to check on every system entry if a tick or more has passed and do the update if necessary,
2) you need to keep track of the elapsed time in user space and in kernel space, and
3) you need to check tq_timer every time the system is left and set up a timer event for the next timer tick if there is work to do on the timer queue.

These three problems are related and have to be solved in architecture dependent code. A nice thing we get for free is that the user/kernel elapsed time measurement gets much more accurate.

The number of interrupts in an idle system due to timer activity drops from 100 per second on every cpu to about 5-6 on all (!) cpus if this patch is used. Exactly what we want to have.

All this new timer code is only used if the config option CONFIG_NO_HZ_TIMER is set. Without it everything works as always, especially for architectures that will not use it.

Now what do you think?

(See attached file: timer_common)

blue skies, Martin

Linux/390 Design & Development, IBM Deutschland Entwicklung GmbH
Schönaicherstr. 220, D-71032 Böblingen, Telefon: 49 - (0)7031 - 16-2247
E-Mail: schwidefsky@de.ibm.com




