Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Dan Hecht
Date: Tue Mar 06 2007 - 16:08:17 EST


On 03/06/2007 02:59 AM, Thomas Gleixner wrote:
On Tue, 2007-03-06 at 00:55 -0800, Zachary Amsden wrote:
a proper CE device also has the added bonus of making high-res timers guests work automatically. It should be simple: just pass it through to your hypervisor, a hyper-CE-device, like a hyper-clocksource device has essentially no guest-side complexity.
It is not so simple. In theory it works great. In reality, the i386 implementation is completely hardwired to work the way hardware works, and breaking the clockevent code out of the deep ties to the APIC is extremely non-trivial. We tried, and could not accomplish it for 2.6.21 because the hrtimers integration was complex, and introduced many bugs for us.

Why is this so non-trivial ? All you have to do is _NOT_ register
PIT/HPET/APIC timers and register a per CPU hyper-CE-device instead,
which uses the hypervisor timer emulation instead of real hardware.

clockevents breaks the hardwired assumptions of the old timer code and
allows you to remove _ALL_ the hardwired hackery in vmitimer.c, i.e.
stuff like

/* Disable PIT. */
outb_p(0x3a, PIT_MODE); /* binary, mode 5, LSB/MSB, ch 0 */


Hmm, I think that the (virtual) bios still will set up the PIT ch 0, and we still need to stop it.

In any case, clockevents doesn't really make it easier nor harder as far as init goes. In the pre-clockevent days, we replace setup_pit_timer, setup_boot_clock, setup_secondary_clock. With clockevents, I think the hook points are the same. Mostly just need to allow the per-cpu lapic_event to be generalized to local_clock_events that can be set to whatever device we want. The other thing on i386 is just some minor annoyances due initially setting up only the PIT on cpu0 on irq 0 and then later setting up per-cpu timer on lvtt, and making this all place nice with paravirt timers. But these are just details and just require some minor changes and will be working, but it just takes some massaging.

So, that is not the real reason to move over the clockevents. The real reason is to use the generic interrupt handlers. We understand that, and will get to that point. In the mean time, we are harming no one. Our code has zero effect when you booting natively or on a non-VMI hypervisor.

We worked around this by keeping NO_IDLE_HZ support, which now you deprecated. So now we are using NO_HZ without a hyper-CE device, and it is working fine. We understand the benefits of moving to the CE model - but it cannot be done overnight.

This is ugly as hell. NO_HZ enables the dyntick functions in idle(),
irq_enter() and irq_exit() so the clockevents code is actually invoked.
I have not looked close enough why this does work at all.


I believe this was just a quick fix in response to Ingo breaking the VMI build yesterday by disabling NO_IDLE_HZ on us. There is no technical reason why NO_IDLE_HZ=y can't coexist with NO_HZ.

(The two work okay together because when using NO_IDLE_HZ, the hooks are deeper in a custom safe_halt routine which isn't registered when using nohz mode at runtime, and conversely, the nohz code is guarded at runtime by the ts->nohz_mode. So, the two really can co-exist at compile time).

Again, no one is arguing that we shouldn't move to clockevents, it's just a matter of time (sorry, no pun intended).

The vmi-time code was introduced to solve some shortcomings of the old (pre-clocksource/clockevents/hrtimer/NO_HZ) i386 timer code that was especially painful for virtualization. Certainly, clocksource/clockevents/NO_HZ solves many of the problems (basically, moving away from counting interrupts to using time sources). e.g. xtime updating is no longer a worry with the new timeofday/clocksource stuff. But there are some that may not quite be solved, listed below. (I know I'm not telling you anything new, but I might as well flesh it out for the other paravirt folks while the code is fresh in my mind):

1) Stolen time (virtual cpu is ready to run but not running): this is handled inconsistently between the various clockevent handlers / CLOCK_EVT_MODE_ONESHOT combinations:

a) tick_handle_periodic / CLOCK_EVT_MODE_PERIODIC: depends on how you define "periodic" timer in a paravirtual world. If you do something like Xen-style where you send periodic events only to running vcpus, then this handler suffers from some of the same problems as the old i386 timer handler:
- jiffies updated according to the number of interrupts you get, so falls behind monotonic time. generally, counting timer interrupts is bad for paravirt.
- process time updated according to the number of interrupts, so falls behind monotonic time. This is probably okay though, since it is essentially tracking (mono - stolen) time. I.e. the missing time is stolen.
- jiffies updated only by boot cpu, which is a problem for paravirt since the boot vcpu can be descheduled while the other vcpus are scheduled.
- can probably just avoid this mode by not advertising PERIODIC capability by your clockevent device.

b) tick_handle_periodic / CLOCK_EVT_MODE_ONESHOT:
- jiffies updated according to monotonic time since we'll loop to catch up the oneshot timer.
- process time accounted in monotonic time, for the same reason. this is probably not what we'd want since it will charge time to whatever process happened to be scheduled in the guest during periods of stolen time.
- Same problem about jiffies only updated by one vcpu.

c) tick_nohz_handler (implies CLOCK_EVT_MODE_ONESHOT):
- jiffies updated according to monotonic time. this is good.
- Process time effectively does not count stolen time since update_process_times is called once per callback but oneshot expiry is advanced into the future. This is probably the right thing for paravirt, but, inconsistent with (b).
- jiffies updated by all vcpus, which is good.

d) hrtimer_interrupt (really tick_sched_timer):
- w.r.t. stolen time, will be similar to (c). We'll advance the sched_timer to the next period in the future, skipping stolen time for process accounting.
- jiffies will be kept up to date with monotonic time.
- jiffies updated by all vcpus, which is good.

Summary: Cases (c) and (d) should be relatively well behaved for paravirt. So, if we can depend on NO_HZ=y and not being disabled at runtime, we should be okay. We may want to have some knowledge of stolen time passed from the hypervisor (if we wanted more accurate process time accounting), but this can be put back in later, and isn't too important with sample based accounting system like Linux. But, we still need to QA all four cases, and all four cases will likely expose different bugs due to second order effects.

2) Virtual interrupts have a relatively high overhead as compared with native interrupts. So, in vmitime, we wanted to be able to lower the timer interrupt rate at runtime, even if HZ is a compile time constant (and set to something high, like 1000hz). While we could hack this in by using evt->min_delta_ns, it wouldn't really work since process time accounting would be wrong. Instead, we should allow the tick_sched_timer in cases (c) and (d) to have runtime configurable period, and then scale the time value accordingly before passing to account_system_time. This is probably something the Xen folks will want also, since I think Xen itself only gets 100hz hard timer, and so it can implement at best a oneshot virtual timer with 100hz resolution. Any objections to us doing something like this?

3) clockevent set_next_event interface is suboptimal for paravirt (and probably realtime-ish uses). The problem is that the expiry is passed as a relative time. On paravirt, an arbitrary amount of (stolen) time may have passed since the delta was computed and when the timer device is programmed, causing that next interrupt to be too far out in the future. It seems a better interface for set_next_event would be to pass the current time and the absolute expiry. Actually, I sent email to Thomas and Ingo about this (and some other clockevents/hrtimer feedback) in July 2006, but never heard back. Thoughts?

I think all the other important points that the vmitime code addressed are also addressed by clocksource/clockevents/NO_HZ.

Finally, while I agree that writing the clockevent callback code is trivial, we will hit bugs when moving over. It is something we need to do, but just takes some time for us to test and shake out the bugs. For example, we are seeing this bug. It seems to me that tick_sched_timer should not be running in softirq context, but only from the hrtimer_interrupt. Is that right? I'm not sure how we get into this case.

Switched to high resolution mode on CPU 0
------------[ cut here ]------------
kernel BUG at /dhecht/linux/testing/torvalds/linux-2.6.21/kernel/posix-cpu-timers.c:1295!
invalid opcode: 0000 [#1]
SMP
Modules linked in:
CPU: 0
EIP: 0062:[<c0137801>] Not tainted VLI
EFLAGS: 00010202 (2.6.21-rc2-smp #29)
EIP is at run_posix_cpu_timers+0x24/0x6a7
eax: 00000200 ebx: c1405be0 ecx: c1405102 edx: c14051d4
esi: c1405ca0 edi: 77ad5c64 ebp: dfed5a50 esp: dfe81db4
ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 006a
Process swapper (pid: 1, ti=dfe80000 task=dfed5a50 task.ti=dfe80000)
Stack: 00000078 c039e760 0000007d c010a885 28f8c4ec 00000000 dfed5a50 c14051c0
dfe81df8 c0122ac3 c14051cc 00000003 00000008 c14051d4 dfed5a50 00000000
dfe81df4 dfe81df4 c1405be0 c1405ca0 77ad5c64 00000000 c013cd3a c1405c18
Call Trace:
[<c010a885>] sched_clock+0x3d/0x69
[<c0122ac3>] scheduler_tick+0x52/0xc9
[<c013cd3a>] tick_sched_timer+0x57/0x9d
[<c013cce3>] tick_sched_timer+0x0/0x9d
[<c013936a>] hrtimer_run_queues+0x1c3/0x21d
[<c012d69b>] run_timer_softirq+0x21/0x177
[<c012a0bd>] __do_softirq+0x66/0xca
[<c012a164>] do_softirq+0x43/0x51
[<c012a26b>] irq_exit+0x38/0x6b
[<c0107ab5>] do_IRQ+0x82/0x99
[<c025cf44>] serial8250_console_write+0x129/0x136
[<c025ec10>] serial8250_console_putchar+0x0/0x78
[<c0105d13>] common_interrupt+0x23/0x30
[<c01267f9>] vprintk+0x29b/0x2ca
[<c02056a8>] idr_get_new_above_int+0x13c/0x216
[<c0126843>] printk+0x1b/0x1f
[<c03ebf31>] audit_init+0x26/0x101
[<c03ebe7e>] ikconfig_init+0x14/0x37
[<c03da92d>] init+0x145/0x23e
[<c0104c06>] ret_from_fork+0x6/0x20
[<c0104d69>] syscall_exit+0x5/0x18
[<c03da7e8>] init+0x0/0x23e
[<c03da7e8>] init+0x0/0x23e
[<c0105fe7>] kernel_thread_helper+0x7/0x10
=======================
Code: e9 b0 fe ff ff 5b c3 55 57 56 53 83 ec 48 89 c5 8d 44 24 40 89 44 24 40 89 44 24 44 e8 19 9a ec 3b 90 8d 74 26 00 f6 c4 02 74 04 <0f> 0b eb fe 8b 95 24 01 00 00 85 d2 74 10 8b 85 08 01 00 00 03
EIP: [<c0137801>] run_posix_cpu_timers+0x24/0x6a7 SS:ESP 006a:dfe81db4
Kernel panic - not syncing: Fatal exception in interrupt


thanks,
Dan
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/