Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree

From: Thomas Gleixner
Date: Wed Mar 07 2007 - 16:34:51 EST

Next message: Davide Libenzi: "Re: [patch 1/4] signalfd v1 - signalfd core ..."
Previous message: Linus Torvalds: "Re: [patch 8/6] mm: fix cpdfio vs fault race"
In reply to: Jeremy Fitzhardinge: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Next in thread: Dan Hecht: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, 2007-03-07 at 13:07 -0800, Jeremy Fitzhardinge wrote:
> Thomas Gleixner wrote:
> > I tend to disagree. The clockevents infrastructure was designed to cope
> > with the existing mess of real hardware. The discussion over the last
> > days exposed me to even more exotic designs than the hardware vendors
> > were able to deliver until now.
> >
>
> It's a different but related problem domain. It's also an increasingly
> common execution environment for a kernel to find itself in. Dealing
> with proper paravirtualized timer devices is a big improvement over
> trying to reliably deal with fully virtualized hardware timers, which
> simply can't make the same guarantees that real hardware can make - such
> as "you will definitely get N ns of CPU time between doing the
> delta->absolute computation and programming the match register".

That's exactly the reason why we want only _ONE_ proper virtualized
timer device instead of 10 new variants of broken hardware.

> > I know exactly where you are heading:
> >
> > Offload the handling of hypervisor design decisions to the kernel and
> > let us deal with that. So we need to implement 128 bit math to convert
> > back and forth and I expect more interesting things to creep up.
> >
>
> I wouldn't put it that way. We've been getting a lot of pressure to
> keep the pv_ops interface as small as possible. Reusing existing kernel
> interfaces rather than making up new ones is a good way to do that. The
> clock infrastructure certainly cleans things up; earlier Xen patches
> made a complete copy of the old kernel/time.c and hacked it around,
> which isn't what anyone wants to do.

All you need is exactly ONE paravirt clockevent device and ONE paravirt
clocksource for _ALL_ hypervisors. Cast that into stone with a
paravirt_ops->clockwahtever interface and we are all happy.

> > All this is of _NO_ use and benefit for the kernel itself.
> >
>
> Lots of people want to run Linux in virtual machines. If we can make
> sane kernel changes to help those users, then that is of use an benefit
> to the kernel.

The above will give a real benefit as it is a well defined interface,
which can be verified on both ends.

> > Real hardware copes well with relative deltas for the events, even when
> > it is match register based. I thought long about the support for
> > absolute expiry values in cycles and decided against them to avoid that
> > math hackery, which you folks now demand.
> >
>
> Not really. Xen and VMI interfaces both use absolute monotonic time for
> timeouts, which is certainly a common case for such interfaces
> (pthread_cond_timedwait, for example). Converting delta to absolute is
> clearly simple, but it does introduce an added bit of non-determinism if
> your CPU can be preempted from outside at any time. I presume SMM or
> similar interrupts can cause the same problem on real hardware.

As I said before: I have no objection against expanding / changing the
clockevents interface to deliver absolute expiry time, which we have
already handy.

I just refuse for a good reason to convert it from ktime_t (nanoseconds)
to an absolute cycle value. This can be done on the hypervisor side of
the paravirt clock event device. Same applies for clocksources. The ones
which need nanosecond from/to whatever conversion can do it _IN_ the
hypervisor and not in 10 different grades of madness in the kernel code.

> > We can optimize this by skipping the conversion via a feature flag.

> The clocksource needed the shift for ntp warping. Does the clockevent
> need a shift at all? Could I just set mult/shift to 1/0?

Yes.

> > Your implementation is almost the perfect prototype, if you move the
> > 128 bit hackery into the hypervisor and hide it away from the kernel :)
> >
> The point is to use the tsc to avoid making any hypercalls, so dealing
> with the tsc->ns conversion has to happen on the guest side somehow.

I understand that you want to make this as fast as possible, but TSC is
broken in more than one way and it just makes me barf, when we have yet
another way of dealing with it in the kernel.

Please keep the paravirt interface abstract and treat it in the same way
we treat the kernel - userspace API. The kernel hides all this hardware
crap away from the user space and the same applies for a sane paravirt
interface. This is also a benefit in terms of portability.

For devices, which already live on top of an abstraction layer in the
kernel, e.g. clocksources, clockevents, interrupts, we can share one
implementation accross multiple platforms.

> > One of these is perfectly fine for _ALL_ of the hypervisor folks.
> > Anything else is just a backwards decision for the kernel.
> >
> That would certainly be ideal. We'll look at the xen, vmi, lguest and
> kvm paravirtualized time models and see how much they really have in
> common. I'm a bit curious about how vmi's time events make their way
> back into the system.

By the crude mechanism I'm fighting.

tglx

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Davide Libenzi: "Re: [patch 1/4] signalfd v1 - signalfd core ..."
Previous message: Linus Torvalds: "Re: [patch 8/6] mm: fix cpdfio vs fault race"
In reply to: Jeremy Fitzhardinge: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Next in thread: Dan Hecht: "Re: + stupid-hack-to-make-mainline-build.patch added to -mm tree"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]