> Virtualization is about not doing that. Sometimes it's necessary (when
> you have made unfixable design mistakes), but just to replace a bus,
> with no advantages to the guest that has to be changed (other
> hypervisors or hypervisorless deployment scenarios aren't).

The problem is that your continued assertion that there is no advantage
to the guest is a completely unsubstantiated claim. As it stands right
now, I have a public git tree that, to my knowledge, is the fastest KVM
PV networking implementation around. It also has capabilities that are
demonstrably not found elsewhere, such as the ability to render generic
shared-memory interconnects (scheduling, timers), interrupt-priority
(qos), and interrupt-coalescing (exit-ratio reduction). I designed each
of these capabilities after carefully analyzing where KVM was coming up
short.
Those are facts.
I can't easily prove which of my new features alone is what makes it
special per se, because I don't have unit tests that break each part
down. What I _can_ state is that it's the fastest and most
feature-rich KVM-PV tree that I am aware of, and others may download and
test it themselves to verify my claims.
The disproof, on the other hand, would be a counterexample that
still meets all the performance and feature criteria under the same
conditions while maintaining the existing ABI. To my knowledge, this
doesn't exist.
Therefore, if you believe my work is irrelevant, show me a git tree that
accomplishes the same feats in a binary compatible way, and I'll rethink
my position. Until then, complaining about lack of binary compatibility
is pointless since it is not an insurmountable proposition, and the one
and only available solution declares it a required casualty.
> Well, Xen requires pre-translation (since the guest has to give the host
> (which is just another guest) permissions to access the data).

Actually, I am not sure that it does require pre-translation. You might
be able to use the memctx->copy_to/copy_from scheme in post-translation
as well, since those would be able to communicate with something like the
Xen kernel. But I suppose either method would result in extra exits, so
there is no distinct benefit to using vbus there... as you say below,
"they're just different".
The biggest difference is that my proposed model gets around the notion
that the entire guest address space can be represented by an arbitrary
pointer. For instance, the copy_to/copy_from routines take a GPA, but
may use something indirect, like a DMA controller, to access that GPA. On
the other hand, virtio fully expects a viable pointer to come out of the
interface, IIUC. This is in part what makes vbus more adaptable to non-virt.
>> An interesting thing here is that you don't even need a fancy
>> multi-homed setup to see the effects of my exit-ratio reduction work:
>> even single-port configurations suffer from the phenomenon, since many
>> devices have multiple signal flows (e.g. network adapters tend to have
>> at least 3 flows: rx-ready, tx-complete, and control-events (link-state,
>> etc.)). What's worse is that the flows are often indirectly related (for
>> instance, many host adapters will free tx skbs during rx operations), so
>> you tend to get bursts of tx-completes at the same time as rx-ready. If
>> the flows map 1:1 with the IDT, they will suffer the same problem.
>
> You can simply use the same vector for both rx and tx and poll both at
> every interrupt.

Yes, but that has its own problems: e.g. additional exits, or at least
additional overhead figuring out what happened each time. This is even
more important as we scale out to MQ, which may have dozens of queue
pairs. You really want finer-grained signal-path decode if you want
peak performance.
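As an illustration of what I mean by signal-path decode (this is my own
sketch, not code from either tree), a shared-memory pending mask lets
the guest service only the flows that actually fired, instead of
polling every ring on every interrupt:

/*
 * Illustration only: several flows share one interrupt vector, and the
 * host sets a per-flow bit in shared memory before injecting it.  The
 * guest ISR decodes the mask rather than polling each ring.  In real
 * code the snapshot-and-clear must be an atomic exchange.
 */
#include <stdint.h>

enum {
	FLOW_RX_READY    = 1u << 0,
	FLOW_TX_COMPLETE = 1u << 1,
	FLOW_CTRL_EVENT  = 1u << 2,	/* link-state changes, etc. */
};

struct shared_state {
	volatile uint32_t pending;	/* set by host, cleared by guest */
};

static void nic_isr(struct shared_state *s)
{
	uint32_t pending = s->pending;	/* atomic xchg in real code */

	s->pending = 0;

	if (pending & FLOW_RX_READY) {
		/* poll the rx ring */
	}
	if (pending & FLOW_TX_COMPLETE) {
		/* reap completed tx buffers */
	}
	if (pending & FLOW_CTRL_EVENT) {
		/* handle link-state, etc. */
	}
}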
>> It's important to note here that we are actually looking at the interrupt
>> rate, not the exit rate (which is usually a multiple of the interrupt
>> rate, since you have to factor in as many as three exits per interrupt
>> (IPI, window, EOI)). Therefore we saved about 18k interrupts in this 10
>> second burst, but we may have actually saved up to 54k exits in the
>> process. This is only over a 10 second window at GigE rates, so YMMV.
>> These numbers get even more dramatic on higher-end hardware, but I
>> haven't had a chance to generate new numbers yet.
>
> (irq window exits should only be required on a small percentage of
> interrupt injections, since the guest will try to disable interrupts for
> short periods only)

Good point. You are probably right. Certainly the other two remain, however.

Ultimately, the fastest exit is the one you do not take. That is what I
am trying to achieve.
>> The even worse news for 1:1 models is that the ratio of
>> exits-per-interrupt climbs with load (exactly when it hurts the most),
>> since that is when the probability that the vcpu will need all three
>> exits is the highest.
>
> Requiring all three exits means the guest is spending most of its time
> with interrupts disabled; that's unlikely.

(see "softirqs" above)
> Thanks for the numbers. Are those 11% attributable to rx/tx
> piggybacking from the same interface?

It's hard to tell, since I am not instrumented to discern the difference
in this run. I do know from previous traces on the 10GE rig that the
Chelsio T3 that I am running reaps the pending-tx ring at the same time
as rx polling, so it's very likely that both events are often
coincident, at least there.
> Also, 170K interrupts -> 17K interrupts/sec -> 55 kbit/interrupt ->
> 6.8 kB/interrupt. Ignoring interrupt merging and assuming equal rx/tx
> distribution, that's about 13 kB/interrupt. Seems rather low for a
> saturated link.

I am not following: do you suspect that I have too few interrupts to
represent 940 Mb/s, or that I have too little data per interrupt and this
ratio should be improved?
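(For reference, the arithmetic above works out as follows, assuming the
940 Mb/s figure and the 10-second sample; the even rx/tx split is an
assumption rather than a measurement:

\[
\frac{170{,}000 \text{ interrupts}}{10 \text{ s}} = 17{,}000 \text{ interrupts/s},
\qquad
\frac{940 \text{ Mbit/s}}{17{,}000 \text{ /s}} \approx 55 \text{ kbit} \approx 6.9 \text{ kB per interrupt},
\]

and if roughly half of the interrupts belong to each direction, each
direction carries about 940 Mbit/s / 8,500 /s, i.e. ~13.8 kB of its own
traffic per interrupt.)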
>> Everyone is of course entitled to an opinion, but the industry as a
>> whole would disagree with you. Signal-path routing (1:1, aggregated,
>> etc.) is at the discretion of the bus designer. Most buses actually do
>> _not_ support 1:1 with the IDT (think USB, SCSI, IDE, etc.).
>
> With standard PCI, they do not. But all modern host adapters support
> MSI and they will happily give you one interrupt per queue.

While MSI is a good technological advancement for PCI, I was referring
to the signal:IDT ratio. MSI would still classify as 1:1.
> Let's do that then. Please reserve the corresponding comparisons from
> your side as well.

That is quite the odd request. My graphs are all built using readily
available code and open tools, and they do not speculate as to what someone
else may come up with in the future. They reflect what is available
today. Do you honestly think I should wait indefinitely for a competing
idea to catch up before I talk about my results? That's certainly an
interesting perspective.

With all due respect, the only red herring is your unsubstantiated
claim that my results do not matter.
>> This is not to mention that vhost-net does nothing to address our other
>> goals, like scheduler coordination and non-802.x fabrics.
>
> What are scheduler coordination and non-802.x fabrics?

We are working on real-time, IB, and QoS, for example, in addition to
the now well-known 802.x venet driver.
>>> Right, when you ignore the points where they don't fit, it's a perfect
>>> mesh.
>>
>> Where doesn't it fit?
>
> (avoiding infinite loop)

I'm serious. Where doesn't it fit? Point me at a URL if it's already been
discussed.
>> Citation please. Afaict, the one use case that we looked at for vhost
>> outside of KVM failed to adapt properly, so I do not see how this is
>> true.
>
> I think Ira said he can make vhost work?

Not exactly. It kind of works for 802.x only (albeit awkwardly), because
there is no strong distinction between "resource" and "consumer" with
Ethernet. So you can run it inverted without any serious consequences
(at least, none stemming from the inversion itself). Since the x86
boards are the actual resource providers in his system, other device
types will fail to map to the vhost model properly, like disk-io or
consoles, for instance.
> virtio-net over pci is deployed. Replacing the backend with vhost-net
> will require no guest modifications.

That _is_ a nice benefit, I agree. I just do not agree that it's a hard
requirement.
> Obviously virtio-net isn't deployed in non-virt. But if we adopt vbus,
> we have to migrate guests.

As a first step, let's just shoot for "support" instead of "adopt".

I'll continue to push patches to you that help with interfacing to the guest
in a vbus-neutral way (like irqfd/ioeventfd), and we can go from there.
Are you open to this work, assuming it passes normal review cycles, etc.?
It would presumably be of use to others that want to interface to a
guest (e.g. vhost) as well.
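For anyone following along, here is a rough userspace sketch of how a
backend gets wired up through those two interfaces; the PIO port and
GSI are made up for illustration, and error handling is minimal:

/*
 * Hedged sketch (not code from the vbus or vhost trees): wiring a
 * backend to a guest with KVM's eventfd interfaces.
 */
#include <sys/eventfd.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int wire_up_fast_paths(int vmfd)
{
	int kick_fd = eventfd(0, 0);	/* guest -> host doorbell   */
	int irq_fd  = eventfd(0, 0);	/* host  -> guest interrupt */

	/* Complete guest PIO writes to port 0xc000 by signalling kick_fd,
	 * without a heavyweight exit to userspace. */
	struct kvm_ioeventfd io = {
		.addr  = 0xc000,		/* illustrative port */
		.len   = 2,
		.fd    = kick_fd,
		.flags = KVM_IOEVENTFD_FLAG_PIO,
	};
	if (ioctl(vmfd, KVM_IOEVENTFD, &io) < 0)
		return -1;

	/* Inject GSI 10 into the guest whenever irq_fd is signalled,
	 * e.g. by an in-kernel backend. */
	struct kvm_irqfd irq = {
		.fd  = irq_fd,
		.gsi = 10,			/* illustrative GSI */
	};
	if (ioctl(vmfd, KVM_IRQFD, &irq) < 0)
		return -1;

	return 0;
}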
>> And once those events are fed, you still need a
>> PV layer to actually handle the bus interface in a high-performance
>> manner, so it's not like you really have a "native" stack in either case.
>
> virtio-net doesn't use any pv layer.

Well, it does when you really look closely at how it works. For one, it
has the virtqueue library, which would be (or at least _should be_)
common to all virtio-X adapters, etc. Even if this layer is
collapsed into each driver on the Windows platform, it's still there
nonetheless.
>>> [...] that doesn't need to be retrofitted.
>>
>> No, that is incorrect. You have to heavily modify the PCI model with
>> layers on top to get any kind of performance out of it. Otherwise, we
>> would just use Realtek emulation, which is technically the native PCI
>> you are apparently so enamored with.
>
> virtio-net doesn't modify the PCI model.

Sure it does. It doesn't use MMIO/PIO BARs for registers, it uses
vq->kick(). It doesn't use PCI config space, it uses virtio->features.
It doesn't use PCI interrupts, it uses a callback on the vq, etc.
You would never use raw "registers", as the exit rate would crush you.
You would never use raw interrupts, as you need a shared-memory based
mitigation scheme.

IOW: virtio has a device-model layer that tunnels over PCI. It doesn't
actually use PCI directly. This is in fact what allows the Linux
version to work over lguest, s390, and vbus in addition to PCI.
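In other words, the driver-facing contract is roughly the shape below.
This is a simplified illustration, not the exact in-tree definitions;
the names only show where a transport (virtio-pci, lguest, s390, or a
vbus adapter) would plug in:

/*
 * Simplified illustration of the layering described above; not the real
 * kernel structures.  A virtio driver only sees virtqueue and config
 * operations; whatever sits underneath supplies them.
 */
struct virtqueue;

struct virtqueue_ops {
	int  (*add_buf)(struct virtqueue *vq, void *data);
	void (*kick)(struct virtqueue *vq);	/* doorbell, however the
						 * transport implements it */
	void *(*get_buf)(struct virtqueue *vq, unsigned int *len);
};

struct virtio_transport_ops {
	unsigned long (*get_features)(void *dev);
	struct virtqueue *(*find_vq)(void *dev, unsigned int index,
				     void (*callback)(struct virtqueue *vq));
};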
> You can have dynamic MSI/queue routing with virtio, and each MSI can be
> routed to a vcpu at will.

Can you arbitrarily create a new MSI/queue on a per-device basis on the
fly? We want to do this for some upcoming designs. Or do you need to
predeclare the vectors when the device is hot-added?
>> [...] priority, and coalescing, etc.
>
> Do you mean interrupt priority? Well, apic allows interrupt priorities
> and Windows uses them; Linux doesn't. I don't see a reason to provide
> more than native hardware.

The APIC model is not optimal for PV, given the exits required for a
basic operation like an interrupt injection, and it has scaling/flexibility
issues with its 16:16 priority mapping.
OTOH, you don't necessarily want to rip it out because of all the
additional features it has like the IPI facility and the handling of
many low-performance data-paths. Therefore, I am of the opinion that
the optimal placement for advanced signal handling is directly at the
bus that provides the high-performance resources. I could be convinced
otherwise with a compelling argument, but I think this is the path of
least resistance.
> N:1 breaks down on large guests since one vcpu will have to process all
> events.

Well, first of all, that is not necessarily true. Some high-performance
buses like SCSI and FC work fine with an aggregated model, so it's not a
foregone conclusion that aggregation kills SMP I/O performance. This is
especially true when you add coalescing on top, like AlacrityVM does.

I do agree that other subsystems, like networking for instance, may
sometimes benefit from flexible signal routing because of multiqueue,
etc., for particularly large guests. However, the decision to make the
current kvm-connector used in AlacrityVM aggregate one priority FIFO per
IRQ was an intentional design tradeoff. My experience with my target
user base is that these data centers are typically deploying 1-4 vcpu
guests, so I optimized for that. YMMV, so we can design a different
connector, or a different mode of the existing connector, to accommodate
large guests as well if that is something desirable.
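To sketch what "one priority FIFO per IRQ" amounts to (my own
illustration; the actual kvm-connector structures in the AlacrityVM tree
differ), the aggregated model is basically a single shared ring drained
by one ISR:

/*
 * Illustration only: many device signal paths feed one shared-memory
 * ring that a single guest ISR drains.  The host only needs to inject
 * an interrupt when the ring goes from empty to non-empty, which is
 * where the coalescing/exit-ratio reduction comes from.  Real code
 * needs memory barriers and an overflow policy.
 */
#include <stdint.h>

#define EVQ_SIZE 256

struct evq_entry {
	uint32_t source;	/* which device/signal-path fired          */
	uint32_t priority;	/* carried so dispatch can prioritize
				 * (prioritized dispatch not shown)        */
};

struct evq {
	volatile uint32_t head;	/* producer (host) index  */
	volatile uint32_t tail;	/* consumer (guest) index */
	struct evq_entry ring[EVQ_SIZE];
};

static void evq_isr(struct evq *q, void (*dispatch)(uint32_t source))
{
	while (q->tail != q->head) {
		struct evq_entry *e = &q->ring[q->tail % EVQ_SIZE];

		dispatch(e->source);
		q->tail++;
	}
}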
> You could do N:M, with commands to change routings, but where's
> your userspace interface?

Well, we should be able to add that when/if it's needed. I just don't
think the need is there yet. KVM tops out at 16 IIUC anyway.
> you can't tell from /proc/interrupts which
> vbus interrupts are active

It should be trivial to add some kind of *fs display for this. I will
fix it shortly.
> The larger your installed base, the more difficult it is. Of course
> it's doable, but I prefer not doing it and instead improving things in a
> binary backwards-compatible manner. If there is no choice we will bow
> to the inevitable and make our users upgrade. But at this point there
> is a choice, and I prefer to stick with vhost-net until it is proven
> that it won't work.

Fair enough. But note that you are likely going to need to respin your
existing drivers anyway to gain peak performance, since there are known
shortcomings in the virtio-pci ABI as it stands today (like queue
identification in the interrupt hotpath). So that pain is coming one way
or the other.
> One of the benefits of virtualization is that the guest model is
> stable. You can live-migrate guests and upgrade the hardware
> underneath. You can have a single guest image that you clone to
> provision new guests. If you switch to a new model, you give up those
> benefits, or you support both models indefinitely.

I understand what you are saying, but I don't buy it. If you add a new
feature to an existing model, even without something as drastic as a new
bus, you still have the exact same dilemma: the migration target needs
feature parity with the features consumed by the guest. It's really the
same no matter what, unless you never add guest-visible features.
> Note even hardware nowadays is binary compatible. One e1000 driver
> supports a ton of different cards, and I think (not sure) newer cards
> will work with older drivers, just without all their features.

Noted, but that is not really the same thing. That's more like adding a
feature bit to virtio, not replacing GigE with 10GE.
>> If and when that becomes a priority concern, that would be a function
>> transparently supported in the BIOS shipped with the hypervisor, and
>> would thus be invisible to the user.
>
> No, you have to update the driver in your initrd (for Linux)

That's fine, the distros generally do this automatically when you load
the updated KMP package.

> or properly install the new driver (for Windows). It's especially
> difficult for Windows.

What is difficult here? I never seem to have any problems, and I have
all kinds of guests from XP to Win7.
> I don't want to support both virtio and vbus in parallel. There's
> enough work already.

Until I find some compelling reason indicating that I was wrong about all
of this, I will continue building a community around the vbus code base
and developing support for its components anyway. So that effort is
going to happen in parallel regardless.

This is purely a question of whether you will work with me to make
vbus an available option in upstream KVM or not.
> If we adopt vbus, we'll have to deprecate and eventually kill off virtio.

That's more hyperbole. virtio is technically fine and complementary as
it is. No one says you have to do anything drastic w.r.t. virtio. If
you _did_ adopt vbus, perhaps you would want to optionally deprecate
vhost or possibly the virtio-pci adapter, but that is about it. The
rest of the infrastructure should be preserved if it was designed properly.
> PCI is continuously updated, with MSI, MSI-X, and IOMMU support being
> some recent updates. I'd like to ride on top of that instead of having
> to clone it for every guest I support.

While that is a noble goal, one of the points I keep making, as someone
who has built the stack both ways, is that almost none of the PCI stack
is actually needed to get the PV job done. The part you do need is
primarily a function of the generic OS stack and is trivial to interface
with anyway.
>>>> As an added bonus, its device-model is modular. A developer can
>>>> write a new device model, compile it, insmod it to the host kernel,
>>>> hotplug it to the running guest with mkdir/ln, and then come back out
>>>> again (hotunplug with rmdir, rmmod, etc). They may do all of this
>>>> without taking the guest down, and while eating QEMU-based IO
>>>> solutions for breakfast, performance-wise.
>>>>
>>>> Afaict, qemu can't do either of those things.
>>>
>>> We've seen that herring before,
>>
>> Citation?
>
> It's the compare venet-in-kernel to virtio-in-userspace thing again.

No, you said KVM has "userspace hotplug". I retorted that vbus not only
has hotplug, it also has a modular architecture. You then countered
that this feature is a red herring. If this was previously discussed
and rejected for some reason, I would like to know the history. Or did
I misunderstand you?
For one, we have the common layer of shm-signal and IOQ. These
libraries were designed to be reused on both sides of the link.
Generally, shm-signal has no counterpart in the existing model, though
its functionality is integrated into the virtqueue.
From there, going down the stack, it looks like
(guest-side)
|-------------------------
| venet (competes with virtio-net)
|-------------------------
| vbus-proxy (competes with pci-bus, config+hotplug, sync/async)
|-------------------------
| vbus-pcibridge (interrupt coalescing + priority, fastpath)
|-------------------------
|
|-------------------------
| vbus-kvmconnector (interrupt coalescing + priority, fast-path)
|-------------------------
| vbus-core (hotplug, address decoding, etc)
|-------------------------
| venet-device (ioq frame/deframe to tap/macvlan/vmdq, etc)
|-------------------------
If you want to use virtio, insert a virtio layer between the "driver"
and "device" components at the outer edges of the stack.
> To me, compatible means I can live-migrate an image to a new system
> without the user knowing about the change. You'll be able to do that
> with vhost-net.

As soon as you add any new guest-visible feature, you are in the same
exact boat.
>> No, that is incorrect. For one, vhost uses them on a per-signal-path
>> basis, whereas vbus only has one channel for the entire guest->host.
>
> You'll probably need to change that as you start running smp guests.

The hypercall channel is already SMP-optimized over a single PIO path,
so I think we are covered there. See "fastcall" in my code for details:

http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=drivers/vbus/pci-bridge.c;h=81f7cdd2167ae2f53406850ebac448a2183842f2;hb=fd1c156be7735f8b259579f18268a756beccfc96#l102

It just passes the cpuid into the PIO write so we can have parallel,
lockless "hypercalls". This forms the basis of our guest scheduler
support, for instance.
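Conceptually it looks something like the rough sketch below. This is
only an illustration of the idea, not the actual pci-bridge.c code; the
port number, layout, and names are made up:

/*
 * Conceptual sketch, not the real fastcall implementation.  Each vcpu
 * fills in its own descriptor and then writes its id to the bridge's
 * PIO doorbell; the host services that vcpu's descriptor during the
 * synchronous PIO exit, so vcpus never contend on a shared channel.
 */
#include <stdint.h>

#define MAX_VCPUS	64
#define DOORBELL_PORT	0xc000		/* illustrative port */

struct fastcall_desc {
	volatile uint32_t func;		/* requested call        */
	volatile uint64_t arg;		/* call argument         */
	volatile int32_t  result;	/* filled in by the host */
};

static struct fastcall_desc descs[MAX_VCPUS];	/* shared with the host */

static inline void pio_write32(uint16_t port, uint32_t val)
{
	asm volatile("outl %0, %1" : : "a"(val), "Nd"(port));
}

static int32_t fastcall(uint32_t vcpu_id, uint32_t func, uint64_t arg)
{
	struct fastcall_desc *d = &descs[vcpu_id];

	d->func = func;
	d->arg  = arg;
	pio_write32(DOORBELL_PORT, vcpu_id);	/* host runs the call here */
	return d->result;
}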
> You could implement virtio-net hardware if you wanted to.

Technically you could build vbus in hardware too, I suppose, since the
bridge is PCI-compliant. I would never advocate it, however, since many
of our tricks do not matter if it's real hardware (e.g. they are
optimized for the costs associated with a VM).