Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

From: Gregory Haskins
Date: Mon Jun 22 2009 - 16:06:45 EST

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>> Michael S. Tsirkin wrote:
>>> On Mon, Jun 22, 2009 at 11:51:42AM -0700, Davide Libenzi wrote:
>>>> A file* based kernel-to-kernel interface is rather wrong IMO.
>>> But eventfd_ctx should work fine.
>> Yeah, and I guess we can always just say that qemu can't close the fd or
>> something. Seems hacky, but it might work if Davide insists we need his
>> change.
> Continuing here, since there's no reason of having many subthreads talking
> about the same thing.
> Can you make a detailed example of what you're trying to achieve (no Hint
> Mode, please)?
> As it sounds to me, that you need a consumer/producer reference counting,
> to cover your scenario correctly.

Well, one of them was already briefly mentioned (the PCI-passthrough
thing). I am not personally working on this part (yet, anyway).

Another example of something I am actually working on as we speak would
be for this thing we are building called "virtual-bus". It is a way to
build/deploy device models directly in the kernel.

In either of these cases, we have this concept of allowing the guest to
notify the host, or vice versa, that something happened. Typically this
would be in reference to some chunk of shared memory, and the signaling
is telling the other side "I changed something, go look".

Without going into a ton of detail (unless, of course, you want it), we
are generalizing the signaling infrastructure (irqfd and iosignalfd) so
that something like PCI-passthrough or vbus is not directly coupled to
KVM. They communicate with KVM purely in terms of (among other things)
these irqfd/iosignalfd interfaces.

Using vbus as an example (though others are similar): vbus would
primarily exist as a kernel model. However, there would be a small
device model in qemu-kvm userspace to publish something like a PCI
device that declares its resource requirements to the guest. Some of
those requirements would be things like how many interrupts it needs,
and what IO ranges it supports, etc. When the guest programs the PCI
space, it maps the resources from its own world into the virtual PCI
resources emulated in qemu.

So up in userspace, the vbus pci-device would have an open reference to
the kvm guest (derived from /dev/kvm) and an open reference to a vbus
(derived from /dev/vbus). Let's call these kvmfd and vbusfd,
respectively. For something like an interrupt, we would hook the point
where the PCI-MSI interrupt is assigned, and would do the following:

    gsi = kvm_irq_route_gsi();
    fd = eventfd(0, 0);
    ioctl(kvmfd, KVM_IRQFD_ASSIGN, {fd, gsi});
    ioctl(vbusfd, VBUS_SHMSIGNAL_ASSIGN, {sigid, fd});

So userspace orchestrated the assignment of this one eventfd to a KVM
consumer, and a VBUS producer. The two subsystems do not care about the
details of the other side of the link, per se. VBUS just knows that it
can eventfd_signal() its memory region to tell whomever is listening
that it changed. Likewise, KVM just knows to inject "gsi" when it gets
signalled. You could equally have given "fd" to a userspace thread for
either producer or consumer roles, or any other combination.
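
As a rough illustration of the "userspace thread for either producer or
consumer roles" case, here is a minimal sketch using only the standard
eventfd(2) read/write semantics. The helper names (signal_event,
wait_event, demo_signal_roundtrip) are hypothetical and are not part of
KVM, vbus, or the patch under discussion:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Producer role: add n to the eventfd counter ("I changed something"). */
static int signal_event(int fd, uint64_t n)
{
    return write(fd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Consumer role: block until signalled, then fetch and reset the counter. */
static int wait_event(int fd, uint64_t *count)
{
    return read(fd, count, sizeof(*count)) == (ssize_t)sizeof(*count) ? 0 : -1;
}

/*
 * Hypothetical demo: signal twice, consume once.
 * Returns the counter value the consumer saw, or 0 on any failure.
 */
static uint64_t demo_signal_roundtrip(void)
{
    uint64_t count = 0;
    int fd = eventfd(0, 0);

    if (fd < 0)
        return 0;
    if (signal_event(fd, 1) == 0 && signal_event(fd, 1) == 0)
        wait_event(fd, &count);   /* count stays 0 if the read fails */
    close(fd);
    return count;
}
```

Without EFD_SEMAPHORE, the two signals coalesce into a single wakeup
whose count is their sum, which fits the "go look" style of notification
described above: the consumer learns that something changed, not how
many times.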

If we were doing PCI-passthrough, substitute the last SHMSIGNAL_ASSIGN
ioctl call with some PCI_PASSTHROUGH_ASSIGN verb and you get the idea.

The important thing is that once this is established, userspace doesn't
necessarily care about the fd anymore. So now the question is: do we
keep it around for other things? Do we keep it around because we don't
want KVM to see the POLLHUP, or do we address the "release" code so that
it works even if userspace issued close(fd) at this point? I am not
sure what the answer is, but this is the scenario we are concerned with
in this thread. In the example above, vbus is free to produce events on
its eventfd until it gets a SHMSIGNAL_DEASSIGN request.
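
The close(fd) scenario can be approximated from userspace, purely as an
analogy. In the sketch below, dup(2) stands in for a second holder of
the same underlying eventfd file; this is an assumption made for
illustration and is not how KVM or vbus take their references in the
kernel. The function name survives_close is hypothetical:

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Returns 1 if an event signalled after close(fd) is still delivered
 * through a second reference to the same eventfd, 0 if it is not,
 * and -1 if setup fails.
 */
static int survives_close(void)
{
    uint64_t one = 1, seen = 0;
    int fd, held, ok;

    fd = eventfd(0, EFD_NONBLOCK);
    if (fd < 0)
        return -1;

    /* Stand-in for a reference held by the other side of the link. */
    held = dup(fd);
    if (held < 0) {
        close(fd);
        return -1;
    }

    /* Userspace "doesn't care about the fd anymore". */
    close(fd);

    /* The eventfd is still alive: signal and consume via the held ref. */
    ok = write(held, &one, sizeof(one)) == (ssize_t)sizeof(one)
      && read(held, &seen, sizeof(seen)) == (ssize_t)sizeof(seen)
      && seen == 1;

    close(held);
    return ok ? 1 : 0;
}
```

Because both descriptors refer to one open file, the file is only
released when the last of them is closed. Whether the in-kernel producer
and consumer should hold that kind of reference, or instead handle the
release notification, is the open question raised above.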


> - Davide