Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifierrace conditions

From: Gregory Haskins
Date: Mon Jun 22 2009 - 14:32:22 EST

Davide Libenzi wrote:
> On Mon, 22 Jun 2009, Gregory Haskins wrote:
>> I am probably confused or perhaps have the wrong terminology, but isnt
>> that "ok". I am concerned about the consumer (the guy getting the
>> POLLINs) to be able to detect POLLHUP when the last producer
>> (f_ops->write() from userspace, eventfd_signal() from kernel) goes away.
>> Consider the following sequence:
>> -------------------
>> userspace calls "fd = eventfd()", and gives one to KVM as an irqfd, and
>> the other to some PCI-passthrough device.
>> The kvm/irqfd side acquires a kref, the pci side acquires a file. At
>> this moment, userspace has the fd, and the pci device has the file (for
>> eventfd_signal()). The fget() count is 2. Userspace closes the fd
>> because its done with it, and the count drops to 1.
>> Some time later, pci does an fput(), and KVM sees the POLLHUP and cleans up.
>> -------------------
>> In this new model, the POLLHUP would have gone out as soon as userspace
>> closed the fd, even though the intended producer (the PCI device) and
>> the consumer (the KVM guest) are still up and running. This doesnt seem
>> right to me. Or am I missing something?
> What you're doing there, is setting up a kernel-to-kernel (since
> userspace only role is to create the eventfd) communication, using a file*
> as accessory. That IMO is plain wrong.
> If userspace is either the producer, or the consumer, and you need to
> handle userspace leaving the building, you need to:
> file = eventfd_fget(fd);
> ctx = eventfd_ctx_get(file); /* Eventually, if producer */
> eventfd_pollcb_register(file, ...);
> fput(file);
> In your case of kernel-to-kernel scenario, why would you need eventfd at
> all, if userspace role in that model is simply to create it?

The general thesis is for decoupling of the two subsystems. In order to
do this, you need some form of polymorphism and an intermediate "handle"
mechanism which is userspace friendly. File-descriptors already fit
this role neatly, with the "int fd" being the handle, and the f_ops
being the polymorphic interface. Eventfd is of course, a subclass of
this concept in that it has these same general properties but with
signaling semantics (non-blocking collapsible events, etc).

Say, for example, you wanted disk IO completion events to generate an
interrupt into a guest. One way to do this would, of course, modify all
the disk-io code so it knows how to directly inject a KVM guest
interrupt. While this would work, someone would undoubtedly get flamed
for such a suggestion ;)

Another way to do it is to treat the AIO eventfd as the hook point.
IIUC AIO already knows how to be an eventfd producer. KVM, by virtue of
irqfd, already knows how to be an eventfd consumer. So now kvm can
consume AIO, or it can consume userspace events equally well, and
without modification. Neither side needs to know about the other per
se, other than the details on how to use the eventfd interface.

Don't get me wrong: We expect userspace to use all this stuff too. I
just expect that we will see all permutations of producer/consumer +
userspace/kernel combinations, so I want to retain that "all producers
have left" notification feature set. Today eventfd supports producers
or consumers in userspace, and producers in the kernel. This new work
we are doing adds consumer support in the kernel. Kernel to kernel is
just a natural extension of that.


> There are more effective ways to have in kernel communication channels,
> than resorting to userspace link facilities like eventfd.
> - Davide
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at

Attachment: signature.asc
Description: OpenPGP digital signature