Re: [PATCH v7 2/2] kvm: KVM_EOIFD, an eventfd for EOIs
From: Michael S. Tsirkin
Date: Mon Jul 30 2012 - 20:01:04 EST
On Mon, Jul 30, 2012 at 10:22:10AM -0600, Alex Williamson wrote:
> On Sun, 2012-07-29 at 17:54 +0300, Michael S. Tsirkin wrote:
> > On Tue, Jul 24, 2012 at 02:43:22PM -0600, Alex Williamson wrote:
> > > This new ioctl enables an eventfd to be triggered when an EOI is
> > > written for a specified irqchip pin. The first user of this will
> > > be external device assignment through VFIO, using a level irqfd
> > > for asserting a PCI INTx interrupt and this interface for de-assert
> > > and notification once the interrupt is serviced.
> > >
> > > Here we make use of the reference counting of the _irq_source
> > > object allowing us to share it with an irqfd and cleanup regardless
> > > of the release order.
> > >
> > > Signed-off-by: Alex Williamson <alex.williamson@xxxxxxxxxx>
> >
> > > ---
> > >
> > > Documentation/virtual/kvm/api.txt | 21 ++
> > > arch/x86/kvm/x86.c | 2
> > > include/linux/kvm.h | 15 ++
> > > include/linux/kvm_host.h | 13 +
> > > virt/kvm/eventfd.c | 336 +++++++++++++++++++++++++++++++++++++
> > > virt/kvm/kvm_main.c | 11 +
> > > 6 files changed, 398 insertions(+)
> > >
> > > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
> > > index 3911e62..8cd6b36 100644
> > > --- a/Documentation/virtual/kvm/api.txt
> > > +++ b/Documentation/virtual/kvm/api.txt
> > > @@ -1989,6 +1989,27 @@ return the hash table order in the parameter. (If the guest is using
> > > the virtualized real-mode area (VRMA) facility, the kernel will
> > > re-create the VMRA HPTEs on the next KVM_RUN of any vcpu.)
> > >
> > > +4.77 KVM_EOIFD
> > > +
> > > +Capability: KVM_CAP_EOIFD
> > > +Architectures: x86
> > > +Type: vm ioctl
> > > +Parameters: struct kvm_eoifd (in)
> > > +Returns: 0 on success, < 0 on error
> > > +
> > > +KVM_EOIFD allows userspace to receive interrupt EOI notification
> > > +through an eventfd.
> >
> > I thought about it some more, and I think it should be renamed to an
> > interrupt ack notification than eoi notification.
> > For example, consider userspace that uses threaded interrupts.
> > Currently what will happen is each interrupt will be injected
> > twice, since on eoi device is still asserting it.
>
> I don't follow, why is userspace writing an eoi to the ioapic if it
> hasn't handled the interrupt
It has handled it - it disabled the hardware interrupt.
> and why wouldn't the same happen on bare
> metal?
on bare metal level does not matter as long as interrupt
is disabled.
> > One fix would be to delay event until interrupt is re-enabled.
> > Now I am not asking you to fix this immediately,
> > but I think we should make the interface generic by
> > saying we report an ack to userspace and not specifically EOI.
>
> Using the word "delay" in the context of interrupt delivery raises all
> sorts of red flags for me, but I really don't understand your argument.
I am saying it's an "ack" of interrupt userspace cares about.
The fact it is done by EOI is an implementation detail.
> > > kvm_eoifd.fd specifies the eventfd used for
> > > +notification. KVM_EOIFD_FLAG_DEASSIGN is used to de-assign an eoifd
> > > +once assigned. KVM_EOIFD also requires additional bits set in
> > > +kvm_eoifd.flags to bind to the proper interrupt line. The
> > > +KVM_EOIFD_FLAG_LEVEL_IRQFD indicates that kvm_eoifd.key is provided
> > > +and is a key from a level triggered interrupt (configured from
> > > +KVM_IRQFD using KVM_IRQFD_FLAG_LEVEL). The EOI notification is bound
> > > +to the same GSI and irqchip input as the irqfd. Both kvm_eoifd.key
> > > +and KVM_EOIFD_FLAG_LEVEL_IRQFD must be specified on assignment and
> > > +de-assignment of KVM_EOIFD. A level irqfd may only be bound to a
> > > +single eoifd. KVM_CAP_EOIFD_LEVEL_IRQFD indicates support of
> > > +KVM_EOIFD_FLAG_LEVEL_IRQFD.
> > >
> >
> > Hmm returning the key means we'll need to keep refcounting for source
> > IDs around forever. I liked passing the fd better: make implementation
> > match interface and not the other way around.
>
> False, a source ID has a finite lifecycle. The fd approach was broken.
> Holding the irqfd context imposed too many dependencies between eoifd
> and irqfd necessitating things like one interface disabling another. I
> thoroughly disagree with that approach.
You keep saying this but it is still true: once irqfd
is closed eoifd does not get any more interrupts.
> > > 5. The kvm_run structure
> > > ------------------------
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 9ded39d..8f3164e 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -2171,6 +2171,8 @@ int kvm_dev_ioctl_check_extension(long ext)
> > > case KVM_CAP_PCI_2_3:
> > > case KVM_CAP_KVMCLOCK_CTRL:
> > > case KVM_CAP_IRQFD_LEVEL:
> > > + case KVM_CAP_EOIFD:
> > > + case KVM_CAP_EOIFD_LEVEL_IRQFD:
> > > r = 1;
> > > break;
> > > case KVM_CAP_COALESCED_MMIO:
> > > diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> > > index b2e6e4f..effb916 100644
> > > --- a/include/linux/kvm.h
> > > +++ b/include/linux/kvm.h
> > > @@ -619,6 +619,8 @@ struct kvm_ppc_smmu_info {
> > > #define KVM_CAP_S390_COW 79
> > > #define KVM_CAP_PPC_ALLOC_HTAB 80
> > > #define KVM_CAP_IRQFD_LEVEL 81
> > > +#define KVM_CAP_EOIFD 82
> > > +#define KVM_CAP_EOIFD_LEVEL_IRQFD 83
> > >
> > > #ifdef KVM_CAP_IRQ_ROUTING
> > >
> > > @@ -694,6 +696,17 @@ struct kvm_irqfd {
> > > __u8 pad[20];
> > > };
> > >
> > > +#define KVM_EOIFD_FLAG_DEASSIGN (1 << 0)
> > > +/* Available with KVM_CAP_EOIFD_LEVEL_IRQFD */
> > > +#define KVM_EOIFD_FLAG_LEVEL_IRQFD (1 << 1)
> > > +
> > > +struct kvm_eoifd {
> > > + __u32 fd;
> > > + __u32 flags;
> > > + __u32 key;
> > > + __u8 pad[20];
> > > +};
> > > +
> > > struct kvm_clock_data {
> > > __u64 clock;
> > > __u32 flags;
> > > @@ -834,6 +847,8 @@ struct kvm_s390_ucas_mapping {
> > > #define KVM_PPC_GET_SMMU_INFO _IOR(KVMIO, 0xa6, struct kvm_ppc_smmu_info)
> > > /* Available with KVM_CAP_PPC_ALLOC_HTAB */
> > > #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
> > > +/* Available with KVM_CAP_EOIFD */
> > > +#define KVM_EOIFD _IOW(KVMIO, 0xa8, struct kvm_eoifd)
> > >
> > > /*
> > > * ioctls for vcpu fds
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index c73f071..01e72a6 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -289,6 +289,10 @@ struct kvm {
> > > struct mutex lock;
> > > struct list_head items;
> > > } irqsources;
> > > + struct {
> > > + spinlock_t lock;
> > > + struct list_head items;
> > > + } eoifds;
> > > #endif
> > > struct kvm_vm_stat stat;
> > > struct kvm_arch arch;
> > > @@ -832,6 +836,8 @@ int kvm_irqfd(struct kvm *kvm, struct kvm_irqfd *args);
> > > void kvm_irqfd_release(struct kvm *kvm);
> > > void kvm_irq_routing_update(struct kvm *, struct kvm_irq_routing_table *);
> > > int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args);
> > > +int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args);
> > > +void kvm_eoifd_release(struct kvm *kvm);
> > >
> > > #else
> > >
> > > @@ -857,6 +863,13 @@ static inline int kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > > return -ENOSYS;
> > > }
> > >
> > > +static inline int kvm_eoifd(struct kvm *kvm, struct kvm_eoifd *args)
> > > +{
> > > + return -ENOSYS;
> > > +}
> > > +
> > > +static inline void kvm_eoifd_release(struct kvm *kvm) {}
> > > +
> > > #endif /* CONFIG_HAVE_KVM_EVENTFD */
> > >
> > > #ifdef CONFIG_KVM_APIC_ARCHITECTURE
> > > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > > index 878cb52..3aa2d62 100644
> > > --- a/virt/kvm/eventfd.c
> > > +++ b/virt/kvm/eventfd.c
> > > @@ -95,6 +95,25 @@ static struct _irq_source *_irq_source_alloc(struct kvm *kvm, int gsi)
> > > return source;
> > > }
> > >
> > > +static struct _irq_source *_irq_source_get_from_key(struct kvm *kvm, int key)
> > > +{
> > > + struct _irq_source *tmp, *source = ERR_PTR(-ENOENT);
> > > +
> > > + mutex_lock(&kvm->irqsources.lock);
> > > +
> > > + list_for_each_entry(tmp, &kvm->irqsources.items, list) {
> > > + if (tmp->id == key) {
> > > + source = tmp;
> > > + kref_get(&source->kref);
> > > + break;
> > > + }
> > > + }
> > > +
> > > + mutex_unlock(&kvm->irqsources.lock);
> > > +
> > > + return source;
> > > +}
> > > +
> > > /*
> > > * --------------------------------------------------------------------
> > > * irqfd: Allows an fd to be used to inject an interrupt to the guest
> > > @@ -406,6 +425,8 @@ kvm_eventfd_init(struct kvm *kvm)
> > > INIT_LIST_HEAD(&kvm->ioeventfds);
> > > mutex_init(&kvm->irqsources.lock);
> > > INIT_LIST_HEAD(&kvm->irqsources.items);
> > > + spin_lock_init(&kvm->eoifds.lock);
> > > + INIT_LIST_HEAD(&kvm->eoifds.items);
> > > }
> > >
> > > /*
> > > @@ -772,3 +793,318 @@ kvm_ioeventfd(struct kvm *kvm, struct kvm_ioeventfd *args)
> > >
> > > return kvm_assign_ioeventfd(kvm, args);
> > > }
> > > +
> > > +/*
> > > + * --------------------------------------------------------------------
> > > + * eoifd: Translate KVM APIC/IOAPIC EOI into eventfd signal.
> > > + *
> > > + * userspace can register with an eventfd for receiving
> > > + * notification when an EOI occurs.
> > > + * --------------------------------------------------------------------
> > > + */
> > > +
> > > +struct _eoifd {
> > > + /* eventfd triggered on EOI */
> > > + struct eventfd_ctx *eventfd;
> > > + /* irq source ID de-asserted on EOI */
> > > + struct _irq_source *source;
> > > + wait_queue_t wait;
> > > + /* EOI notification from KVM */
> > > + struct kvm_irq_ack_notifier notifier;
> > > + struct list_head list;
> > > + poll_table pt;
> > > + struct work_struct shutdown;
> > > +};
> > > +
> > > +/* Called under eoifds.lock */
> > > +static void eoifd_shutdown(struct work_struct *work)
> > > +{
> > > + struct _eoifd *eoifd = container_of(work, struct _eoifd, shutdown);
> > > + struct kvm *kvm = eoifd->source->kvm;
> > > + u64 cnt;
> > > +
> > > + /*
> > > + * Stop EOI signaling
> > > + */
> > > + kvm_unregister_irq_ack_notifier(kvm, &eoifd->notifier);
> > > +
> > > + /*
> > > + * Synchronize with the wait-queue and unhook ourselves to prevent
> > > + * further events.
> > > + */
> > > + eventfd_ctx_remove_wait_queue(eoifd->eventfd, &eoifd->wait, &cnt);
> > > +
> > > + /*
> > > + * Release resources
> > > + */
> > > + eventfd_ctx_put(eoifd->eventfd);
> > > + _irq_source_put(eoifd->source);
> > > + kfree(eoifd);
> > > +}
> > > +
> > > +/* assumes kvm->eoifds.lock is held */
> > > +static bool eoifd_is_active(struct _eoifd *eoifd)
> > > +{
> > > + return list_empty(&eoifd->list) ? false : true;
> > > +}
> > > +
> > > +/*
> > > + * Mark the eoifd as inactive and schedule it for removal
> > > + *
> > > + * assumes kvm->eoifds.lock is held
> > > + */
> > > +static void eoifd_deactivate(struct _eoifd *eoifd)
> > > +{
> > > + BUG_ON(!eoifd_is_active(eoifd));
> > > +
> > > + list_del_init(&eoifd->list);
> > > +
> > > + queue_work(irqfd_cleanup_wq, &eoifd->shutdown);
> > > +}
> > > +
> > > +/*
> > > + * Called with wqh->lock held and interrupts disabled
> > > + */
> > > +static int eoifd_wakeup(wait_queue_t *wait, unsigned mode, int sync, void *key)
> > > +{
> > > + unsigned long flags = (unsigned long)key;
> > > +
> > > + if (unlikely(flags & POLLHUP)) {
> > > + /* The eventfd is closing, detach from KVM */
> > > + struct _eoifd *eoifd = container_of(wait, struct _eoifd, wait);
> > > + struct kvm *kvm = eoifd->source->kvm;
> > > + unsigned long flags;
> > > +
> > > + spin_lock_irqsave(&kvm->eoifds.lock, flags);
> > > +
> > > + /*
> > > + * We must check if someone deactivated the eoifd before
> > > + * we could acquire the eoifds.lock since the item is
> > > + * deactivated from the KVM side before it is unhooked from
> > > + * the wait-queue. If it is already deactivated, we can
> > > + * simply return knowing the other side will cleanup for us.
> > > + * We cannot race against the eoifd going away since the
> > > + * other side is required to acquire wqh->lock, which we hold
> > > + */
> > > + if (eoifd_is_active(eoifd))
> > > + eoifd_deactivate(eoifd);
> > > +
> > > + spin_unlock_irqrestore(&kvm->eoifds.lock, flags);
> > > + }
> > > +
> > > + return 0;
> > > +}
> >
> > Looks like there is a bug here: if I close irqfd, then close eoifd,
> > the key is not immediately released so an attempt to create
> > an irqfd can fail to get the source id.
>
> Both irqfd and eoifd use the same workqueue for releasing objects and
> both flush on assign.
>
> > Maybe we should simply document that userspace should deassign
> > eoifd before closing it? This is what we do for ioeventfd.
> > If we do this, the whole polling code can go away completely.
>
> You're again ignoring the close problem. We cannot document around an
> impossible requirement that fds are always deassigned before close.
Well userspace can easily call a deassign ioctl. Why is it so important
that deassign is not required?
> IMHO ioeventfd is broken here and I don't wish to emulate it's behavior.
So fix ioeventfd first. Making eoifd and ioeventfd behave differently does not
make sense they are very similar.
> > > +
> > > +static void eoifd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> > > + poll_table *pt)
> > > +{
> > > + struct _eoifd *eoifd = container_of(pt, struct _eoifd, pt);
> > > + add_wait_queue(wqh, &eoifd->wait);
> > > +}
> > > +
> > > +/*
> > > + * This function is called as the kvm VM fd is being released. Shutdown all
> > > + * eoifds that still remain open
> > > + */
> > > +void kvm_eoifd_release(struct kvm *kvm)
> > > +{
> > > + struct _eoifd *tmp, *eoifd;
> > > +
> > > + spin_lock_irq(&kvm->eoifds.lock);
> > > +
> > > + list_for_each_entry_safe(eoifd, tmp, &kvm->eoifds.items, list)
> > > + eoifd_deactivate(eoifd);
> > > +
> > > + spin_unlock_irq(&kvm->eoifds.lock);
> > > +
> > > + flush_workqueue(irqfd_cleanup_wq);
> > > +}
> > > +
> > > +static void eoifd_event(struct kvm_irq_ack_notifier *notifier)
> > > +{
> > > + struct _eoifd *eoifd;
> > > +
> > > + eoifd = container_of(notifier, struct _eoifd, notifier);
> > > +
> > > + if (unlikely(!eoifd->source))
> > > + return;
> > > +
> > > + /*
> > > + * De-assert and send EOI, user needs to re-assert if
> > > + * device still requires service.
> > > + */
> >
> > I'm not sure why did you drop filtering by source id.
> > This means userspace gets events even if it did not send an interrupt.
> > So
> > 1. Should be documented that you can get spurious events
> > 2. when an interrupt is shared with an emulated device,
> > and said device uses EOI, this will not
> > perform well as we will wake up userspace on each EOI.
> > 3. Just sharing interrupt with virtio means we are polling
> > assigned device on each virtio interrupt.
>
> Didn't we just agree after v5 that filtering requires a spinlock around
> around calling kvm_irq_set or at least a new interface to setting irqs
> that allows us to see the current assertion state and that neither of
> those seem to be worth the effort for level irqs? That's why I dropped
> it. Interrupts always have to support spurious events. The comment
> immediately above indicates this. Legacy interrupts, especially shared
> legacy interrupts should not be our primary performance path. VFIO has
> a very efficient path for handling spurious EOIs.
But it will not help that vfio does this efficiently if userspace
is woken up. You need to make it efficient for userspace consumers.
Otherwise it's a vfio specific interface.
--
MST
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/