Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifierinterface

From: Davide Libenzi
Date: Wed Jun 17 2009 - 19:28:37 EST

On Wed, 17 Jun 2009, Gregory Haskins wrote:

> I am not clear on what you are asking here. So in case this was a
> sincere question on how things work, here is a highlight of the typical
> flow of a packet that ingresses on a physical adapter and routes to KVM
> via vbus.
> a) interrupt from eth to host wakes the cpu out of idle, enters
> interrupt-context.
> b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
> c) ISR completes, kernel checks softirq state before IRET, runs pending
> softirq-net-rx in interrupt context to NAPI-poll the ethernet
> d) packet is pulled out of eth into a layer-2 bridge, and switched to
> the appropriate switch-port (which happens to be a venet-tap (soon to be
> virtio-net based) device. The packet is queued to the tap as an xmit
> request, and the tap's tx-thread is awoken.
> e) the xmit call returns, the napi-poll completes, and the
> softirq-net-rx terminates. The kernel does an IRET to exit interrupt
> context.
> f) the scheduler runs and sees the tap's tx-thread is ready to run,
> schedules it in.
> g) the tx-thread awakens, dequeues the posted skb, copies it to the
> virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().

Heh, Gregory, this isn't a job interview. You didn't have to actually
detail everything ;) Glad you did though, so we've something to talk

> At this point, all of the data has been posted to the virtio-ring in
> shared memory between the host and guest. All that is left is to inject
> the interrupt so the guest knows to process the ring. We call the
> eventfd_signal() from kthread context. However, the callback to inject
> the interrupt is invoked with the wqh spinlock held so we are forced to
> defer the interrupt injection to a workqueue so the kvm->lock can be
> safely acquired. This adds additional latency (~7us) in a path that is
> only a handful of microseconds to being with. I can post LTTV
> screenshots, if it would be helpful to visualize this.

So, what you're trying to say, is that the extra wakeup required by your
schedule_work() processing, makes actually a difference in the time it
takes to go from a) to g), where there are at least two other kernel
thread wakeups?
Don't think in terms of microbenchs, think in how much are those 7us (are
they? really? this is a sync, on-cpu, kernel thread, wake up) are
impacting the whole path.
Everything looks worthwhile in microbenches.

> Right, understood, and note that this is precisely why I said it would
> oops. What I was specifically trying to highlight is that its not like
> this change imposes new requirements on the existing callbacks. I also
> wanted to highlight that its not really eventfd's concern what the
> callback tries to do, per se (if it wants to sleep it can try, it just
> wont work). Any reasonable wakeup callback in existence would already
> assume it cannot sleep, and they wouldn't even try to begin with.
> On the other hand, what I am introducing here (specifically to eventfd
> callbacks, not wait-queues [*]) is the possibility of removing this
> restriction under the proper circumstances. It would only be apparent
> of this change iff the callback in question tried to test for this (e.g.
> checking preemptible()) state. It is thus opt-in, and existing code
> does not break.

The interface is just ugly IMO. You have eventfd_signal() that can sleep,
or not, depending on the registered ->signal() function implementations.
This is pretty bad, a lot worse than the theoretical us spent in the
schedule_work() processing.

- Davide

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at