Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifierinterface

From: Gregory Haskins
Date: Wed Jun 17 2009 - 17:48:30 EST

Davide Libenzi wrote:
> On Wed, 17 Jun 2009, Gregory Haskins wrote:
>> Davide Libenzi wrote:
>>> How much the (possible, but not certain) kernel thread context switch time
>>> weighs in the overall KVM IRQ service time?
>> Generally each one is costing me about 7us on average. For something
>> like high-speed networking, we have a path that has about 30us of
>> base-line overhead. So one additional ctx-switch puts me at base+7 ( =
>> ~37us), two puts me in base+2*7 (= ~44us). So in that context (no pun
>> intended ;), it hurts quite a bit. I'll be the first to admit that not
>> everyone (most?) will care about latency, though. But FWIW, I do.
> And how a frame reception is handled in Linux nowadays?

I am not clear on what you are asking here. So in case this was a
sincere question on how things work, here is a highlight of the typical
flow of a packet that ingresses on a physical adapter and routes to KVM
via vbus.

a) interrupt from eth to host wakes the cpu out of idle, enters
b) ISR runs, disables interrupts on the eth, schedules softirq-net-rx
c) ISR completes, kernel checks softirq state before IRET, runs pending
softirq-net-rx in interrupt context to NAPI-poll the ethernet
d) packet is pulled out of eth into a layer-2 bridge, and switched to
the appropriate switch-port (which happens to be a venet-tap (soon to be
virtio-net based) device. The packet is queued to the tap as an xmit
request, and the tap's tx-thread is awoken.
e) the xmit call returns, the napi-poll completes, and the
softirq-net-rx terminates. The kernel does an IRET to exit interrupt
f) the scheduler runs and sees the tap's tx-thread is ready to run,
schedules it in.
g) the tx-thread awakens, dequeues the posted skb, copies it to the
virtio-ring, and finally raises an interrupt on irqfd with eventfd_signal().

At this point, all of the data has been posted to the virtio-ring in
shared memory between the host and guest. All that is left is to inject
the interrupt so the guest knows to process the ring. We call the
eventfd_signal() from kthread context. However, the callback to inject
the interrupt is invoked with the wqh spinlock held so we are forced to
defer the interrupt injection to a workqueue so the kvm->lock can be
safely acquired. This adds additional latency (~7us) in a path that is
only a handful of microseconds to being with. I can post LTTV
screenshots, if it would be helpful to visualize this.

The reasons we need the workqueue is:

A) kvm is currently implemented with a mutex for IRQ injection, so its
B) the wqh used in eventfd wake_up() creates an non-preemptible section
when the callback is invoked.

We can (and should) address (A), but this is but one example in a common
pattern. We will run into similar issues in other places (such as with
iosignalfd), so I would like to address (B) as well. My patch attempts
to do that by re-implementing the callback mechanism as something other
than wait-queue based. It then adds a wait-queue as a default client of
the new interface.

Therefore, the eventfd clients that want traditional vanilla wakeups can
go right ahead and use the non-preemptible wait-queue methods as they
always have. However, the clients that want (potentially) preemptive
callbacks can use the new interface: eventfd_notifier_register(), armed
with the knowledge that it can only sleep if the eventfd_signal() caller
was not in-atomic at the time. Thus the dynamic state check ala
preemptible(). Without the check, the client should assume it is not
preemptible, as always.
>> True, but thats the notifiee's burden, not eventfd's. And its always
>> going to be opt-in. Even today, someone is free to either try to sleep
>> (which will oops on the might_sleep()), ...
> No, today you just can't sleep. As you can't sleep in any
> callback-registered wakeups, like epoll, for example.

Right, understood, and note that this is precisely why I said it would
oops. What I was specifically trying to highlight is that its not like
this change imposes new requirements on the existing callbacks. I also
wanted to highlight that its not really eventfd's concern what the
callback tries to do, per se (if it wants to sleep it can try, it just
wont work). Any reasonable wakeup callback in existence would already
assume it cannot sleep, and they wouldn't even try to begin with.

On the other hand, what I am introducing here (specifically to eventfd
callbacks, not wait-queues [*]) is the possibility of removing this
restriction under the proper circumstances. It would only be apparent
of this change iff the callback in question tried to test for this (e.g.
checking preemptible()) state. It is thus opt-in, and existing code
does not break.

Thanks Davide,

[*]: I contemplated solving this problem generally at the wait-queue
level, but quickly abandoned the idea. The reason is that something
like *RCU has the properties of being particularly lightweight in the
fast/common/read-path, but being particularly heavyweight in the
slow/uncommon/writer-path. Contrast that with something like a
traditional lock/mutex which is middleweight in all circumstances.

Something like a wait-queue has a heavy dose of both "writers" (tasks
being added to the queue) as well as "readers" (walking the queue to
invoke callbacks). RCU would be particularly painful here. Contrast
this to the way I wrote the patch for eventfd: The "writer" path is
pretty rare ("add me to be notified" generally happens once at eventfd
init), whereas the "reader" path ("tell me when someone signals") is
fairly common. RCU is a good candidate here to get rid of the spinlock
held during the walk, while still maintaining thread-saftey w.r.t. the
notifier list. Therefore I think updating eventfd makes more sense than
the more general case of wait-queues. More importantly, I do not think
this compromises the integrity of eventfd. Its applicable for a general
signaling idiom, IMHO, and its why I am comfortable pushing the patch
out to you.

Attachment: signature.asc
Description: OpenPGP digital signature