Re: [KVM-RFC PATCH 1/2] eventfd: add an explicit srcu based notifierinterface

From: Gregory Haskins
Date: Wed Jun 17 2009 - 15:18:04 EST

Davide Libenzi wrote:
> On Wed, 17 Jun 2009, Gregory Haskins wrote:
>> Can you elaborate? I currently do not see how I could do the proposed
>> concept inside of irqfd while still using eventfd. Of course, that
>> would be possible if we fork irqfd from eventfd, and perhaps this is
>> what you are proposing. As previously stated I don't want to give up on
>> the prospect of re-using it quite yet, so bear with me. :)
>> The issue with eventfd, as I see it, is that eventfd uses a
>> spin_lock_irqsave (by virtue of the wait-queue stuff) across the
>> "signal" callback (which today is implemented as a wake-up). This
>> spin_lock implicitly creates a non-preemptible critical section that
>> occurs independently of whether eventfd_signal() itself is invoked from
>> a sleepable context or not.
>> What I strive to achieve is to remove the creation of this internal
>> critical section. If eventfd_signal() is called from atomic context, so
>> be it. We will detect this in the callback and be forced to take the
>> slow-path, and I am ok with that. *But*, if eventfd_signal() (or
>> f_ops->write(), for that matter) are called from a sleepable context
>> *and* eventfd doesn't introduce its own critical section (such as with
>> my srcu patch), we can potentially optimize within the callback by
>> executing serially instead of deferring (e.g. via a workqueue).
> Since when the scheduling (assuming it's not permanently running on
> another core due to high frequency work post) of a kernel thread is such
> a big impact that interfaces need to be redesigned for that?

Low-latency applications, for instance, care about this. Even one
context switch can add a substantial overhead.

> How much the (possible, but not certain) kernel thread context switch time
> weighs in the overall KVM IRQ service time?

Generally each one is costing me about 7us on average. For something
like high-speed networking, we have a path that has about 30us of
base-line overhead. So one additional ctx-switch puts me at base+7 ( =
~37us), two puts me in base+2*7 (= ~44us). So in that context (no pun
intended ;), it hurts quite a bit. I'll be the first to admit that not
everyone (most?) will care about latency, though. But FWIW, I do.

>> It can! :) This is not changing from whats in mainline today (covered
>> above).
> It can/could, if the signal() function takes very accurate care of doing
> the magic atomic check.

True, but thats the notifiee's burden, not eventfd's. And its always
going to be opt-in. Even today, someone is free to either try to sleep
(which will oops on the might_sleep()), or try to check if they can
sleep (it will always say they can't due to the eventfd wqh spinlock).
The only thing that changes with my patch is that someone that opts in
to check if they can sleep may find that they sometimes (mostly?) can.

In any case, I suspect that there are no other clients of eventfd other
than standard wait-queue sleepers, and irqfd. The wake_up() code is
never going to care anyway, so this really comes down to future users of
the notification interfaces (irqfd today, iosignalfd in the near-term).
Therefore, there really aren't any legacy issues to deal with that I am
aware of, even if they mattered.

Thanks Davide,

Attachment: signature.asc
Description: OpenPGP digital signature