[KVM PATCH v3 2/3] eventfd: add internal reference counting to fix notifier race conditions

From: Gregory Haskins
Date: Mon Jun 22 2009 - 12:06:35 EST

eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
"release" callback. This lets eventfd clients know if the eventfd is about
to go away and is very useful particularly for in-kernel clients. However,
as it stands today it is not possible to use this feature of eventfd in a
race-free way. This patch adds some additional logic to eventfd in order
to rectify this problem.

Eventfd currently only has one reference count mechanism: fget/fput. This
in and of itself is normally fine. However, if a client expects to be
notified if the eventfd is closed, it cannot hold a fget() reference
itself or the underlying f_ops->release() callback will never be invoked
by VFS. Therefore we have this somewhat unusual situation where we may
hold a pointer to an eventfd object (by virtue of having a waiter registered
in its wait-queue), but no reference. To make matters more complicated,
the release callback is issued in an unlocked state. This makes it nearly
impossible to design a mutual decoupling algorithm: you cannot unhook one
side from the other (or vice versa) without racing.


In summary, there are two fundamental problems:

1) The POLLHUP wakeup is broadcast without holding the wait-queue lock
2) There are no references to the wait-queue-head (embedded in eventfd_ctx)

We fix this by issuing the POLLHUP wakeup with the variant that takes the
wait-queue lock (wake_up_poll() instead of wake_up_locked_poll()), and by
adding/exposing a kref to the underlying eventfd_ctx. Clients should then
be able to govern their usage of the wait-queue as they do for any other
wait-queue in the kernel.
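
The lifetime rule this introduces can be illustrated with a small userspace
model. This is only a sketch, not kernel code: model_kref, model_ctx and the
model_* helpers are hypothetical stand-ins for struct kref, eventfd_ctx and
the get/put calls, and the counter is a plain int rather than the atomic
that the real kref uses. It assumes the context starts with one reference
owned by the file.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct kref (not atomic). */
struct model_kref {
	int refcount;
};

/* Hypothetical stand-in for struct eventfd_ctx. */
struct model_ctx {
	struct model_kref kref;
	int *freed_flag;	/* set to 1 when the context is released */
};

/* Create a context holding one reference, owned by the "file". */
static struct model_ctx *model_ctx_create(int *freed_flag)
{
	struct model_ctx *ctx = malloc(sizeof(*ctx));

	if (!ctx)
		return NULL;
	ctx->kref.refcount = 1;
	ctx->freed_flag = freed_flag;
	*freed_flag = 0;
	return ctx;
}

/* Models eventfd_kref_get(): a client pins the context. */
static void model_get(struct model_ctx *ctx)
{
	ctx->kref.refcount++;
}

/*
 * Models eventfd_kref_put(): drop one reference and free the context
 * when the last one goes away. Returns 1 if the context was freed.
 */
static int model_put(struct model_ctx *ctx)
{
	assert(ctx->kref.refcount > 0);
	if (--ctx->kref.refcount > 0)
		return 0;
	*ctx->freed_flag = 1;
	free(ctx);
	return 1;
}
```

In this model, a client that called model_get() keeps the context alive
after the file's own model_put(); the memory goes away only on the
client's matching model_put().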

We propose this more raw solution rather than trying to encapsulate the
poll-callback because there are advantages to decoupling the
remove_wait_queue from the kref_put(). Namely, it's nice to unhook the
wait-queue inside the wakeup, but to defer the kref_put() until we can
synchronize with the client.
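
A rough userspace sketch of that two-phase teardown follows. It is a model
under stated assumptions, not the patch's code: every model_* name is
hypothetical, the wait-queue is reduced to a waiter count, and no real
locking is shown. Phase 1 stands in for the POLLHUP wakeup callback, which
only unhooks the waiter. Phase 2 stands in for the later point at which the
client has synchronized and can drop its reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for eventfd_ctx: a refcount plus a waiter count. */
struct model_ctx {
	int refcount;
	int waiters;
	int freed;
};

/* Hypothetical in-kernel client state. */
struct model_client {
	struct model_ctx *ctx;
	int hooked;		/* still on the wait-queue? */
	int put_pending;	/* reference drop deferred to phase 2 */
};

static void model_put(struct model_ctx *ctx)
{
	assert(ctx->refcount > 0);
	if (--ctx->refcount == 0)
		ctx->freed = 1;
}

/* Take a reference, then register on the wait-queue. */
static void model_attach(struct model_client *c, struct model_ctx *ctx)
{
	ctx->refcount++;	/* models eventfd_kref_get() */
	ctx->waiters++;		/* models add_wait_queue() */
	c->ctx = ctx;
	c->hooked = 1;
	c->put_pending = 0;
}

/* Phase 1: runs inside the POLLHUP wakeup; unhook only, no put. */
static void model_on_hup(struct model_client *c)
{
	if (!c->hooked)
		return;
	c->ctx->waiters--;	/* models remove_wait_queue() */
	c->hooked = 0;
	c->put_pending = 1;
}

/* Phase 2: runs once the client has quiesced; drop the reference. */
static void model_client_sync(struct model_client *c)
{
	if (!c->put_pending)
		return;
	c->put_pending = 0;
	model_put(c->ctx);	/* models eventfd_kref_put() */
	c->ctx = NULL;
}

/* Models f_ops->release(): wake waiters with POLLHUP, drop file's ref. */
static void model_file_release(struct model_ctx *ctx, struct model_client *c)
{
	model_on_hup(c);
	model_put(ctx);
}
```

After model_file_release() the client is off the wait-queue but the context
is still valid, since the client's reference is outstanding until
model_client_sync() runs.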

Between these points, we believe we now have a race-free release mechanism.

Signed-off-by: Gregory Haskins <ghaskins@xxxxxxxxxx>
CC: Davide Libenzi <davidel@xxxxxxxxxxxxxxx>

fs/eventfd.c | 43 ++++++++++++++++++++++++++++++++++++-------
include/linux/eventfd.h | 7 +++++++
2 files changed, 43 insertions(+), 7 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index 72f5f8d..4806116 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -17,8 +17,10 @@
#include <linux/eventfd.h>
#include <linux/syscalls.h>
#include <linux/module.h>
+#include <linux/kref.h>

struct eventfd_ctx {
+ struct kref kref;
wait_queue_head_t wqh;
* Every time that a write(2) is performed on an eventfd, the
@@ -59,17 +61,24 @@ int eventfd_signal(struct file *file, int n)

+static void _eventfd_release(struct kref *kref)
+{
+	struct eventfd_ctx *ctx = container_of(kref, struct eventfd_ctx, kref);
+
+	kfree(ctx);
+}
+
+static void _eventfd_put(struct kref *kref)
+{
+	kref_put(kref, &_eventfd_release);
+}
+
 static int eventfd_release(struct inode *inode, struct file *file)
 {
 	struct eventfd_ctx *ctx = file->private_data;
 
-	/*
-	 * No need to hold the lock here, since we are on the file cleanup
-	 * path and the ones still attached to the wait queue will be
-	 * serialized by wake_up_locked_poll().
-	 */
-	wake_up_locked_poll(&ctx->wqh, POLLHUP);
-	kfree(ctx);
+	wake_up_poll(&ctx->wqh, POLLHUP);
+	_eventfd_put(&ctx->kref);
 	return 0;
 }

@@ -209,6 +218,26 @@ struct file *eventfd_fget(int fd)

+struct kref *eventfd_kref_get(struct file *file)
+{
+	struct eventfd_ctx *ctx;
+
+	if (file->f_op != &eventfd_fops)
+		return ERR_PTR(-EINVAL);
+
+	ctx = file->private_data;
+	kref_get(&ctx->kref);
+	return &ctx->kref;
+}
+
+void eventfd_kref_put(struct kref *kref)
+{
+	_eventfd_put(kref);
+}
+
 SYSCALL_DEFINE2(eventfd2, unsigned int, count, int, flags)
 {
 	int fd;
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index f45a8ae..c0396b3 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -8,6 +8,8 @@

+#include <linux/kref.h>

@@ -28,11 +30,16 @@

struct file *eventfd_fget(int fd);
+struct kref *eventfd_kref_get(struct file *file);
+void eventfd_kref_put(struct kref *kref);
int eventfd_signal(struct file *file, int n);

#else /* CONFIG_EVENTFD */

#define eventfd_fget(fd) ERR_PTR(-ENOSYS)
+#define eventfd_kref_get(file) ERR_PTR(-ENOSYS)
+static inline void eventfd_kref_put(struct kref *kref)
+{ }
static inline int eventfd_signal(struct file *file, int n)
{ return 0; }
