Re: [patch 10/13] signal/timer/event fds v8 - eventfd core ...

From: Andrew Morton
Date: Fri Mar 30 2007 - 15:41:54 EST


On Tue, 20 Mar 2007 11:37:14 -0700
Davide Libenzi <davidel@xxxxxxxxxxxxxxx> wrote:

> This is a very simple and light file descriptor, that can be used as
> event wait/dispatch by userspace (both wait and dispatch) and by the
> kernel (dispatch only). It can be used instead of pipe(2) in all cases
> where those would simply be used to signal events. Their kernel overhead
> is much lower than pipes, and they do not consume two fds. When used in
> the kernel, it can offer an fd-bridge to enable, for example, functionalities
> like KAIO or syslets/threadlets to signal to an fd the completion of certain
> operations. But more in general, an eventfd can be used by the kernel to
> signal readiness, in a POSIX poll/select way, of interfaces that would
> otherwise be incompatible with it. The API is:
>
> int eventfd(unsigned int count);
>
> The eventfd API accepts an initial "count" parameter, and returns an
> eventfd fd. It supports poll(2) (POLLIN, POLLOUT, POLLERR), read(2) and write(2).
> The POLLIN flag is raised when the internal counter is greater than zero.
> The POLLOUT flag is raised when at least a value of "1" can be written to
> the internal counter.
> The POLLERR flag is raised when an overflow in the counter value is detected.
> The write(2) operation can never overflow the counter, since it blocks
> (unless O_NONBLOCK is set, in which case -EAGAIN is returned).
> But the eventfd_signal() function can do it, since it's supposed to not
> sleep during its operation.
> The read(2) function reads the __u64 counter value, and reset the internal
> value to zero. If the value read is equal to (__u64) -1, an overflow
> happened on the internal counter (due to 2^64 eventfd_signal() posts
> that has never been retired - unlickely, but possible).
> The write(2) call writes an __u64 count value, and adds it
> to the current counter. The eventfd fd supports O_NONBLOCK also.
> On the kernel side, we have:
>
> struct file *eventfd_fget(int fd);
> int eventfd_signal(struct file *file, unsigned int n);
>
> The eventfd_fget() should be called to get a struct file* from an eventfd
> fd (this is an fget() + check of f_op being an eventfd fops pointer).
> The kernel can then call eventfd_signal() every time it wants to post
> an event to userspace. The eventfd_signal() function can be called from any
> context.
> An eventfd() simple test and bench is available here:
>
> http://www.xmailserver.org/eventfd-bench.c
>
> This is the eventfd-based version of pipetest-4 (pipe(2) based):
>
> http://www.xmailserver.org/pipetest-4.c
>
> Not that performance matters much in the eventfd case, but eventfd-bench
> shows almost as double as performance than pipetest-4.
>
>...
>
> Index: linux-2.6.21-rc3.quilt/fs/eventfd.c
> ===================================================================
> --- /dev/null 1970-01-01 00:00:00.000000000 +0000
> +++ linux-2.6.21-rc3.quilt/fs/eventfd.c 2007-03-19 19:01:58.000000000 -0700
> @@ -0,0 +1,237 @@
> +/*
> + * fs/eventfd.c
> + *
> + * Copyright (C) 2007 Davide Libenzi <davidel@xxxxxxxxxxxxxxx>
> + *
> + */
> +
> +#include <linux/file.h>
> +#include <linux/poll.h>
> +#include <linux/slab.h>
> +#include <linux/init.h>
> +#include <linux/fs.h>
> +#include <linux/mount.h>
> +#include <linux/module.h>
> +#include <linux/kernel.h>
> +#include <linux/list.h>
> +#include <linux/spinlock.h>
> +#include <linux/jiffies.h>
> +#include <linux/anon_inodes.h>
> +#include <linux/eventfd.h>
> +
> +#include <asm/uaccess.h>
> +
> +
> +
> +struct eventfd_ctx {
> + spinlock_t lock;
> + wait_queue_head_t wqh;
> + __u64 count;
> +};

Again, can we borrow wqh.lock?

`count' needs documentation - these things are key to understanding the
code.

> +
> +
> +static int eventfd_close(struct inode *inode, struct file *file);
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait);
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> + loff_t *ppos);
> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> + loff_t *ppos);

dittos

> +
> +
> +

dittos

> +static const struct file_operations eventfd_fops = {
> + .release = eventfd_close,

dittos

> + .poll = eventfd_poll,
> + .read = eventfd_read,
> + .write = eventfd_write,
> +};
> +
> +
> +
> +struct file *eventfd_fget(int fd)
> +{
> + struct file *file;
> +
> + file = fget(fd);
> + if (!file)
> + return ERR_PTR(-EBADF);
> + if (file->f_op != &eventfd_fops) {
> + fput(file);
> + return ERR_PTR(-EINVAL);
> + }
> +
> + return file;
> +}
> +
> +/*
> + * This function is supposed to be called by the kernel in paths
> + * that do not allow to sleep. In this function we allow the counter

"that do not allow sleeping"

> + * to reach the ULLONG_MAX value, and we signal this as overflow
> + * condition by returining a POLLERR to poll(2).

typo

> + */

So it is the caller's responsibility to ensure that *file refers to an
eventfd file?

> +int eventfd_signal(struct file *file, int n)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> + unsigned long flags;
> +
> + if (n < 0)
> + return -EINVAL;
> + spin_lock_irqsave(&ctx->lock, flags);
> + if (ULLONG_MAX - ctx->count < n)
> + n = (int) (ULLONG_MAX - ctx->count);
> + ctx->count += n;
> + if (waitqueue_active(&ctx->wqh))
> + wake_up_locked(&ctx->wqh);
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + return n;
> +}

Neither the incoming arg (usefully named "n") nor the return value are
documented.


> +asmlinkage long sys_eventfd(unsigned int count)
> +{
> + int error, fd;
> + struct eventfd_ctx *ctx;
> + struct file *file;
> + struct inode *inode;
> +
> + ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + return -ENOMEM;
> +
> + init_waitqueue_head(&ctx->wqh);
> + spin_lock_init(&ctx->lock);
> + ctx->count = count;
> +
> + /*
> + * When we call this, the initialization must be complete, since
> + * aino_getfd() will install the fd.
> + */
> + error = aino_getfd(&fd, &inode, &file, "[eventfd]",
> + &eventfd_fops, ctx);
> + if (!error)
> + return fd;
> +
> + kfree(ctx);
> + return error;
> +}

Documentation?

> +static int eventfd_close(struct inode *inode, struct file *file)
> +{
> + kfree(file->private_data);
> + return 0;
> +}
> +
> +static unsigned int eventfd_poll(struct file *file, poll_table *wait)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> + unsigned int events = 0;
> + unsigned long flags;
> +
> + poll_wait(file, &ctx->wqh, wait);
> +
> + spin_lock_irqsave(&ctx->lock, flags);
> + if (ctx->count > 0)
> + events |= POLLIN;
> + if (ctx->count == ULLONG_MAX)
> + events |= POLLERR;
> + if (ULLONG_MAX - 1 > ctx->count)
> + events |= POLLOUT;
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +
> + return events;
> +}
> +
> +static ssize_t eventfd_read(struct file *file, char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> + ssize_t res;
> + __u64 ucnt;
> + DECLARE_WAITQUEUE(wait, current);
> +
> + if (count < sizeof(ucnt))
> + return -EINVAL;
> + spin_lock_irq(&ctx->lock);
> + res = -EAGAIN;
> + ucnt = ctx->count;
> + if (ucnt > 0)
> + res = sizeof(ucnt);
> + else if (!(file->f_flags & O_NONBLOCK)) {
> + __add_wait_queue(&ctx->wqh, &wait);
> + for (res = 0;;) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + if (ctx->count > 0) {
> + ucnt = ctx->count;
> + res = sizeof(ucnt);
> + break;
> + }
> + if (signal_pending(current)) {
> + res = -ERESTARTSYS;
> + break;
> + }
> + spin_unlock_irq(&ctx->lock);
> + schedule();
> + spin_lock_irq(&ctx->lock);
> + }
> + __remove_wait_queue(&ctx->wqh, &wait);
> + __set_current_state(TASK_RUNNING);
> + }
> + if (res > 0) {
> + ctx->count = 0;
> + if (waitqueue_active(&ctx->wqh))
> + wake_up_locked(&ctx->wqh);
> + }
> + spin_unlock_irq(&ctx->lock);
> + if (res > 0 && put_user(ucnt, (__u64 __user *) buf))
> + return -EFAULT;
> +
> + return res;
> +}

Needs interface documentation, please. Even the changelog doesn't tell us
what an EAGAIN return from read() means.

> +static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
> + loff_t *ppos)
> +{
> + struct eventfd_ctx *ctx = file->private_data;
> + ssize_t res;
> + __u64 ucnt;
> + DECLARE_WAITQUEUE(wait, current);
> +
> + if (count < sizeof(ucnt))
> + return -EINVAL;
> + if (get_user(ucnt, (const __u64 __user *) buf))
> + return -EFAULT;

Some architectures do not implement 64-bit get_user()

> + if (ucnt == ULLONG_MAX)
> + return -EINVAL;
> + spin_lock_irq(&ctx->lock);
> + res = -EAGAIN;
> + if (ULLONG_MAX - ctx->count > ucnt)
> + res = sizeof(ucnt);
> + else if (!(file->f_flags & O_NONBLOCK)) {
> + __add_wait_queue(&ctx->wqh, &wait);
> + for (res = 0;;) {
> + set_current_state(TASK_INTERRUPTIBLE);
> + if (ULLONG_MAX - ctx->count > ucnt) {
> + res = sizeof(ucnt);
> + break;
> + }
> + if (signal_pending(current)) {
> + res = -ERESTARTSYS;
> + break;
> + }
> + spin_unlock_irq(&ctx->lock);
> + schedule();
> + spin_lock_irq(&ctx->lock);
> + }
> + __remove_wait_queue(&ctx->wqh, &wait);
> + __set_current_state(TASK_RUNNING);
> + }
> + if (res > 0) {
> + ctx->count += ucnt;
> + if (waitqueue_active(&ctx->wqh))
> + wake_up_locked(&ctx->wqh);
> + }
> + spin_unlock_irq(&ctx->lock);
> +
> + return res;
> +}
> +

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/