Re: [PATCH 3/3] eventfd: add internal reference counting to fix notifier race conditions

From: Gregory Haskins
Date: Fri Jun 19 2009 - 17:50:12 EST

Davide Libenzi wrote:
> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>> Davide Libenzi wrote:
>>> On Fri, 19 Jun 2009, Gregory Haskins wrote:
>>>> eventfd currently emits a POLLHUP wakeup on f_ops->release() to generate a
>>>> notifier->release() callback. This lets notification clients know if
>>>> the eventfd is about to go away and is very useful particularly for
>>>> in-kernel clients. However, as it stands today it is not possible to
>>>> use the notification API in a race-free way. This patch adds some
>>>> additional logic to the notification subsystem to rectify this problem.
>>>> Background:
>>>> -----------------------
>>>> Eventfd currently only has one reference count mechanism: fget/fput. This
>>>> in and of itself is normally fine. However, if a client expects to be
>>>> notified if the eventfd is closed, it cannot hold a fget() reference
>>>> itself or the underlying f_ops->release() callback will never be invoked
>>>> by VFS. Therefore we have this somewhat unusual situation where we may
>>>> hold a pointer to an eventfd object (by virtue of having a waiter registered
>>>> in its wait-queue), but no reference. This makes it nearly impossible to
>>>> design a mutual decoupling algorithm: you cannot unhook one side from the
>>>> other (or vice versa) without racing.
>>> And why is that?
>>> struct xxx {
>>>     struct mutex mtx;
>>>     struct file *file;
>>>     ...
>>> };
>>>
>>> struct file *xxx_get_file(struct xxx *x) {
>>>     struct file *file;
>>>
>>>     mutex_lock(&x->mtx);
>>>     file = x->file;
>>>     if (!file)
>>>         mutex_unlock(&x->mtx);
>>>     return file;
>>> }
>>>
>>> void xxx_release_file(struct xxx *x) {
>>>     mutex_unlock(&x->mtx);
>>> }
>>>
>>> void handle_POLLHUP(struct xxx *x) {
>>>     struct file *file;
>>>
>>>     file = xxx_get_file(x);
>>>     if (file) {
>>>         unhook_waitqueue(file, ...);
>>>         x->file = NULL;
>>>         xxx_release_file(x);
>>>     }
>>> }
>>>
>>> Every time you need to "use" file, you call xxx_get_file(), and if you get
>>> NULL, it means it's gone and you handle it according to your IRQ fd
>>> policies. As soon as you are done with the file, you call xxx_release_file().
>>> Replace "mtx" with the lock that fits your needs.
>> Consider what would happen if the f_ops->release() was preempted inside
>> the wake_up_locked_polled() after it dereferenced the xxx from the list,
>> but before it calls the callback(POLLHUP). The xxx object, and/or the
>> .text for the xxx object may be long gone by the time it comes back
>> around. Afaict, there is no way to guard against that scenario unless
>> you do something like 2/3+3/3. Or am I missing something?
> Right. Don't you see an easier answer to that, instead of adding 300 lines
> of code to eventfd?

I tried, but this problem made my head hurt and this was what I came up
with that I felt closes the holes all the way. Also keep in mind that
while I added X lines to eventfd, I took Y lines *out* of irqfd in the
process, too. I just excluded the KVM portions in this thread per your
request, so it's not apparent. But technically, any other clients that
may come along can reuse that notification code instead of coding it
again. One way or the other, *someone* has to do that ptable_proc stuff
;) FYI: It's more like 133 lines, fwiw.

fs/eventfd.c | 104
include/linux/eventfd.h | 36 ++++++++++++++++
2 files changed, 133 insertions(+), 7 deletions(-)
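
For anyone following along, the core idea of an internal reference count
that is independent of fget/fput can be sketched in userspace roughly like
this. This is only an illustration under my own assumptions, not the code
from the patch: notifier_ctx, ctx_alloc(), ctx_get() and ctx_put() are
hypothetical names, and C11 atomics stand in for whatever primitive the
kernel side would actually use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the object a notifier client points at. */
struct notifier_ctx {
    atomic_int refs;    /* internal count, separate from fget/fput */
    bool released;      /* set once the "file" side has gone away */
};

/* Allocate with one reference, owned by the "file" side. */
struct notifier_ctx *ctx_alloc(void)
{
    struct notifier_ctx *c = malloc(sizeof(*c));

    if (!c)
        return NULL;
    atomic_init(&c->refs, 1);
    c->released = false;
    return c;
}

/* A client takes its own reference before registering a waiter. */
void ctx_get(struct notifier_ctx *c)
{
    atomic_fetch_add(&c->refs, 1);
}

/* Drop one reference; returns true if this call freed the object. */
bool ctx_put(struct notifier_ctx *c)
{
    if (atomic_fetch_sub(&c->refs, 1) == 1) {
        free(c);
        return true;
    }
    return false;
}
```

The point of the sketch is only that the memory outlives the release on the
"file" side for as long as a client still holds its own reference, so the
client never ends up with a pointer but no reference.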

In case you care, here's what the complete solution currently looks like
when I include KVM:

fs/eventfd.c | 104 +++++++++++++++++++++++++--
include/linux/eventfd.h | 36 +++++++++
virt/kvm/eventfd.c | 181
3 files changed, 228 insertions(+), 93 deletions(-)

> For example, turning wake_up_locked() into a normal wake_up().

I am fairly confident it is not that simple after having thought about
this issue over the last few days. But I've been wrong in the past.
Propose a patch and I will review it for races/correctness, if you
like. Perhaps a combination of that plus your asymmetrical locking
scheme would work. One of the challenges you will hit is avoiding ABBA
between your "get" lock and the wqh, but good luck!
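
To make the ABBA concern concrete, here is a rough userspace sketch. It is
not kernel code and not a proposed fix; get_lock, wqh_lock and all of the
function names are hypothetical stand-ins, and atomic_flag is used as a toy
try-lock. The client path takes get_lock and then wqh_lock, while the
wakeup callback already runs under wqh_lock, so it can only *try* for
get_lock and must back off if it loses.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical locks: get_lock guards the client's file pointer,
 * wqh_lock stands in for the wait-queue-head lock. */
atomic_flag get_lock = ATOMIC_FLAG_INIT;
atomic_flag wqh_lock = ATOMIC_FLAG_INIT;

bool hooked = true;       /* is our waiter still on the wait queue? */
bool file_gone = false;   /* has the HUP been delivered? */

bool try_lock(atomic_flag *l)
{
    return !atomic_flag_test_and_set_explicit(l, memory_order_acquire);
}

void unlock(atomic_flag *l)
{
    atomic_flag_clear_explicit(l, memory_order_release);
}

void spin_lock_flag(atomic_flag *l)
{
    while (!try_lock(l))
        ;   /* busy-wait; good enough for a sketch */
}

/* Client path: lock order is get_lock -> wqh_lock. */
int client_unhook(void)
{
    bool was_hooked;

    spin_lock_flag(&get_lock);
    spin_lock_flag(&wqh_lock);
    was_hooked = hooked;
    hooked = false;
    unlock(&wqh_lock);
    unlock(&get_lock);
    return was_hooked ? 0 : -ENOENT;
}

/* Wakeup callback: called with wqh_lock held. Blocking on get_lock here
 * would be wqh_lock -> get_lock, the reverse order, so only try. */
int hup_callback(void)
{
    if (!try_lock(&get_lock))
        return -EBUSY;   /* caller has to defer or retry */
    file_gone = true;
    unlock(&get_lock);
    return 0;
}

/* Release path: takes wqh_lock, then invokes the callback under it. */
int release_path(void)
{
    int ret;

    spin_lock_flag(&wqh_lock);
    ret = hup_callback();
    unlock(&wqh_lock);
    return ret;
}
```

Backing off avoids the deadlock, but it also means the release path now
needs somewhere to defer the work to, which is part of why I don't think
this is as simple as it looks.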

