[PATCH] af_unix: unix_write_space() use keyed wakeups

From: Eric Dumazet
Date: Sat Oct 30 2010 - 02:45:00 EST


Le vendredi 29 octobre 2010 Ã 21:27 +0200, Eric Dumazet a Ãcrit :
> Le vendredi 29 octobre 2010 Ã 19:18 +0100, Alban Crequy a Ãcrit :
> > Hi,
> >
> > When a process calls the poll or select, the kernel calls (struct
> > file_operations)->poll on every file descriptor and returns a mask of
> > events which are ready. If the process is only interested by POLLIN
> > events, the mask is still computed for POLLOUT and it can be expensive.
> > For example, on Unix datagram sockets, a process running poll() with
> > POLLIN will wakes-up when the remote end call read(). This is a
> > performance regression introduced when fixing another bug by
> > 3c73419c09a5ef73d56472dbfdade9e311496e9b and
> > ec0d215f9420564fc8286dcf93d2d068bb53a07e.
> >

unix_write_space() doesnât yet use the wake_up_interruptible_sync_poll()
to restrict wakeups to only POLLOUT | POLLWRNORM | POLLWRBAND interested
sleepers. Same for unix_dgram_recvmsg()

In your pathological case, each time the other process calls
unix_dgram_recvmsg(), it loops on 800 pollwake() /
default_wake_function() / try_to_wake_up(), which are obviously
expensive, as you pointed out with your test program, carefully designed
to show the false sharing and O(N^2) effect :)

Once do_select() thread can _really_ block, the false sharing problem
disappears for good.

We still loop on 800 items, on each wake_up_interruptible_sync_poll()
call, so maybe we want to optimize this later, adding a global key,
ORing all items keys. I dont think its worth the added complexity, given
the biased usage of your program (800 'listeners' to one event). Is it a
real life scenario ?

Thanks

[PATCH] af_unix: use keyed wakeups

Instead of wakeup all sleepers, use wake_up_interruptible_sync_poll() to
wakeup only ones interested into writing the socket.

This patch is a specialization of commit 37e5540b3c9d (epoll keyed
wakeups: make sockets use keyed wakeups).

On a test program provided by Alan Crequy :

Before:
real 0m3.101s
user 0m0.000s
sys 0m6.104s

After:

real 0m0.211s
user 0m0.000s
sys 0m0.208s

Reported-by: Alban Crequy <alban.crequy@xxxxxxxxxxxxxxx>
Signed-off-by: Eric Dumazet <eric.dumazet@xxxxxxxxx>
Cc: Davide Libenzi <davidel@xxxxxxxxxxxxxxx>
---
net/unix/af_unix.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3c95304..f33c595 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -316,7 +316,8 @@ static void unix_write_space(struct sock *sk)
if (unix_writable(sk)) {
wq = rcu_dereference(sk->sk_wq);
if (wq_has_sleeper(wq))
- wake_up_interruptible_sync(&wq->wait);
+ wake_up_interruptible_sync_poll(&wq->wait,
+ POLLOUT | POLLWRNORM | POLLWRBAND);
sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
}
rcu_read_unlock();
@@ -1710,7 +1711,8 @@ static int unix_dgram_recvmsg(struct kiocb *iocb, struct socket *sock,
goto out_unlock;
}

- wake_up_interruptible_sync(&u->peer_wait);
+ wake_up_interruptible_sync_poll(&u->peer_wait,
+ POLLOUT | POLLWRNORM | POLLWRBAND);

if (msg->msg_name)
unix_copy_addr(msg, skb->sk);


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/