Re: epoll and multiple processes - eliminate unneeded process wake-ups

From: Madars Vitolins
Date: Sat Dec 05 2015 - 06:50:17 EST


Hi Jason,

I did the testing and wrote a blog article about it: https://mvitolin.wordpress.com/2015/12/05/endurox-testing-epollexclusive-flag/

In summary:

Test case:
- One multi-threaded binary with 10 threads makes a total of 1'000'000 calls to 250 single-threaded processes doing epoll() on a POSIX queue
- A 'call' basically sends a message to the shared queue (served by those 250 load-balanced processes), and the receiving process sends the reply back to the client thread's private queue
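
For readers who want to picture the server side of this test case, here is a minimal sketch of how one worker process might register the shared queue with its own epoll instance and wait on it. It is not the Enduro/X code. The names worker_epoll_open() and worker_wait() are hypothetical, the EPOLLEXCLUSIVE value is the one from the patch quoted further down, and the queue is passed in as a plain file descriptor (on Linux a POSIX message queue descriptor is a file descriptor, so an mqd_t can be used here).

```c
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value taken from the patch in this thread */
#endif

/*
 * Hypothetical helper: create a private epoll instance for this worker and
 * add the shared queue descriptor to it with the exclusive flag.
 * Returns the epoll fd, or -1 on error.
 */
int worker_epoll_open(int queue_fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };
    int epfd = epoll_create1(0);

    if (epfd < 0)
        return -1;
    ev.data.fd = queue_fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, queue_fd, &ev) < 0) {
        close(epfd);
        return -1;
    }
    return epfd;
}

/*
 * Hypothetical helper: block for up to timeout_ms until the queue is
 * readable. Returns the ready descriptor, or -1 on timeout or error.
 */
int worker_wait(int epfd, int timeout_ms)
{
    struct epoll_event out;

    return epoll_wait(epfd, &out, 1, timeout_ms) == 1 ? out.data.fd : -1;
}
```

Each of the 250 processes would call worker_epoll_open() once on the same queue and then loop on worker_wait(), receiving a message whenever it returns the queue descriptor.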

Tests were done on the following system:
- Host system: Linux Mint Mate 17.2 64bit, kernel: 3.13.0-24-generic
- CPU: Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz (two cores)
- RAM: 16 GB
- Virtualization platform: Oracle VirtualBox 4.3.28
- Guest OS: Gentoo Linux 2015.03, kernel 4.3.0-gentoo, 64 bit.
- CPU for guest: Two cores
- RAM for guest: 5GB (no swap usage, free about 4GB)
- Enduro/X version: 2.3.2


Results with the original kernel (no EPOLLEXCLUSIVE):

$ time ./bankcl
...

real 14m20.561s
user 0m21.823s
sys 10m49.821s


Patched kernel version with EPOLLEXCLUSIVE flag in use:
$ time ./bankcl
...
real 0m24.953s
user 0m17.497s
sys 0m4.445s

Thus 14 minutes vs. 24 seconds! So the EPOLLEXCLUSIVE flag makes the application run about *35 times faster*!
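
The ratio can be checked from the two `time` outputs above. A small sketch (the helper names to_seconds() and speedup() are made up for illustration):

```c
/* Convert a time(1) reading of the form XmY.YYYs into seconds. */
double to_seconds(int minutes, double seconds)
{
    return minutes * 60.0 + seconds;
}

/* How many times faster the "after" run is compared to the "before" run. */
double speedup(double before_s, double after_s)
{
    return before_s / after_s;
}
```

With the wall-clock figures, speedup(to_seconds(14, 20.561), to_seconds(0, 24.953)) comes out at roughly 34.5. The same calculation on the sys figures, speedup(to_seconds(10, 49.821), to_seconds(0, 4.445)), gives roughly 146, so the time spent in the kernel shrinks by far more than the wall-clock time does.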

Guys, this is a MUST HAVE patch!

Thanks,
Madars



Jason Baron @ 2015-12-01 22:11 wrote:
Hi Madars,

On 11/30/2015 04:28 PM, Madars Vitolins wrote:
Hi Jason,

Today I searched the mail archive and checked the patch you offered in February; it basically does the same thing (a flag for add_wait_queue_exclusive() + balance).

So I plan to run some tests with your patch, with the flag on and off, and will provide results. I guess that if I bring up 250 or 500 processes (which could be realistic for a production environment) waiting on one queue, there could be a notable difference in performance with EPOLLEXCLUSIVE set or not.


Sounds good. Below is an updated patch if you want to try it - it only
adds the 'EPOLLEXCLUSIVE' flag.


diff --git a/fs/eventpoll.c b/fs/eventpoll.c
index 1e009ca..265fa7b 100644
--- a/fs/eventpoll.c
+++ b/fs/eventpoll.c
@@ -92,7 +92,7 @@
*/

/* Epoll private bits inside the event mask */
-#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET)
+#define EP_PRIVATE_BITS (EPOLLWAKEUP | EPOLLONESHOT | EPOLLET | EPOLLEXCLUSIVE)

/* Maximum number of nesting allowed inside epoll sets */
#define EP_MAX_NESTS 4
@@ -1002,6 +1002,7 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
unsigned long flags;
struct epitem *epi = ep_item_from_wait(wait);
struct eventpoll *ep = epi->ep;
+ int ewake = 0;

if ((unsigned long)key & POLLFREE) {
ep_pwq_from_wait(wait)->whead = NULL;
@@ -1066,8 +1067,10 @@ static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *k
* Wake up ( if active ) both the eventpoll wait list and the ->poll()
* wait list.
*/
- if (waitqueue_active(&ep->wq))
+ if (waitqueue_active(&ep->wq)) {
+ ewake = 1;
wake_up_locked(&ep->wq);
+ }
if (waitqueue_active(&ep->poll_wait))
pwake++;

@@ -1078,6 +1081,9 @@ out_unlock:
if (pwake)
ep_poll_safewake(&ep->poll_wait);

+ if (epi->event.events & EPOLLEXCLUSIVE)
+ return ewake;
+
return 1;
}

@@ -1095,7 +1101,10 @@ static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
pwq->whead = whead;
pwq->base = epi;
- add_wait_queue(whead, &pwq->wait);
+ if (epi->event.events & EPOLLEXCLUSIVE)
+ add_wait_queue_exclusive(whead, &pwq->wait);
+ else
+ add_wait_queue(whead, &pwq->wait);
list_add_tail(&pwq->llink, &epi->pwqlist);
epi->nwait++;
} else {
@@ -1861,6 +1870,10 @@ SYSCALL_DEFINE4(epoll_ctl, int, epfd, int, op, int, fd,
if (f.file == tf.file || !is_file_epoll(f.file))
goto error_tgt_fput;

+ if ((epds.events & EPOLLEXCLUSIVE) && (op == EPOLL_CTL_MOD ||
+ (op == EPOLL_CTL_ADD && is_file_epoll(tf.file))))
+ goto error_tgt_fput;
+
/*
* At this point it is safe to assume that the "private_data" contains
* our own data structure.
diff --git a/include/uapi/linux/eventpoll.h b/include/uapi/linux/eventpoll.h
index bc81fb2..925bbfb 100644
--- a/include/uapi/linux/eventpoll.h
+++ b/include/uapi/linux/eventpoll.h
@@ -26,6 +26,9 @@
#define EPOLL_CTL_DEL 2
#define EPOLL_CTL_MOD 3

+/* Add exclusively */
+#define EPOLLEXCLUSIVE (1 << 28)
+
/*
* Request the handling of system wakeup events so as to prevent system suspends
* from happening while those events are being processed.
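
From user space, the epoll_ctl() check added by the patch means the flag can only be given at EPOLL_CTL_ADD time, and not when the target is itself an epoll descriptor. A minimal sketch to see this, assuming a kernel that carries the flag. The helper name exclusive_ctl() is hypothetical, and the error value is not visible in the hunk above; on kernels that later shipped EPOLLEXCLUSIVE, epoll_ctl(2) documents EINVAL for these cases.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value taken from the patch above */
#endif

/*
 * Hypothetical helper: issue epoll_ctl() with EPOLLIN | EPOLLEXCLUSIVE.
 * Returns 0 on success, or the errno value on failure.
 */
int exclusive_ctl(int epfd, int op, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN | EPOLLEXCLUSIVE };

    ev.data.fd = fd;
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}
```

On such a kernel, exclusive_ctl(epfd, EPOLL_CTL_ADD, fd) on an ordinary pollable descriptor is expected to succeed, while a following exclusive_ctl(epfd, EPOLL_CTL_MOD, fd), or an EPOLL_CTL_ADD whose target is another epoll descriptor, is expected to fail with EINVAL.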


During kernel hacking with debug prints, with 10 processes waiting on one event source, I saw a lot of unneeded processing inside eventpoll.c on the original kernel: a single event produced 10 calls to ep_poll_callback() and related work, which resulted in a few processes being woken up in user space (the count probably varies randomly depending on concurrency).
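
The wake-up fan-out described above can be observed from user space with a small experiment. The following is only a sketch under assumptions: it uses an eventfd in place of the POSIX queue, count_wakeups() and waiter() are hypothetical names, and the EPOLLEXCLUSIVE value is the one from the patch. Each child has its own epoll instance on the shared descriptor, the parent posts a single event that nobody consumes, and the return value is the number of children whose epoll_wait() returned.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#ifndef EPOLLEXCLUSIVE
#define EPOLLEXCLUSIVE (1u << 28) /* value taken from the patch above */
#endif

/* Child: exit status 1 if woken by the event, 0 on timeout, 2 on error. */
static void waiter(int efd, int ready_wr, uint32_t extra_flags)
{
    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    struct epoll_event out;
    int epfd = epoll_create1(0);
    int ok = epfd >= 0 && epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0;

    /* Always report in, so the parent never blocks on a failed child. */
    if (write(ready_wr, "r", 1) != 1 || !ok)
        _exit(2);
    _exit(epoll_wait(epfd, &out, 1, 1000) == 1 ? 1 : 0);
}

/* Returns how many of nproc waiting children were woken, or -1 on error. */
int count_wakeups(int nproc, uint32_t extra_flags)
{
    int ready[2], started = 0, woken = 0, posted;
    int efd = eventfd(0, 0);
    uint64_t one = 1;
    char c;

    if (efd < 0 || pipe(ready) < 0)
        return -1;

    for (int i = 0; i < nproc; i++) {
        pid_t pid = fork();
        if (pid == 0)
            waiter(efd, ready[1], extra_flags); /* never returns */
        if (pid > 0)
            started++;
    }

    /* Wait until every child has registered, then give them time to block. */
    for (int i = 0; i < started; i++)
        if (read(ready[0], &c, 1) != 1)
            break;
    poll(NULL, 0, 100); /* plain 100 ms sleep */

    /* Post one event; it is never read, so it stays pending. */
    posted = write(efd, &one, sizeof one) == (ssize_t)sizeof one;

    for (int i = 0; i < started; i++) {
        int status;
        if (wait(&status) > 0 && WIFEXITED(status) && WEXITSTATUS(status) == 1)
            woken++;
    }
    close(efd);
    close(ready[0]);
    close(ready[1]);
    return posted ? woken : -1;
}
```

Without the flag, count_wakeups(10, 0) should report all 10 children, matching the 10 callbacks seen in the debug prints. With count_wakeups(10, EPOLLEXCLUSIVE) on a kernel that has the flag, the count should drop, typically to 1, although more than one wake-up is possible if some children had not yet blocked when the event was posted.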


Meanwhile, we are not the only ones talking about this patch; others are asking too, see here: http://stackoverflow.com/questions/33226842/epollexclusive-and-epollroundrobin-flags-in-mainstream-kernel

So what is the current situation with your patch? What is blocking it from getting into mainline?


If we can show some good test results here I will re-submit it.

Thanks,

-Jason