lockless poll() (was Re: namei() query)

From: kumon@flab.fujitsu.co.jp
Date: Mon Apr 24 2000 - 07:36:00 EST


Linus Torvalds writes:
> On Fri, 21 Apr 2000, Manfred Spraul wrote:
> > kumon@flab.fujitsu.co.jp wrote:
> > > But I don't realy understand, what portion actually needs the lock?
> > Only the innermost "->poll()" call needs the lock.
>
> Actually, that is something we might change right now - the read() and

We've measured lockless poll() performance; here are the results.

We removed lock_kernel()/unlock_kernel() from do_poll()/do_select()
and also from sock_poll(), and put a guard around the ->poll() call,
something like:

                if (file->f_op && file->f_op->poll) {
                        if (file->f_op->poll == sock_poll) {
                                /* sockets: call ->poll() locklessly */
                                mask = file->f_op->poll(file, wait);
                        } else {
                                /* other drivers may still rely on
                                   the big kernel lock */
                                lock_kernel();
                                mask = file->f_op->poll(file, wait);
                                unlock_kernel();
                        }
                }
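
The reason sock_poll() is a safe first candidate: it is only a
dispatcher that looks up the socket behind the file and calls the
protocol's own poll op (tcp_poll() etc.). From memory, its body is
roughly the following (see net/socket.c for the exact 2.3.40 code);
the real question is whether each protocol's poll routine is SMP-safe
without the big lock:

        static unsigned int sock_poll(struct file *file, poll_table *wait)
        {
                struct socket *sock;

                /* map the file to its socket, then delegate to the
                   protocol-specific poll routine, e.g. tcp_poll() */
                sock = socki_lookup(file->f_dentry->d_inode);
                return sock->ops->poll(file, sock, wait);
        }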

As the following stext_lock section analysis shows, lock contention
actually decreased.
# do_close is now at the top of stext_lock..

Lockless poll() version
;cpu0  cpu1  cpu2  cpu3  all-cpu  where                     lockvar
 2220  2192  2220  2141     8773  TOTAL
  456   417   499   411     1783  do_close+324              0xc026b7e4
  218   185   166   163      732  _fput+82                  0xc026b7e4
  171   198   179   151      699  sys_newstat+79            0xc026b7e4
  172   146   190   159      667  sys_open+132              0xc026b7e4
  119   147   141   164      571  old_mmap+416              0xc026b7e4
  129   132   161   141      563  sock_map_fd+132           0xc026b7e4
  123   140   144   151      558  tcp_accept+117            (%ecx)
  145   141   116   144      546  sys_fcntl+236             0xc026b7e4
  108   104   110   109      431  wait_for_tcp_memory+797   (%ecx)
   79    71    69   118      337  schedule+2000             0xc026b7e4

Original version
;cpu0  cpu1  cpu2  cpu3  all-cpu  where                     lockvar
 3362  3409  3276  3381    13428  TOTAL
  504   549   490   592     2135  schedule+2000             0xc026e7e4
  384   387   367   407     1545  do_close+324              0xc026e7e4
  355   351   325   343     1374  sock_poll+232             0xc026e7e4
  322   297   299   265     1183  do_select+368             0xc026e7e4
  226   242   179   229      876  sys_newstat+79            0xc026e7e4
  207   190   186   184      767  sys_open+132              0xc026e7e4
  161   172   178   205      716  sys_fcntl+236             0xc026e7e4
  184   170   182   165      701  sock_map_fd+132           0xc026e7e4
  162   179   170   150      661  _fput+82                  0xc026e7e4

In these experiments, the aggregate client capacity limits the Apache
throughput. So the transaction rate does not increase with the
optimization; instead, the idle time increases.

The following is the current execution status:
                 2.3.40-orig   2.3.40-Ds   2.3.40-Po
        ---------------------------------------------
        User         17.2%        17.3%       17.3%
        System       29.2%        28.9%       27.6%
        Idle         60.5%        60.7%       62.1%
        Trans/s       1524         1527        1525

2.3.40-orig: original 2.3.40, with the movb optimization and the
                separate kernel signature optimization included.
2.3.40-Ds: lock_kernel()/unlock_kernel() moved inside the for(;;)
                loop (see the sketch below).
2.3.40-Po: kernel lock removed from sock_poll().
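
For reference, the -Ds change has roughly this shape (pseudocode, not
the verbatim diff against 2.3.40):

        /* 2.3.40-orig: the big kernel lock is held from entry
           to exit of the select/poll loop */
        lock_kernel();
        for (;;) {
                /* ... scan all fds, calling ->poll() on each ... */
        }
        unlock_kernel();

        /* 2.3.40-Ds: the lock is taken and dropped once per pass,
           shortening the hold time */
        for (;;) {
                lock_kernel();
                /* ... scan all fds, calling ->poll() on each ... */
                unlock_kernel();
        }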

In the heavy-duty case, with the poll() optimization applied,
csum_partial_copy_generic() becomes the new worst time-consuming
function. We are putting together the overall profile now.

Though csum_partial_copy_generic() is highly optimized, hand-crafted
code, it still eats a lot of time. That may be inevitable, but it may
also be reducible; we are investigating why it costs so much.
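
For context, csum_partial_copy_generic() copies data between user and
kernel buffers while folding the bytes into the Internet checksum in
the same pass. A naive C illustration of the idea (hypothetical code,
nothing like the hand-crafted assembly version):

        /* toy version: copy and checksum in one pass; 16-bit words
           are summed, with the carries folded at the end */
        static unsigned int csum_and_copy(const unsigned char *src,
                                          unsigned char *dst, int len,
                                          unsigned int sum)
        {
                int i;

                for (i = 0; i + 1 < len; i += 2) {
                        dst[i] = src[i];
                        dst[i + 1] = src[i + 1];
                        sum += (src[i] << 8) | src[i + 1];
                }
                if (i < len) {                  /* trailing odd byte */
                        dst[i] = src[i];
                        sum += src[i] << 8;
                }
                while (sum >> 16)               /* fold the carries */
                        sum = (sum & 0xffff) + (sum >> 16);
                return sum;
        }

Since it must touch every payload byte, its cost scales with
throughput, which is why it surfaces once the locking overhead is gone.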

# In a trivial benchmark, which does nothing but unix-domain
# send/recv among lots of processes waiting in select(), the system
# time is cut almost in half by the optimization.
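
In outline, that trivial benchmark is something like the sketch below
(NPROC and ROUNDS here are arbitrary, not the values we used):

        #include <stdlib.h>
        #include <unistd.h>
        #include <sys/types.h>
        #include <sys/time.h>
        #include <sys/socket.h>
        #include <sys/wait.h>

        #define NPROC  64               /* number of socket pairs */
        #define ROUNDS 10000            /* ping-pongs per pair */

        /* each process ping-pongs one byte over an AF_UNIX socket,
           blocking in select() before every read */
        static void worker(int fd, int initiator)
        {
                char c = 'x';
                fd_set rfds;
                int i;

                for (i = 0; i < ROUNDS; i++) {
                        if (initiator && write(fd, &c, 1) != 1)
                                exit(1);
                        FD_ZERO(&rfds);
                        FD_SET(fd, &rfds);
                        if (select(fd + 1, &rfds, NULL, NULL, NULL) < 0)
                                exit(1);
                        if (read(fd, &c, 1) != 1)
                                exit(1);
                        if (!initiator && write(fd, &c, 1) != 1)
                                exit(1);
                }
                exit(0);
        }

        int main(void)
        {
                int i, sv[2];

                for (i = 0; i < NPROC; i++) {
                        if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
                                return 1;
                        if (fork() == 0)
                                worker(sv[0], 1);
                        if (fork() == 0)
                                worker(sv[1], 0);
                        close(sv[0]);
                        close(sv[1]);
                }
                while (wait(NULL) > 0)
                        ;
                return 0;
        }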

--
Computer Systems Laboratory, Fujitsu Labs.
kumon@flab.fujitsu.co.jp
