[RFC PATCH] sched/wait: Introduce new, more compact wait_event*() primitives
From: Ingo Molnar
Date: Wed Mar 08 2017 - 03:44:48 EST
* Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> On Fri, Mar 3, 2017 at 12:13 PM, Linus Torvalds
> <torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
> >
> > The fact that you can include <linux/wait.h>, and then cannot use the
> > wait event functions because you're missing "signal_pending()" is
> > complete garbage. This needs to be fixed.
>
> Here's a (totally untested) patch that tries to do exactly that.
Sorry, I didn't forget about this problem - I worked on it over the weekend, but it
just turned out to be a lot more complex than I expected.
I took a more radical approach than you, and here are the patches I have so far:
0c32c5717beb sched/wait: Introduce new, more compact wait_event*() primitives
67b01ca3fda2 sched/wait: Disambiguate wq_entry->task_list and wq_head->task_list naming
48af925509bc sched/wait: Move bit_wait_table[] and related functionality from sched/core.c to sched/wait_bit.c
bc151d1fd3e9 sched/wait: Split out the wait_bit*() APIs from <linux/wait.h> into <linux/wait_bit.h>
b2d294b5824c sched/wait: Re-adjust macro line continuation backslashes in <linux/wait.h>
e9bbf53778c2 sched/wait: Improve the bit-wait API parameter names in the API function prototypes
6a6899db8e5a sched/wait: Standardize wait_bit_queue naming
8e060b6e033c sched/wait: Standardize 'struct wait_bit_queue' wait-queue entry field name
139793f6ac6f sched/wait: Standardize internal naming of wait-queue heads
bb393b1a7e11 sched/wait: Standardize internal naming of wait-queue entries
28376289373c sched/wait: Rename wait_queue_t => wait_queue_entry_t
You can find the latest (WIP, frequently rebased) version in:
git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.sched/core
The first 10 patches reshape the waitqueue code to be more hackable (to me!),
because I kept bumping into confusing names all the time. The real change you'd
be most interested in is, in RFC (to be rebased!) form, the following one:
0c32c5717beb sched/wait: Introduce new, more compact wait_event*() primitives
I've attached that patch below as well.
The idea is to allow call sites to supply the 'condition' expression as free-form C
code, while pushing everything else into non-macro form: there's a 'struct
wait_event_state' on the stack, plus a state machine. The waiting logic is converted
from procedural form to a state machine, because we have to call out into the
'condition' code in different circumstances.
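In very rough, simplified form the call-site side would look something like this
(an illustrative sketch only - the helper name and the 'done' field are
placeholders, the real thing is in the attached patch):

#define wait_event(wq_head, condition)					\
do {									\
	/* Per-call-site wait state, lives on the stack: */		\
	struct wait_event_state __wes = { };				\
									\
	do {								\
		/* Step the out-of-line state machine in wait.c,	\
		 * passing in the current condition value: */		\
		__wait_event_step(&(wq_head), &__wes, !!(condition));	\
	} while (!__wes.done);						\
} while (0)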
It's a pretty weird construct, but I think it works, and it has quite a few
advantages (and a few disadvantages ...).
There are a lot of caveats with this RFC patch:
- Not signed off; I'm not 100% sure it's fully correct yet, but superficial
testing suggests that it boots - though PeterZ saw a boot hang on a
server machine ...
- NOTE: the v1/v2 structure is just a hack to make it easier to test - the final
version wouldn't have such a construct.
- Peter doesn't like the bools in the struct - we could use char instead.
- It needs a proper state machine diagram and more documentation - this is just a
very quick prototype to see whether the concept has any chance of working.
The main advantage is the much reduced call site overhead, plus the fact that we
move much of the logic into the .c space.
Right now wait_event() [which is in fact the smallest wait-event primitive!]
generates this humongous amount of inlined code on an x86-64 defconfig:
long test_flag;
DECLARE_WAIT_QUEUE_HEAD(test_waitqueue_head);

void test_function(void)
{
	wait_event(test_waitqueue_head, test_flag != 0);
}
00000000000084e0 <test_function>:
84e0: 55 push %rbp
84e1: 48 89 e5 mov %rsp,%rbp
84e4: 48 83 ec 28 sub $0x28,%rsp
84e8: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 84ef <test_function+0xf>
84ef: 85 c0 test %eax,%eax
84f1: 74 4f je 8542 <test_function+0x62>
84f3: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 84fb <test_function+0x1b>
84fa: 00
84fb: 74 02 je 84ff <test_function+0x1f>
84fd: c9 leaveq
84fe: c3 retq
84ff: 48 8d 7d d8 lea -0x28(%rbp),%rdi
8503: 31 f6 xor %esi,%esi
8505: e8 00 00 00 00 callq 850a <test_function+0x2a>
850a: eb 05 jmp 8511 <test_function+0x31>
850c: e8 00 00 00 00 callq 8511 <test_function+0x31>
8511: 48 8d 75 d8 lea -0x28(%rbp),%rsi
8515: ba 02 00 00 00 mov $0x2,%edx
851a: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
8521: e8 00 00 00 00 callq 8526 <test_function+0x46>
8526: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 852e <test_function+0x4e>
852d: 00
852e: 74 dc je 850c <test_function+0x2c>
8530: 48 8d 75 d8 lea -0x28(%rbp),%rsi
8534: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
853b: e8 00 00 00 00 callq 8540 <test_function+0x60>
8540: c9 leaveq
8541: c3 retq
8542: e8 00 00 00 00 callq 8547 <test_function+0x67>
8547: eb aa jmp 84f3 <test_function+0x13>
That's 5 inlined function calls (!).
Note that it's even worse with some of the more complex wait_event*() constructs;
for example, the popular wait_event_interruptible_timeout() API generates this much
code per call site:
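The test call site here is the timeout variant of the same trivial function -
roughly the following (the 10000 jiffies timeout matches the $0x2710 constant
in the disassembly below):

long test_flag;
DECLARE_WAIT_QUEUE_HEAD(test_waitqueue_head);

void test_function_timeout(void)
{
	/* Wait for test_flag to become non-zero, but give up after
	 * 10000 jiffies or if a signal is pending: */
	wait_event_interruptible_timeout(test_waitqueue_head, test_flag != 0, 10000);
}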
00000000000084d0 <test_function_timeout>:
84d0: 55 push %rbp
84d1: 48 89 e5 mov %rsp,%rbp
84d4: 53 push %rbx
84d5: 48 83 ec 28 sub $0x28,%rsp
84d9: 65 8b 05 00 00 00 00 mov %gs:0x0(%rip),%eax # 84e0 <test_function_timeout+0x10>
84e0: 85 c0 test %eax,%eax
84e2: 0f 84 a0 00 00 00 je 8588 <test_function_timeout+0xb8>
84e8: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 84f0 <test_function_timeout+0x20>
84ef: 00
84f0: 74 07 je 84f9 <test_function_timeout+0x29>
84f2: 48 83 c4 28 add $0x28,%rsp
84f6: 5b pop %rbx
84f7: 5d pop %rbp
84f8: c3 retq
84f9: 48 8d 7d d0 lea -0x30(%rbp),%rdi
84fd: 31 f6 xor %esi,%esi
84ff: bb 10 27 00 00 mov $0x2710,%ebx
8504: e8 00 00 00 00 callq 8509 <test_function_timeout+0x39>
8509: 48 8d 75 d0 lea -0x30(%rbp),%rsi
850d: ba 01 00 00 00 mov $0x1,%edx
8512: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
8519: e8 00 00 00 00 callq 851e <test_function_timeout+0x4e>
851e: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 8526 <test_function_timeout+0x56>
8525: 00
8526: 0f 95 c2 setne %dl
8529: 31 c9 xor %ecx,%ecx
852b: 84 d2 test %dl,%dl
852d: 75 42 jne 8571 <test_function_timeout+0xa1>
852f: 84 c9 test %cl,%cl
8531: 75 3e jne 8571 <test_function_timeout+0xa1>
8533: 48 85 c0 test %rax,%rax
8536: 75 ba jne 84f2 <test_function_timeout+0x22>
8538: 48 89 df mov %rbx,%rdi
853b: e8 00 00 00 00 callq 8540 <test_function_timeout+0x70>
8540: 48 8d 75 d0 lea -0x30(%rbp),%rsi
8544: ba 01 00 00 00 mov $0x1,%edx
8549: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
8550: 48 89 c3 mov %rax,%rbx
8553: e8 00 00 00 00 callq 8558 <test_function_timeout+0x88>
8558: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 8560 <test_function_timeout+0x90>
855f: 00
8560: 0f 95 c2 setne %dl
8563: 48 85 db test %rbx,%rbx
8566: 0f 94 c1 sete %cl
8569: 84 d2 test %dl,%dl
856b: 74 c2 je 852f <test_function_timeout+0x5f>
856d: 84 c9 test %cl,%cl
856f: 74 ba je 852b <test_function_timeout+0x5b>
8571: 48 8d 75 d0 lea -0x30(%rbp),%rsi
8575: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
857c: e8 00 00 00 00 callq 8581 <test_function_timeout+0xb1>
8581: 48 83 c4 28 add $0x28,%rsp
8585: 5b pop %rbx
8586: 5d pop %rbp
8587: c3 retq
8588: e8 00 00 00 00 callq 858d <test_function_timeout+0xbd>
858d: e9 56 ff ff ff jmpq 84e8 <test_function_timeout+0x18>
Which is about 50 instructions (!!!), and about 180 bytes of text overhead.
We have over 1,900 wait_event() call sites:
triton:~/tip> git grep -E 'wait_event.*\(' | wc -l
1972
which means that, with an average per-call-site overhead of ~100 bytes of code, the
total text bloat comes to roughly 190 KB (1,972 * ~100 bytes) ...
This was pretty much hidden from common forms of bloat and overhead analysis so
far, because it's all inlined at the CPP level and hidden 'inside' various normal
functions.
With my patch it's just a loop with a single function call, and the call-site
impact is much lower:
0000000000008490 <test_function>:
8490: 55 push %rbp
8491: 48 89 e5 mov %rsp,%rbp
8494: 48 83 ec 38 sub $0x38,%rsp
8498: c6 45 c8 00 movb $0x0,-0x38(%rbp)
849c: 31 d2 xor %edx,%edx
849e: 48 83 3d 00 00 00 00 cmpq $0x0,0x0(%rip) # 84a6 <test_function+0x16>
84a5: 00
84a6: 48 8d 75 c8 lea -0x38(%rbp),%rsi
84aa: 48 c7 c7 00 00 00 00 mov $0x0,%rdi
84b1: 0f 95 c2 setne %dl
84b4: e8 00 00 00 00 callq 84b9 <test_function+0x29>
84b9: 80 7d ca 00 cmpb $0x0,-0x36(%rbp)
84bd: 74 dd je 849c <test_function+0xc>
84bf: c9 leaveq
84c0: c3 retq
Which is about 9 instructions.
So with the wait-event state machine it's a very sweet, compact construct with
only a single function call - 9 instructions total, even with the more complex API
variants. The original version generates 20-50+ instructions, depending on
complexity.
Also note that since my approach uninlines much of the iterator, we could more
aggressively inline _that_ central state machine function and get rid of the many
function calls it does internally.
I.e. it's not just debloating but an all-around speedup for the actual-waiting
case, which does happen frequently. For the doesn't-wait case there's a bit more
overhead, due to the extra function call.
But the never-ending quest of an uncaring universe makes our life more difficult,
because there are disadvantages to a state machine as well:
- State machines are so hellishly difficult to read, to get right and to
maintain.
- While most wait_event() constructs do end up schedule()ing for real, the
'wait condition is already true' case still gets burdened with an extra
function call, because we have to call into wait.c. We could inline that first
check, at the cost of more inlining overhead - see the sketch right after this list.
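As a rough sketch, reusing the placeholder names from the earlier snippet, the
inlined-fast-path variant would look something like:

#define wait_event(wq_head, condition)					\
do {									\
	struct wait_event_state __wes = { };				\
									\
	/* Fast path, inlined at the call site: the condition is	\
	 * already true, no need to call into wait.c at all: */	\
	if (condition)							\
		break;							\
									\
	do {									\
		__wait_event_step(&(wq_head), &__wes, !!(condition));	\
	} while (!__wes.done);						\
} while (0)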
OTOH with this method we'd only have to get the state machine right roughly once,
and we'd have all the API variants in the wait.c space. We could use proper function
pointers, inlining and other meta-coding constructs within wait.c to reduce the
number of variants in the source code space without adding runtime overhead.
In fact I think we could optimize and shape these functions a lot better within
wait.c than in the macro space.
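For example - purely illustrative, with placeholder names, and using plain
parameterization rather than function pointers - the various wait_event*() flavors
could all funnel into a single common stepping function in wait.c, with thin
always-inline wrappers supplying the per-variant task state and timeout:

/* One shared state machine, out of line in wait.c: */
long __wait_event_step_common(wait_queue_head_t *wq_head,
			      struct wait_event_state *wes,
			      int condition_true,
			      int task_state, long timeout);

/* Thin per-variant wrapper in the header, picking the constants: */
static __always_inline long
__wait_event_step(wait_queue_head_t *wq_head, struct wait_event_state *wes,
		  int condition_true)
{
	return __wait_event_step_common(wq_head, wes, condition_true,
					TASK_UNINTERRUPTIBLE,
					MAX_SCHEDULE_TIMEOUT);
}

The wrapper collapses into a single call with per-variant constants, so each call
site still costs only one function call.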
In any case, it's clear that this stuff is in no way v4.11 material, so as a
bridging fix I propose we add a sched/signal.h include to wait.h (or just move
signal_pending() temporarily), until it's all resolved for real for v4.12.
Thoughts?
Thanks,
Ingo
=================>