Re: Additional compiler barrier required in sched_preempt_enable_no_resched?
From: Vikram Mulukutla
Date: Sat May 14 2016 - 14:28:55 EST
On 5/14/2016 8:39 AM, Thomas Gleixner wrote:
On Fri, 13 May 2016, Vikram Mulukutla wrote:
On 5/13/2016 7:58 AM, Peter Zijlstra wrote:
diff --git a/include/asm-generic/preempt.h b/include/asm-generic/preempt.h
index 5d8ffa3e6f8c..c1cde3577551 100644
--- a/include/asm-generic/preempt.h
+++ b/include/asm-generic/preempt.h
@@ -7,10 +7,10 @@
 static __always_inline int preempt_count(void)
 {
-	return current_thread_info()->preempt_count;
+	return READ_ONCE(current_thread_info()->preempt_count);
 }

-static __always_inline int *preempt_count_ptr(void)
+static __always_inline volatile int *preempt_count_ptr(void)
 {
 	return &current_thread_info()->preempt_count;
 }
Thanks Peter, this patch worked for me. The compiler no longer optimizes out
the increment/decrement of preempt_count.
I have a hard time understanding why the compiler optimizes out stuff without
that patch.
We already have:
#define preempt_disable() \
do { \
	preempt_count_inc(); \
	barrier(); \
} while (0)

#define sched_preempt_enable_no_resched() \
do { \
	barrier(); \
	preempt_count_dec(); \
} while (0)

#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0)
So the barriers already forbid the compiler from reordering code. How on earth
is the compiler supposed to optimize the dec/inc out?
While I cannot claim more than a very rudimentary knowledge of
compilers, this was my initial reaction as well. But the barriers are in
the wrong spots for the way the driver in question used the code.
preempt_disable has the barrier() _after_ the increment, and
sched_preempt_enable_no_resched has it _before_ the decrement.
Therefore, if one uses preempt_enable_no_resched followed by
preempt_disable, there is no compiler barrier between the decrement and
the increment. Whether that should translate into such a seemingly
aggressive optimization is something I'm not completely sure of.
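To make that concrete, here is roughly what the back-to-back sequence
expands to (a sketch of the expansion, not exact preprocessor output):

	barrier();                               /* ..._enable_no_resched() */
	current_thread_info()->preempt_count--;
	current_thread_info()->preempt_count++;  /* preempt_disable()       */
	barrier();

Nothing sits between the decrement and the increment, so as far as the
compiler is concerned it is free to cancel the pair.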
There is more code than the preempt stuff that depends on barrier() and
expects that the compiler does not optimize things out at will. Are we supposed
to hunt down all occurrences and amend them with READ_ONCE or whatever, one by
one?
I think the barrier is working as intended for the sequence of
preempt_disable followed by preempt_enable_no_resched.
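In that order the expansion (again just a sketch) does leave a barrier
between the two accesses:

	current_thread_info()->preempt_count++;  /* preempt_disable()       */
	barrier();
	/* ... critical section ... */
	barrier();                               /* ..._enable_no_resched() */
	current_thread_info()->preempt_count--;

Since barrier() is a memory clobber, the increment has to reach memory
before it and the decrement has to reload afterwards, so both accesses
survive.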
Thanks,
tglx
As a simple experiment, I compiled the following C program with gcc on
x86. This was really to convince myself rather than you and Peter!
#include <stdio.h>

#define barrier() __asm__ __volatile__("": : :"memory")

struct thread_info {
	int pc;		/* stands in for preempt_count */
};

#define preempt_count() \
	ti_ptr->pc

/* barrier placement mirrors the kernel macros */
#define preempt_disable() \
	do { \
		preempt_count() += 1; \
		barrier(); \
	} while (0)

#define preempt_enable() \
	do { \
		barrier(); \
		preempt_count() -= 1; \
	} while (0)

struct thread_info *ti_ptr;

int main(void)
{
	struct thread_info ti;
	int b = 0;	/* initialized to avoid undefined behavior */

	ti_ptr = &ti;

	/* enable before disable: no barrier between the dec and the inc */
	preempt_enable();
	b = b + 500;
	preempt_disable();

	printf("b = %d\n", b);
	return 0;
}
With gcc -O2 I get this (the ARM compiler behaves similarly):
0000000000400470 <main>:
400470: 48 83 ec 18 sub $0x18,%rsp
400474: 48 89 25 cd 0b 20 00 mov %rsp,0x200bcd(%rip)
40047b: ba f4 01 00 00 mov $0x1f4,%edx
400480: be 14 06 40 00 mov $0x400614,%esi
400485: bf 01 00 00 00 mov $0x1,%edi
40048a: 31 c0 xor %eax,%eax
40048c: e8 cf ff ff ff callq 400460 <__printf_chk@plt>
400491: 31 c0 xor %eax,%eax
400493: 48 83 c4 18 add $0x18,%rsp
400497: c3 retq
If I swap preempt_enable and preempt_disable I get this:
0000000000400470 <main>:
400470: 48 83 ec 18 sub $0x18,%rsp
400474: 48 89 25 cd 0b 20 00 mov %rsp,0x200bcd(%rip)
40047b: 83 04 24 01 addl $0x1,(%rsp)
40047f: 48 8b 05 c2 0b 20 00 mov 0x200bc2(%rip),%rax
400486: ba f4 01 00 00 mov $0x1f4,%edx
40048b: be 14 06 40 00 mov $0x400614,%esi
400490: bf 01 00 00 00 mov $0x1,%edi
400495: 83 28 01 subl $0x1,(%rax)
400498: 31 c0 xor %eax,%eax
40049a: e8 c1 ff ff ff callq 400460 <__printf_chk@plt>
40049f: 31 c0 xor %eax,%eax
4004a1: 48 83 c4 18 add $0x18,%rsp
4004a5: c3 retq
Note: If I make ti_ptr a local variable on the stack, the
ordering/barriers no longer matter; the inc/dec is always optimized out,
presumably because the compiler can then prove that nothing outside the
function can observe the pointer. With ti_ptr globally visible, the
compiler has to treat it the way current_thread_info() is treated in the
kernel, i.e. as referring to memory that code outside the current stack
frame may access. In any case, IMHO the volatile preempt_count patch is
necessary.
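For completeness, the equivalent of Peter's volatile fix applied to the
toy program would be something like this (untested sketch; the cast
stands in for the kernel's volatile preempt_count_ptr()):

	#define preempt_count() \
		(*(volatile int *)&ti_ptr->pc)

With that, the compiler must emit every increment and decrement
regardless of barrier placement.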
Thanks,
Vikram
--