Re: [PATCHv5 2/2] memory barrier: adding smp_mb__after_lock

From: Mathieu Desnoyers
Date: Tue Jul 07 2009 - 10:57:38 EST


* Eric Dumazet (eric.dumazet@xxxxxxxxx) wrote:
> Mathieu Desnoyers wrote:
> > * Jiri Olsa (jolsa@xxxxxxxxxx) wrote:
> >> On Tue, Jul 07, 2009 at 12:18:16PM +0200, Jiri Olsa wrote:
> >>> On Fri, Jul 03, 2009 at 01:18:48PM +0200, Jiri Olsa wrote:
> >>>> On Fri, Jul 03, 2009 at 12:25:30PM +0200, Ingo Molnar wrote:
> >>>>> * Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >>>>>
> >>>>>> On Fri, Jul 03, 2009 at 11:24:38AM +0200, Ingo Molnar wrote:
> >>>>>>> * Eric Dumazet <eric.dumazet@xxxxxxxxx> wrote:
> >>>>>>>
> >>>>>>>> Ingo Molnar wrote:
> >>>>>>>>> * Jiri Olsa <jolsa@xxxxxxxxxx> wrote:
> >>>>>>>>>
> >>>>>>>>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>>>>>>>> @@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>>>>>>>> #define _raw_read_relax(lock) cpu_relax()
> >>>>>>>>>> #define _raw_write_relax(lock) cpu_relax()
> >>>>>>>>>>
> >>>>>>>>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>>>>>>>> +#define smp_mb__after_lock() do { } while (0)
> >>>>>>>>> Two small stylistic comments, please make this an inline function:
> >>>>>>>>>
> >>>>>>>>> static inline void smp_mb__after_lock(void) { }
> >>>>>>>>> #define smp_mb__after_lock
> >>>>>>>>>
> >>>>>>>>> (untested)
> >>>>>>>>>
> >>>>>>>>>> +/* The lock does not imply full memory barrier. */
> >>>>>>>>>> +#ifndef smp_mb__after_lock
> >>>>>>>>>> +#define smp_mb__after_lock() smp_mb()
> >>>>>>>>>> +#endif
> >>>>>>>>> ditto.
> >>>>>>>>>
> >>>>>>>>> Ingo
> >>>>>>>> This was following the existing implementation of various smp_mb__*() helpers:
> >>>>>>>>
> >>>>>>>> # grep -4 smp_mb__before_clear_bit include/asm-generic/bitops.h
> >>>>>>>>
> >>>>>>>> /*
> >>>>>>>> * clear_bit may not imply a memory barrier
> >>>>>>>> */
> >>>>>>>> #ifndef smp_mb__before_clear_bit
> >>>>>>>> #define smp_mb__before_clear_bit() smp_mb()
> >>>>>>>> #define smp_mb__after_clear_bit() smp_mb()
> >>>>>>>> #endif
> >>>>>>> Did I mention that those should be fixed too? :-)
> >>>>>>>
> >>>>>>> Ingo
> >>>>>> ok, could I include it in 2/2, or would you prefer a separate patch?
> >>>>> depends on whether it will regress ;-)
> >>>>>
> >>>>> If it regresses, it's better to have it separate. If it won't, it can
> >>>>> be included. If unsure, default to the more conservative option.
> >>>>>
> >>>>> Ingo
> >>>>
> >>>> How about this...
> >>>> and a similar change for smp_mb__before_clear_bit in a separate patch
> >>>>
> >>>>
> >>>> diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
> >>>> index b7e5db8..4e77853 100644
> >>>> --- a/arch/x86/include/asm/spinlock.h
> >>>> +++ b/arch/x86/include/asm/spinlock.h
> >>>> @@ -302,4 +302,8 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
> >>>> #define _raw_read_relax(lock) cpu_relax()
> >>>> #define _raw_write_relax(lock) cpu_relax()
> >>>>
> >>>> +/* The {read|write|spin}_lock() on x86 are full memory barriers. */
> >>>> +static inline void smp_mb__after_lock(void) { }
> >>>> +#define ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +
> >>>> #endif /* _ASM_X86_SPINLOCK_H */
> >>>> diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
> >>>> index 252b245..4be57ab 100644
> >>>> --- a/include/linux/spinlock.h
> >>>> +++ b/include/linux/spinlock.h
> >>>> @@ -132,6 +132,11 @@ do { \
> >>>> #endif /*__raw_spin_is_contended*/
> >>>> #endif
> >>>>
> >>>> +/* The lock does not imply full memory barrier. */
> >>>> +#ifndef ARCH_HAS_SMP_MB_AFTER_LOCK
> >>>> +static inline void smp_mb__after_lock(void) { smp_mb(); }
> >>>> +#endif
> >>>> +
> >>>> /**
> >>>> * spin_unlock_wait - wait until the spinlock gets unlocked
> >>>> * @lock: the spinlock in question.
> >>>> diff --git a/include/net/sock.h b/include/net/sock.h
> >>>> index 4eb8409..98afcd9 100644
> >>>> --- a/include/net/sock.h
> >>>> +++ b/include/net/sock.h
> >>>> @@ -1271,6 +1271,9 @@ static inline int sk_has_allocations(const struct sock *sk)
> >>>> * in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1
> >>>> * could then endup calling schedule and sleep forever if there are no more
> >>>> * data on the socket.
> >>>> + *
> >>>> + * The sk_has_sleeper is always called right after a call to read_lock, so we
> >>>> + * can use the smp_mb__after_lock barrier.
> >>>> */
> >>>> static inline int sk_has_sleeper(struct sock *sk)
> >>>> {
> >>>> @@ -1280,7 +1283,7 @@ static inline int sk_has_sleeper(struct sock *sk)
> >>>> *
> >>>> * This memory barrier is paired in the sock_poll_wait.
> >>>> */
> >>>> - smp_mb();
> >>>> + smp_mb__after_lock();
> >>>> return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
> >>>> }
> >>>>
> >>> Any feedback on this?
> >>> I'd send v6 if this approach is acceptable.
> >>>
> >>> thanks,
> >>> jirka
> >> Also, I checked smp_mb__before_clear_bit/smp_mb__after_clear_bit, and
> >> they are used quite extensively.
> >>
> >> I'd prefer to send that in a separate patch, so we can move on with the
> >> changes I've sent so far.
> >>
> >
> > As with any optimization (and this is one that adds a semantic that will
> > just grow the memory barrier/locking rule complexity), it should come
> > with performance benchmarks showing real-life improvements.
> >
> > Otherwise I'd recommend sticking to smp_mb() if this execution path is
> > not that critical, or to move to RCU if it's _that_ critical.
> >
> > A valid argument would be if the protected data structures are so
> > complex that RCU is out of the question, but the few cycles saved by
> > removing a memory barrier are still really significant. And even then,
> > the proper solution would be something more like
> > __read_lock()+smp_mb+smp_mb+__read_unlock(), so we get the performance
> > improvement on architectures other than x86 as well.
> >
> > So in all cases, I don't think the smp_mb__after_lock() is the
> > appropriate solution.
>
> RCU on this part is out of the question, as David already mentioned.
>

OK

> It would be a regression for short-lived TCP/UDP sessions, and some workloads
> use them a lot...
>
> We gained about 20% performance between 2.6.26 and 2.6.31 by carefully removing
> some atomic ops in the network stack and adding RCU where it was sensible, but
> this is a painful process, not something Jiri can use to fix bugs on legacy
> Red Hat kernels :) (We are still sorting out regressions.)
>

Yep, I can understand that. Tbench on localhost is an especially good
benchmark for this ;)

> To solve the problem pointed out by Jiri, we have to insert an smp_mb() at this
> point (not mentioning the other change in the select() logic, of course):
>
> static void sock_def_readable(struct sock *sk, int len)
> {
> read_lock(&sk->sk_callback_lock);
> + smp_mb(); /* paired with opposite smp_mb() in sk poll logic */
> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
> POLLRDNORM | POLLRDBAND);
> sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
> read_unlock(&sk->sk_callback_lock);
> }
>
> As about every incoming packet goes through this path, we should be very
> careful not to slow down the stack unless necessary.
>
> On x86 this extra smp_mb() is not needed, since the previous call to read_lock()
> already gives us the full barrier for free.
>
>

Well, I see that __read_lock() + 2x smp_mb() + __read_unlock() is not well
suited for x86. You're right.

But read_lock + smp_mb__after_lock + read_unlock is not well suited for
powerpc, arm, mips, and probably others, where there is already an explicit
memory barrier at the end of the read-lock primitive.
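To illustrate very roughly what I mean (a sketch from memory, not actual arch
code -- the exact barrier instructions differ between architectures, and
some_lock is just a placeholder):

	read_lock(&some_lock);
		/* lock-word update, then the lock's own trailing barrier
		 * (an acquire barrier such as isync/lwsync on powerpc)
		 */
	smp_mb__after_lock();
		/* generic fallback: a second, full barrier (sync/dmb)
		 * right on top of the one above
		 */

So such an architecture would execute two barrier instructions back to back,
while x86 pays for none at all.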

One thing that would be efficient for all architectures is to create a
locking primitive that includes the smp_mb(), e.g.:

read_lock_smp_mb()

which would act as a read_lock that does a full smp_mb() after the lock
is taken.

The naming may be a bit odd; better ideas are welcome.
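Something along these lines, perhaps (a completely untested sketch; the
ARCH_HAS_READ_LOCK_SMP_MB guard is made up here for illustration, following
Jiri's ARCH_HAS_SMP_MB_AFTER_LOCK pattern):

/* include/linux/spinlock.h: generic fallback, the lock followed by a full barrier */
#ifndef ARCH_HAS_READ_LOCK_SMP_MB
#define read_lock_smp_mb(lock)		\
do {					\
	read_lock(lock);		\
	smp_mb();			\
} while (0)
#endif

/*
 * arch/x86/include/asm/spinlock.h: read_lock() is already a full memory
 * barrier, so the combined primitive is simply the plain lock.
 */
#define ARCH_HAS_READ_LOCK_SMP_MB
#define read_lock_smp_mb(lock)	read_lock(lock)

An architecture like powerpc could then implement it with a single full
barrier taking the place of its usual acquire barrier, instead of paying for
both. The callers would look something like:

	read_lock_smp_mb(&sk->sk_callback_lock);
	if (sk_has_sleeper(sk))
		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
						POLLRDNORM | POLLRDBAND);

and the smp_mb__after_lock() inside sk_has_sleeper() could go away.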

Mathieu

--
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68