Re: [RFC] tcp: race in receive part

From: Jiri Olsa
Date: Wed Jun 24 2009 - 06:20:57 EST


On Tue, Jun 23, 2009 at 12:32:10PM +0200, Eric Dumazet wrote:
> Jiri Olsa a Ãcrit :
> > Hi,
> >
> > thanks for an answer, and sorry for my late reply,
> > we needed the cust permission to disclose the debug data.
> >
>
> I see ! Now this is me with litle time as I am traveling right now.
>
> >
> > On Thu, Jun 18, 2009 at 04:06:42PM +0200, Eric Dumazet wrote:
> >> Jiri Olsa a Ãcrit :
> >>> Hi,
> >>>
> >>> in RHEL4 we can see a race in the tcp layer. We were not able to reproduce
> >>> this on the upstream kernel, but since the issue occurs very rarelly
> >>> (once per 8 days), we just might not be lucky.
> >>>
> >>> I'm affraid this might be a long email, I'll try to structure it nicely.. :)
> >>>
> >> Thanks for your mail and detailed analysis
> >>
> >>>
> >>> RACE DESCRIPTION
> >>> ================
> >>>
> >>> There's a nice pdf describing the issue (and sollution using locks) on
> >>> https://bugzilla.redhat.com/attachment.cgi?id=345014
> >> I could not reach this url unfortunatly
> >>
> >> --> "You are not authorized to access bug #494404. "
> >
> > please try it now, the bug should be accessible now
> >
>
> Thanks, this doc is indeed nice :)
>
> But adding an write_lock()/write_unlock() in tcp_poll() was overkill
> It had an sm_mb() implied because of the nesting of locks.
>
> >>>
> >>> The race fires, when following code paths meet, and the tp->rcv_nxt and
> >>> __add_wait_queue updates stay in CPU caches.
> >>>
> >>> CPU1 CPU2
> >>>
> >>>
> >>> sys_select receive packet
> >>> ... ...
> >>> __add_wait_queue update tp->rcv_nxt
> >>> ... ...
> >>> tp->rcv_nxt check sock_def_readable
> >>> ... {
> >>> schedule ...
> >>> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >>> wake_up_interruptible(sk->sk_sleep)
> >>> ...
> >>> }
> >>>
> >>> If there were no cache the code would work ok, since the wait_queue and
> >>> rcv_nxt are opposit to each other.
> >>>
> >>> Meaning that once tp->rcv_nxt is updated by CPU2, the CPU1 either already
> >>> passed the tp->rcv_nxt check and sleeps, or will get the new value for
> >>> tp->rcv_nxt and will return with new data mask.
> >>> In both cases the process (CPU1) is being added to the wait queue, so the
> >>> waitqueue_active (CPU2) call cannot miss and will wake up CPU1.
> >>>
> >>> The bad case is when the __add_wait_queue changes done by CPU1 stay in its
> >>> cache , and so does the tp->rcv_nxt update on CPU2 side. The CPU1 will then
> >>> endup calling schedule and sleep forever if there are no more data on the
> >>> socket.
> >>>
> >>> Adding smp_mb() calls before sock_def_readable call and after __add_wait_queue
> >>> should prevent the above bad scenario.
> >>>
> >>> The upstream patch is attached. It seems to prevent the issue.
> >>>
> >>>
> >>>
> >>> CPU BUGS
> >>> ========
> >>>
> >>> The customer has been able to reproduce this problem only on one CPU model:
> >>> Xeon E5345*2. They didn't reproduce on XEON MV, for example.
> >> Is there an easy way to reproduce the problem ?
> >
> > there's a reproducer attached to the bug
> >
> > https://enterprise.redhat.com/issue-tracker/?module=download&fid=201560&key=f6f87caf6ac2dc1eb1173257c8a5ef78
> >
> > it is basically the client/server program.
> > They're passing messages to each other. When a message is sent,
> > both of them expect message on the input before sending another message.
> >
> > Very rarely the code hits the place when the process which called select
> > is not woken up by incomming data. Probably because of the memory cache
> > incoherency. See backtrace in the
> >
> > https://bugzilla.redhat.com/show_bug.cgi?id=494404#c1
> >
> >
> >>> That CPU model happens to have 2 possible issues, that might cause the issue:
> >>> (see errata http://www.intel.com/Assets/PDF/specupdate/315338.pdf)
> >>>
> >>> AJ39 and AJ18. The first one can be workarounded by BIOS upgrade,
> >>> the other one has following notes:
> >> AJ18 only matters on unaligned accesses, tcp code doesnt do this.
> >>
> >>> Software should ensure at least one of the following is true when
> >>> modifying shared data by multiple agents:
> >>> â The shared data is aligned
> >>> â Proper semaphores or barriers are used in order to
> >>> prevent concurrent data accesses.
> >>>
> >>>
> >>>
> >>> RFC
> >>> ===
> >>>
> >>> I'm aware that not having this issue reproduced on upstream lowers the odds
> >>> having this checked in. However AFAICS the issue is present. I'd appreciate
> >>> any comment/ideas.
> >>>
> >>>
> >>> thanks,
> >>> jirka
> >>>
> >>>
> >>> Signed-off-by: Jiri Olsa <jolsa@xxxxxxxxxx>
> >>>
> >>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> >>> index 17b89c5..f5d9dbf 100644
> >>> --- a/net/ipv4/tcp.c
> >>> +++ b/net/ipv4/tcp.c
> >>> @@ -340,6 +340,11 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
> >>> struct tcp_sock *tp = tcp_sk(sk);
> >>>
> >>> poll_wait(file, sk->sk_sleep, wait);
> >> poll_wait() calls add_wait_queue() which contains a
> >> spin_lock_irqsave()/spin_unlock_irqrestore() pair
> >>
> >> Documentation/memory-barriers.txt states in line 1123 :
> >>
> >> Memory operations issued after the LOCK will be completed after the LOCK
> >> operation has completed.
> >>
> >> and line 1131 states :
> >>
> >> Memory operations issued before the UNLOCK will be completed before the
> >> UNLOCK operation has completed.
> >>
> >> So yes, there is no full smp_mb() in poll_wait()
> >>
> >>> +
> >>> + /* Get in sync with tcp_data_queue, tcp_urg
> >>> + and tcp_rcv_established function. */
> >>> + smp_mb();
> >> If this barrier is really necessary, I guess it should be done in poll_wait() itself.
> >>
> >> Documentation/memory-barriers.txt misses some information about poll_wait()
> >>
> >>
> >>
> >>
> >>> +
> >>> if (sk->sk_state == TCP_LISTEN)
> >>> return inet_csk_listen_poll(sk);
> >>>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index 2bdb0da..0606e5e 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -4362,8 +4362,11 @@ queue_and_out:
> >>>
> >>> if (eaten > 0)
> >>> __kfree_skb(skb);
> >>> - else if (!sock_flag(sk, SOCK_DEAD))
> >>> + else if (!sock_flag(sk, SOCK_DEAD)) {
> >>> + /* Get in sync with tcp_poll function. */
> >>> + smp_mb();
> >>> sk->sk_data_ready(sk, 0);
> >>> + }
> >>> return;
> >>>
> >> Oh well... if smp_mb() is needed, I believe it should be done
> >> right before "if (waitqueue_active(sk->sk_sleep) ... "
> >>
> >> read_lock(&sk->sk_callback_lock);
> >> + smp_mb();
> >> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> >> wake_up_interruptible(sk->sk_sleep)
> >>
> >> It would match other parts in kernel (see fs/splice.c, fs/aio.c, ...)
> >>
> >> Strange thing is that read_lock() on x86 is a full memory barrier, as it uses
> >> "lock subl $0x1,(%eax)"
> >>
> >> Maybe we could define a smp_mb_after_read_lock() (a compiler barrier() on x86)
> >>
> >
> > First version of the patch was actually in this layer, see
> > https://bugzilla.redhat.com/attachment.cgi?id=345886
> >
> > I was adviced this could be to invasive (it was in waitqueue_active actually),
> > so I moved the change to the TCP layer itself...
> >
> > As far as I understand the problem there's need for 2 barriers to be
> > sure, the memory will have correct data. One in the codepath calling the
> > select (tcp_poll), and in the other one updating the available data status
> > (sock_def_readable), am I missing smth?
> >
>
> Hmm, I am not saying your patch doesnt fix the problem, I am saying it
> is a partial fix of a general problem. We might have same problem(s) in other
> parts of network stack. This is a very serious issue.
>
> Point 1 :
>
> You added an smp_mb() call in tcp_poll(). This one looks fine to solve
> the problem for tcp sockets. What about other protocols ? Do we have
> same problem ?

Looks like most of the protocols using the poll_wait. Also I assume
that most of them will probably have the same scenario as the one
described (CPU1 and CPU2 codepaths in the RACE DESCRIPTION).

So I moved the poll smp_mb() call to the __pollwait function, plz
check the attached diff. This might be too invasive, so another
place could be probably polling callbacks themselfs like
datagram_poll (used very often by protocols), tcp_poll, udp_poll...

I'm still looking which way would be more suitable, comments are very
welcome :)

>
> Point 2 :
>
> You added several smp_mb() calls in tcp input path. In my first
> reply, I said it was probably better to add smp_mb() in a single
> place, right before "if (waitqueue_active(sk->sk_sleep) ... ",
> but in all paths (input path & output path).
>
> Point 3 :
>
> The optimization we could do, defining
> a smp_mb_after_read_lock() that could be a nop on x86
>
> read_lock(&sk->sk_callback_lock); // "lock subl $0x1,(%eax)" on x86
> smp_mb_after_read_lock(); /* compiler barrier() on x86 */
> if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
> wake_up_interruptible(sk->sk_sleep);
>
> Am I missing something ?
>
> ;)
>

not at all :) I'm the one behind..

Anyway I made modifications based on Point 2) and 3) and the diff is
attached, please check.

thanks a lot,
jirka


diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index b7e5db8..570c0ff 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -302,4 +302,7 @@ static inline void __raw_write_unlock(raw_rwlock_t *rw)
#define _raw_read_relax(lock) cpu_relax()
#define _raw_write_relax(lock) cpu_relax()

+/* The read_lock() on x86 is a full memory barrier. */
+#define smp_mb__after_read_lock() barrier()
+
#endif /* _ASM_X86_SPINLOCK_H */
diff --git a/fs/select.c b/fs/select.c
index d870237..f9ebd45 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -219,6 +219,10 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
init_waitqueue_func_entry(&entry->wait, pollwake);
entry->wait.private = pwq;
add_wait_queue(wait_address, &entry->wait);
+
+ /* This memory barrier is paired with the smp_mb__after_read_lock
+ * in the sock_def_readable. */
+ smp_mb();
}

int poll_schedule_timeout(struct poll_wqueues *pwq, int state,
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 252b245..dd28726 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -132,6 +132,11 @@ do { \
#endif /*__raw_spin_is_contended*/
#endif

+/* The read_lock does not imply full memory barrier. */
+#ifndef smp_mb__after_read_lock
+#define smp_mb__after_read_lock() smp_mb()
+#endif
+
/**
* spin_unlock_wait - wait until the spinlock gets unlocked
* @lock: the spinlock in question.
diff --git a/net/core/sock.c b/net/core/sock.c
index b0ba569..11e414f 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1732,6 +1732,7 @@ static void sock_def_error_report(struct sock *sk)
static void sock_def_readable(struct sock *sk, int len)
{
read_lock(&sk->sk_callback_lock);
+ smp_mb__after_read_lock();
if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
POLLRDNORM | POLLRDBAND);
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/