From: Tejun Heo
Date: Sat Oct 19 2019 - 12:10:57 EST

sk_page_frag() optimizes skb_frag allocations by using the per-task
page_frag cache when it knows it's the only user. The condition is
determined by checking whether the socket allocation mask allows
blocking - if the allocation may block, the caller obviously owns the
task's context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through the memory reclaim path.
Please take a look at the following backtrace.

 RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10

In the outer call, tcp_sendmsg_locked() was using current->task_frag
when it called sk_wmem_schedule(). It had already calculated how many
bytes fit into current->task_frag. Due to memory pressure,
sk_wmem_schedule() called into the memory reclaim path, which called
into xfs and then the IO issue path. Because the filesystem in question
is backed by nbd, control goes back into the tcp layer - back into
tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC), which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress. However, this
confused sk_page_frag() called from the nested call. Because it only
tests whether the allocation allows blocking, which it does, it now
thinks current->task_frag can be used again although it was already
being used in the outer call.

After the nested call used current->task_frag, the offset would be
increased by the used amount. When control returns to the outer call,
current->task_frag's offset has been increased and the previously
calculated number of bytes may now overrun the end of the allocated
memory, leading to silent memory corruption.

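The overrun arithmetic described above can be sketched in userspace.
This is only an illustrative model, not kernel code: struct frag_model,
frag_room(), nested_send() and outer_send_end() are hypothetical names,
and the sizes used are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct page_frag; only the
 * fields needed for the arithmetic are modeled. */
struct frag_model {
	size_t size;	/* bytes in the backing allocation */
	size_t offset;	/* next free byte */
};

/* Bytes still available in the fragment. */
static size_t frag_room(const struct frag_model *f)
{
	assert(f->offset <= f->size);
	return f->size - f->offset;
}

/* Models the nested send: consumes @len bytes from the same fragment. */
static void nested_send(struct frag_model *f, size_t len)
{
	f->offset += len;
}

/*
 * Models the outer send.  It sizes the copy first, then "reclaims",
 * which re-enters the send path when @nested_len is non-zero.
 * Returns the end offset that the stale copy length would reach.
 */
static size_t outer_send_end(struct frag_model *f, size_t nested_len)
{
	size_t copy = frag_room(f);	/* computed before reclaim */

	if (nested_len)
		nested_send(f, nested_len);	/* recursion via reclaim */

	return f->offset + copy;	/* stale @copy applied to new offset */
}
```

With a 4096-byte fragment at offset 1024, outer_send_end() with no
nested send ends exactly at 4096. With a 512-byte nested send it ends
at 4608, i.e. 512 bytes past the end of the allocation.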
Fix it by updating sk_page_frag() to test __GFP_MEMALLOC and not use
current->task_frag if set.
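
As a rough userspace sketch of the selection rule before and after this
change: the MODEL_* bit values, enum frag_choice, choose_old() and
choose_new() are made up for illustration and do not match the kernel's
real gfp_t encoding. Only the shape of the test mirrors the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit values for illustration; the kernel's real gfp_t
 * bits differ. */
#define MODEL_DIRECT_RECLAIM	0x1u	/* stands in for __GFP_DIRECT_RECLAIM */
#define MODEL_MEMALLOC		0x2u	/* stands in for __GFP_MEMALLOC */

enum frag_choice { USE_TASK_FRAG, USE_SOCK_FRAG };

/* Models gfpflags_allow_blocking(): blocking is allowed when direct
 * reclaim is permitted. */
static bool model_allow_blocking(unsigned int gfp)
{
	return (gfp & MODEL_DIRECT_RECLAIM) != 0;
}

/* Rule before the patch: blocking alone selects the task frag. */
static enum frag_choice choose_old(unsigned int sk_allocation)
{
	if (model_allow_blocking(sk_allocation))
		return USE_TASK_FRAG;
	return USE_SOCK_FRAG;
}

/* Rule after the patch: a socket marked as being used for memory
 * reclaim falls back to the per-socket frag. */
static enum frag_choice choose_new(unsigned int sk_allocation)
{
	if (model_allow_blocking(sk_allocation) &&
	    !(sk_allocation & MODEL_MEMALLOC))
		return USE_TASK_FRAG;
	return USE_SOCK_FRAG;
}
```

An nbd-style mask (blocking allowed, MEMALLOC set) picks the task frag
under choose_old() and the socket frag under choose_new(). An ordinary
blocking mask still picks the task frag in both.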
Signed-off-by: Tejun Heo <tj@xxxxxxxxxx>
Cc: Josef Bacik <josef@xxxxxxxxxxxxxx>
include/net/sock.h | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 2c53f1a1d905..4e2ca38acc3c 100644
@@ -2233,12 +2233,21 @@ struct sk_buff *sk_stream_alloc_skb(struct sock *sk, int size, gfp_t gfp,
  * sk_page_frag - return an appropriate page_frag
  * @sk: socket
- * If socket allocation mode allows current thread to sleep, it means its
- * safe to use the per task page_frag instead of the per socket one.
+ * Use the per task page_frag instead of the per socket one for
+ * optimization when we know there can be no other users.
+ * 1. The socket allocation mode allows current thread to sleep. This is
+ *    the sleepable context which owns the task page_frag.
+ * 2. The socket allocation mode doesn't indicate that the socket is being
+ *    used to reclaim memory. Memory reclaim may nest inside other socket
+ *    operations and end up recursing into sk_page_frag() while it's
+ *    already in use.
 static inline struct page_frag *sk_page_frag(struct sock *sk)
-	if (gfpflags_allow_blocking(sk->sk_allocation))
+	if (gfpflags_allow_blocking(sk->sk_allocation) &&
+	    !(sk->sk_allocation & __GFP_MEMALLOC))