Re: [PATCH v2] af_packet: fix soft lockup issue caused by tpacket_snd()

From: luyun
Date: Tue Jul 08 2025 - 09:24:12 EST



在 2025/7/8 15:12, Eric Dumazet 写道:
On Mon, Jul 7, 2025 at 7:06 PM Yun Lu <luyun_611@xxxxxxx> wrote:
From: Yun Lu <luyun@xxxxxxxxxx>

When MSG_DONTWAIT is not set, the tpacket_snd operation will wait for
pending_refcnt to decrement to zero before returning. The pending_refcnt
is decremented by 1 when the skb->destructor function is called,
indicating that the skb has been successfully sent and needs to be
destroyed.

If an error occurs during this process, the tpacket_snd() function will
exit and return error, but pending_refcnt may not yet have decremented to
zero. Assuming the next send operation is executed immediately, but there
are no available frames to be sent in tx_ring (i.e., packet_current_frame
returns NULL), and skb is also NULL, the function will not execute
wait_for_completion_interruptible_timeout() to yield the CPU. Instead, it
will enter a do-while loop, waiting for pending_refcnt to be zero. Even
if the previous skb has completed transmission, the skb->destructor
function can only be invoked in the ksoftirqd thread (assuming NAPI
threading is enabled). When both the ksoftirqd thread and the tpacket_snd
operation happen to run on the same CPU, and the CPU trapped in the
do-while loop without yielding, the ksoftirqd thread will not get
scheduled to run. As a result, pending_refcnt will never be reduced to
zero, and the do-while loop cannot exit, eventually leading to a CPU soft
lockup issue.

In fact, as long as pending_refcnt is not zero, even if skb is NULL,
wait_for_completion_interruptible_timeout() should be executed to yield
the CPU, allowing the ksoftirqd thread to be scheduled. Therefore, the
execution condition of this function should be modified to check if
pending_refcnt is not zero.

Fixes: 89ed5b519004 ("af_packet: Block execution of tasks waiting for transmit to complete in AF_PACKET")
Suggested-by: LongJun Tang <tanglongjun@xxxxxxxxxx>
Signed-off-by: Yun Lu <luyun@xxxxxxxxxx>

---
Changes in v2:
- Add a Fixes tag.
- Link to v1: https://lore.kernel.org/all/20250707081629.10344-1-luyun_611@xxxxxxx/
---
net/packet/af_packet.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 3d43f3eae759..7df96311adb8 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2845,7 +2845,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
ph = packet_current_frame(po, &po->tx_ring,
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
- if (need_wait && skb) {
+ if (need_wait && packet_read_pending(&po->tx_ring)) {
timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
if (timeo <= 0) {
packet_read_pending() is super expensive on hosts with 256 cpus (or more)

Yeah, the CPU is exactly stuck on packet_read_pending() when soft lockup occurs.



We are going to call it a second time at the end of the block:

do { ...
} while (ph != NULL || (need_wait && packet_read_pending(&po->tx_ring)...

Perhaps we can remove the second one ?

The first call to packet_read_pending() is only needed when skb is NULL (i.e., at the start of the loop),  to determine whether wait_for_completion_interruptible_timeout() should be executed for waiting.

If the first call to packet_read_pending() has already returned 0, then the second call at the end of the block can also be omitted.

So, the code might be modified as shown below:


diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 7df96311adb8..15a37209f872 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,7 +2785,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
        int len_sum = 0;
        int status = TP_STATUS_AVAILABLE;
        int hlen, tlen, copylen = 0;
-       long timeo = 0;
+       long timeo;
+       bool pending = true;

        mutex_lock(&po->pg_vec_lock);

@@ -2839,18 +2840,22 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
        if ((size_max > dev->mtu + reserve + VLAN_HLEN) && !vnet_hdr_sz)
                size_max = dev->mtu + reserve + VLAN_HLEN;

+       timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
        reinit_completion(&po->skb_completion);

        do {
                ph = packet_current_frame(po, &po->tx_ring,
                                          TP_STATUS_SEND_REQUEST);
                if (unlikely(ph == NULL)) {
-                       if (need_wait && packet_read_pending(&po->tx_ring)) {
-                               timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
-                               timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
-                               if (timeo <= 0) {
-                                       err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
-                                       goto out_put;
+                       if (need_wait) {
+                               if (skb == NULL)
+                                       pending = !!packet_read_pending(&po->tx_ring);
+                               if (skb || pending) {
+                                       timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
+                                       if (timeo <= 0) {
+                                               err = !timeo ? -ETIMEDOUT : -ERESTARTSYS;
+                                               goto out_put;
+                                       }
                                }
                        }
                        /* check for additional frames */
@@ -2950,7 +2955,7 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
                 * condition, and luckily don't have to go that path
                 * anyway.
                 */
-                (need_wait && packet_read_pending(&po->tx_ring))));
+                (need_wait && pending && packet_read_pending(&po->tx_ring))));

        err = len_sum;
        goto out_put;

Also I think there is another problem with the code.

We should call sock_sndtimeo() only once, otherwise SO_SNDTIMEO
constraint could be way off.

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index f6b1ff883c9318facdcb9c3112b94f0b6e40d504..486ade64bddfddb1af91968dbdf70015cfb93eb5
100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2785,8 +2785,9 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
int len_sum = 0;
int status = TP_STATUS_AVAILABLE;
int hlen, tlen, copylen = 0;
- long timeo = 0;
+ long timeo;

+ timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT);
mutex_lock(&po->pg_vec_lock);

/* packet_sendmsg() check on tx_ring.pg_vec was lockless,
@@ -2846,7 +2847,6 @@ static int tpacket_snd(struct packet_sock *po,
struct msghdr *msg)
TP_STATUS_SEND_REQUEST);
if (unlikely(ph == NULL)) {
if (need_wait && skb) {
- timeo = sock_sndtimeo(&po->sk,
msg->msg_flags & MSG_DONTWAIT);
timeo =
wait_for_completion_interruptible_timeout(&po->skb_completion, timeo);
if (timeo <= 0) {
err = !timeo ? -ETIMEDOUT :
-ERESTARTSYS;