Re: [PATCH v2 3/3] tg3: Fix tx_pending checks for tg3_tso_bug

From: Prashant Sreedharan
Date: Fri Aug 22 2014 - 00:39:44 EST


On Thu, 2014-08-21 at 14:57 -0700, Benjamin Poirier wrote:
> In tg3_set_ringparam(), the tx_pending test to cover the cases where
> tg3_tso_bug() is entered has two problems
> 1) the check is only done for certain hardware whereas the workaround
> is now used more broadly. IOW, the check may not be performed when it
> is needed.
> 2) the check is too optimistic.
>
> For example, with a 5761 (SHORT_DMA_BUG), tg3_set_ringparam() skips over the
> "tx_pending <= (MAX_SKB_FRAGS * 3)" check because TSO_BUG is false. Even if it
> did do the check, with a full sized skb, frag_cnt_est = 135 but the check is
> for <= MAX_SKB_FRAGS * 3 (= 17 * 3 = 51). So the check is insufficient. This
> leads to the following situation: by setting, ex. tx_pending = 100, there can
> be an skb that triggers tg3_tso_bug() and that is large enough to cause
> tg3_tso_bug() to stop the queue even when it is empty. We then end up with a
> netdev watchdog transmit timeout.
>
> Given that 1) some of the conditions tested for in tg3_tx_frag_set() apply
> regardless of the chipset flags and that 2) it is difficult to estimate ahead
> of time the max possible number of frames that a large skb may be split into
> by gso, we instead take the approach of adjusting dev->gso_max_segs according
> to the requested tx_pending size.
>
> This puts us in the exceptional situation that a single skb that triggers
> tg3_tso_bug() may require the entire tx ring. Usually the tx queue is woken up
> when at least a quarter of it is available (TG3_TX_WAKEUP_THRESH) but that
> would be insufficient now. To avoid useless wakeups, the tx queue wake up
> threshold is made dynamic. Likewise, usually the tx queue is stopped as soon
> as an skb with max frags may overrun it. Since the skbs submitted from
> tg3_tso_bug() use a controlled number of descriptors, the tx queue stop
> threshold may be lowered.
>
> Signed-off-by: Benjamin Poirier <bpoirier@xxxxxxx>
> ---
> Changes v1->v2
> * in tg3_set_ringparam(), reduce gso_max_segs further to budget 3 descriptors
> per gso seg instead of only 1 as in v1
> * in tg3_tso_bug(), check that this estimation (3 desc/seg) holds, otherwise
> linearize some skbs as needed
> * in tg3_start_xmit(), make the queue stop threshold a parameter, for the
> reason explained in the commit description
>
> I was concerned that this last change, because of the extra call in the
> default xmit path, may impact performance so I performed an rr latency test
> but I did not measure a significant impact. That test was with default mtu and
> ring size.
>
> # perf stat -r10 -ad netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr
>
> * without patches
> rr values: 7039.63 6865.03 6939.21 6919.31 6931.88 6932.74 6925.1 6953.33 6868.43 6935.65
> sample size: 10
> mean: 6931.031
> standard deviation: 48.10918
> quantiles: 6865.03 6920.757 6932.31 6938.32 7039.63
> 6930Â50
>
> Performance counter stats for 'netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr' (10 runs):
>
> 480643.024723 task-clock # 8.001 CPUs utilized ( +- 0.00% ) [100.00%]
> 855,136 context-switches # 0.002 M/sec ( +- 0.23% ) [100.00%]
> 521 CPU-migrations # 0.000 M/sec ( +- 6.49% ) [100.00%]
> 104 page-faults # 0.000 M/sec ( +- 2.73% )
> 298,416,906,437 cycles # 0.621 GHz ( +- 4.08% ) [15.01%]
> 812,072,320,370 stalled-cycles-frontend # 272.13% frontend cycles idle ( +- 1.89% ) [25.01%]
> 685,633,562,247 stalled-cycles-backend # 229.76% backend cycles idle ( +- 2.50% ) [35.00%]
> 117,665,891,888 instructions # 0.39 insns per cycle
> # 6.90 stalled cycles per insn ( +- 2.22% ) [45.00%]
> 26,158,399,505 branches # 54.424 M/sec ( +- 2.10% ) [50.00%]
> 205,688,614 branch-misses # 0.79% of all branches ( +- 0.78% ) [50.00%]
> 27,882,474,171 L1-dcache-loads # 58.011 M/sec ( +- 1.98% ) [50.00%]
> 369,911,372 L1-dcache-load-misses # 1.33% of all L1-dcache hits ( +- 0.62% ) [50.00%]
> 76,240,847 LLC-loads # 0.159 M/sec ( +- 1.04% ) [40.00%]
> 3,220 LLC-load-misses # 0.00% of all LL-cache hits ( +- 19.49% ) [ 5.00%]
>
> 60.074059340 seconds time elapsed ( +- 0.00% )
>
> * with patches
> rr values: 6732.65 6920.1 6909.46 7032.41 6864.43 6897.6 6815.19 6967.83 6849.23 6929.52
> sample size: 10
> mean: 6891.842
> standard deviation: 82.91901
> quantiles: 6732.65 6853.03 6903.53 6927.165 7032.41
> 6890Â80
>
> Performance counter stats for 'netperf -H 192.168.9.30 -l60 -T 0,0 -t omni -- -d rr' (10 runs):
>
> 480675.949728 task-clock # 8.001 CPUs utilized ( +- 0.01% ) [100.00%]
> 850,461 context-switches # 0.002 M/sec ( +- 0.37% ) [100.00%]
> 564 CPU-migrations # 0.000 M/sec ( +- 5.67% ) [100.00%]
> 417 page-faults # 0.000 M/sec ( +- 76.04% )
> 287,019,442,295 cycles # 0.597 GHz ( +- 7.16% ) [15.01%]
> 828,198,830,689 stalled-cycles-frontend # 288.55% frontend cycles idle ( +- 3.01% ) [25.01%]
> 718,230,307,166 stalled-cycles-backend # 250.24% backend cycles idle ( +- 3.53% ) [35.00%]
> 117,976,598,188 instructions # 0.41 insns per cycle
> # 7.02 stalled cycles per insn ( +- 4.06% ) [45.00%]
> 26,715,853,108 branches # 55.580 M/sec ( +- 3.77% ) [50.00%]
> 198,787,673 branch-misses # 0.74% of all branches ( +- 0.86% ) [50.00%]
> 28,416,922,166 L1-dcache-loads # 59.119 M/sec ( +- 3.54% ) [50.00%]
> 367,613,007 L1-dcache-load-misses # 1.29% of all L1-dcache hits ( +- 0.47% ) [50.00%]
> 75,260,575 LLC-loads # 0.157 M/sec ( +- 2.24% ) [40.00%]
> 5,777 LLC-load-misses # 0.01% of all LL-cache hits ( +- 36.03% ) [ 5.00%]
>
> 60.077898757 seconds time elapsed ( +- 0.01% )
>
> I reproduced this bug using the same approach explained in patch 1.
> The bug reproduces with tx_pending <= 135
>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 67 +++++++++++++++++++++++++++++--------
> drivers/net/ethernet/broadcom/tg3.h | 1 +
> 2 files changed, 54 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 0cecd6d..c29f2e3 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -204,6 +204,10 @@ static inline void _tg3_flag_clear(enum TG3_FLAGS flag, unsigned long *bits)
> /* minimum number of free TX descriptors required to wake up TX process */
> #define TG3_TX_WAKEUP_THRESH(tnapi) max_t(u32, (tnapi)->tx_pending / 4, \
> MAX_SKB_FRAGS + 1)
> +/* estimate a certain number of descriptors per gso segment */
> +#define TG3_TX_DESC_PER_SEG(seg_nb) ((seg_nb) * 3)
> +#define TG3_TX_SEG_PER_DESC(desc_nb) ((desc_nb) / 3)
> +
> #define TG3_TX_BD_DMA_MAX_2K 2048
> #define TG3_TX_BD_DMA_MAX_4K 4096
>
> @@ -6609,10 +6613,10 @@ static void tg3_tx(struct tg3_napi *tnapi)
> smp_mb();
>
> if (unlikely(netif_tx_queue_stopped(txq) &&
> - (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi)))) {
> + (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh))) {
> __netif_tx_lock(txq, smp_processor_id());
> if (netif_tx_queue_stopped(txq) &&
> - (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi)))
> + (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh))
> netif_tx_wake_queue(txq);
> __netif_tx_unlock(txq);
> }
> @@ -7830,6 +7834,8 @@ static int tigon3_dma_hwbug_workaround(struct tg3_napi *tnapi,
> }
>
> static netdev_tx_t tg3_start_xmit(struct sk_buff *, struct net_device *);
> +static netdev_tx_t __tg3_start_xmit(struct sk_buff *, struct net_device *,
> + u32);
>
> /* Use GSO to workaround all TSO packets that meet HW bug conditions
> * indicated in tg3_tx_frag_set()
> @@ -7838,11 +7844,13 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
> struct netdev_queue *txq, struct sk_buff *skb)
> {
> struct sk_buff *segs, *nskb;
> - u32 frag_cnt_est = skb_shinfo(skb)->gso_segs * 3;
> + unsigned int segs_remaining = skb_shinfo(skb)->gso_segs;
> + u32 desc_cnt_est = TG3_TX_DESC_PER_SEG(segs_remaining);
>
> - /* Estimate the number of fragments in the worst case */
> - if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
> + if (unlikely(tg3_tx_avail(tnapi) <= desc_cnt_est)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = desc_cnt_est;
> + BUG_ON(tnapi->wakeup_thresh >= tnapi->tx_pending);
>
> /* netif_tx_stop_queue() must be done before checking
> * checking tx index in tg3_tx_avail() below, because in
> @@ -7850,7 +7858,7 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
> * netif_tx_queue_stopped().
> */
> smp_mb();
> - if (tg3_tx_avail(tnapi) <= frag_cnt_est)
> + if (tg3_tx_avail(tnapi) <= tnapi->wakeup_thresh)
> return NETDEV_TX_BUSY;
>
> netif_tx_wake_queue(txq);
> @@ -7858,14 +7866,33 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
>
> segs = skb_gso_segment(skb, tp->dev->features &
> ~(NETIF_F_TSO | NETIF_F_TSO6));
> - if (IS_ERR(segs) || !segs)
> + if (IS_ERR_OR_NULL(segs))
> goto tg3_tso_bug_end;
>
> do {
> + unsigned int desc_cnt = skb_shinfo(segs)->nr_frags + 1;
> +
> nskb = segs;
> segs = segs->next;
> nskb->next = NULL;
> - tg3_start_xmit(nskb, tp->dev);
> +
> + if (tg3_tx_avail(tnapi) <= segs_remaining - 1 + desc_cnt &&
> + skb_linearize(nskb)) {
> + nskb->next = segs;
> + segs = nskb;
> + do {
> + nskb = segs->next;
> +
> + dev_kfree_skb_any(segs);
> + segs = nskb;
> + } while (segs);
If skb_linearize() fails need to increment the tp->tx_dropped count
> + goto tg3_tso_bug_end;
> + }
> + segs_remaining--;
> + if (segs_remaining)
> + __tg3_start_xmit(nskb, tp->dev, segs_remaining);
To clarify passing segs_remaining will make sure the queue is never
stopped correct ?
> + else
> + tg3_start_xmit(nskb, tp->dev);
> } while (segs);
>
> tg3_tso_bug_end:
> @@ -7877,6 +7904,12 @@ tg3_tso_bug_end:
> /* hard_start_xmit for all devices */
> static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> + return __tg3_start_xmit(skb, dev, MAX_SKB_FRAGS + 1);
> +}
> +
> +static netdev_tx_t __tg3_start_xmit(struct sk_buff *skb,
> + struct net_device *dev, u32 stop_thresh)
> +{
> struct tg3 *tp = netdev_priv(dev);
> u32 len, entry, base_flags, mss, vlan = 0;
> u32 budget;
> @@ -7905,12 +7938,17 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> if (unlikely(budget <= (skb_shinfo(skb)->nr_frags + 1))) {
> if (!netif_tx_queue_stopped(txq)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = TG3_TX_WAKEUP_THRESH(tnapi);
>
> /* This is a hard error, log it. */
> netdev_err(dev,
> "BUG! Tx Ring full when queue awake!\n");
> }
> - return NETDEV_TX_BUSY;
> + smp_mb();
> + if (tg3_tx_avail(tnapi) <= tnapi->wakeup_thresh)
> + return NETDEV_TX_BUSY;
> +
> + netif_tx_wake_queue(txq);
> }
>
> entry = tnapi->tx_prod;
> @@ -8087,8 +8125,9 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> tw32_tx_mbox(tnapi->prodmbox, entry);
>
> tnapi->tx_prod = entry;
> - if (unlikely(tg3_tx_avail(tnapi) <= (MAX_SKB_FRAGS + 1))) {
> + if (unlikely(tg3_tx_avail(tnapi) <= stop_thresh)) {
> netif_tx_stop_queue(txq);
> + tnapi->wakeup_thresh = TG3_TX_WAKEUP_THRESH(tnapi);
>
> /* netif_tx_stop_queue() must be done before checking
> * checking tx index in tg3_tx_avail() below, because in
> @@ -8096,7 +8135,7 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
> * netif_tx_queue_stopped().
> */
> smp_mb();
> - if (tg3_tx_avail(tnapi) > TG3_TX_WAKEUP_THRESH(tnapi))
> + if (tg3_tx_avail(tnapi) > tnapi->wakeup_thresh)
> netif_tx_wake_queue(txq);
> }
>
> @@ -12319,9 +12358,7 @@ static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam *e
> if ((ering->rx_pending > tp->rx_std_ring_mask) ||
> (ering->rx_jumbo_pending > tp->rx_jmb_ring_mask) ||
> (ering->tx_pending > TG3_TX_RING_SIZE - 1) ||
> - (ering->tx_pending <= MAX_SKB_FRAGS + 1) ||
> - (tg3_flag(tp, TSO_BUG) &&
> - (ering->tx_pending <= (MAX_SKB_FRAGS * 3))))
> + (ering->tx_pending <= MAX_SKB_FRAGS + 1))
> return -EINVAL;
>
> if (netif_running(dev)) {
> @@ -12341,6 +12378,7 @@ static int tg3_set_ringparam(struct net_device *dev, struct ethtool_ringparam *e
> if (tg3_flag(tp, JUMBO_RING_ENABLE))
> tp->rx_jumbo_pending = ering->rx_jumbo_pending;
>
> + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(ering->tx_pending - 1);
> for (i = 0; i < tp->irq_max; i++)
> tp->napi[i].tx_pending = ering->tx_pending;
>
> @@ -17817,6 +17855,7 @@ static int tg3_init_one(struct pci_dev *pdev,
> else
> sndmbx += 0xc;
> }
> + dev->gso_max_segs = TG3_TX_SEG_PER_DESC(TG3_DEF_TX_RING_PENDING - 1);
>
> tg3_init_coal(tp);
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
> index 461acca..6a7e13d 100644
> --- a/drivers/net/ethernet/broadcom/tg3.h
> +++ b/drivers/net/ethernet/broadcom/tg3.h
> @@ -3006,6 +3006,7 @@ struct tg3_napi {
> u32 tx_pending;
> u32 last_tx_cons;
> u32 prodmbox;
> + u32 wakeup_thresh;
> struct tg3_tx_buffer_desc *tx_ring;
> struct tg3_tx_ring_info *tx_buffers;
>


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/