Re: [PATCH net-next v3 6/6] net/ncsi: Configure multi-package, multi-channel modes with failover

From: Samuel Mendoza-Jonas
Date: Thu Nov 08 2018 - 21:17:37 EST


On Thu, 2018-11-08 at 22:48 +0000, Justin.Lee1@xxxxxxxx wrote:
> Hi Samuel,
>
> For multi-package and multi-channel case, channel seems to be select correctly. Expect that,
> I still see the timing issue for back-to-back netlink command. Due to that, channel might be
> set to invisible state. Please refer to ncsi0 and ncsi2 below. The channel state is set to 3.
>
> cat /sys/kernel/debug/ncsi_protocol/ncsi_device_status
> IFIDX IFNAME NAME PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> =====================================================================
> 2 eth2 ncsi0 000 000 1 1 1 1 1 0 0 3 0 1 1 1 0 1
> 2 eth2 ncsi1 000 001 0 0 1 1 1 0 0 1 0 1 1 1 0 1
> 2 eth2 ncsi2 001 000 1 0 1 1 1 1 0 3 0 1 1 1 0 1
> 2 eth2 ncsi3 001 001 1 0 1 1 1 1 0 2 1 1 1 1 0 1
> =====================================================================
> MP: Multi-mode Package WP: Whitelist Package
> MC: Multi-mode Channel WC: Whitelist Channel
> PC: Primary Channel CS: Channel State IA/A/IV 1/2/3
> PS: Poll Status LS: Link Status
> RU: Running CR: Carrier OK
> NQ: Queue Stopped HA: Hardware Arbitration
>
> The timing issue is not only happening in application. If I use using the following way
> to send the request, I can see the issue as well.
>
> ncsi_netlink -l 2 -a 0x01 -m; ncsi_netlink -l 2 -p 0 -b 0x03 -m; ncsi_netlink -l 2 -p 1 -b 0x00 -m;
> ncsi_netlink -l 2 -a 0x03 -m; ncsi_netlink -l 2 -p 0 -b 0x00 -m; ncsi_netlink -l 2 -p 1 -b 0x03 -m;

This actually recreates for me as well; I see now what you mean about
channels getting stuck in the invisible state. I believe I've narrowed
down the issue. I've pasted an additional patch below if you are able to
test on your machine.

>
>
> Also, there is one issue below for non-multi-package/non-multi-channel case.
>
> Thanks,
> Justin
>
>
> > @@ -1008,32 +1164,49 @@ static int ncsi_choose_active_channel(struct ncsi_dev_priv *ndp)
> >
> > ncm = &nc->modes[NCSI_MODE_LINK];
> > if (ncm->data[2] & 0x1) {
> > - spin_unlock_irqrestore(&nc->lock, flags);
> > found = nc;
> > - goto out;
> > + with_link = true;
> > }
> >
> > - spin_unlock_irqrestore(&nc->lock, flags);
> > + /* If multi_channel is enabled configure all valid
> > + * channels whether or not they currently have link
> > + * so they will have AENs enabled.
> > + */
> > + if (with_link || np->multi_channel) {
> > + spin_lock_irqsave(&ndp->lock, flags);
> > + list_add_tail_rcu(&nc->link,
> > + &ndp->channel_queue);
> > + spin_unlock_irqrestore(&ndp->lock, flags);
> > +
> > + netdev_dbg(ndp->ndev.dev,
> > + "NCSI: Channel %u added to queue (link %s)\n",
> > + nc->id,
> > + ncm->data[2] & 0x1 ? "up" : "down");
> > + }
> > +
> > + spin_unlock_irqrestore(&nc->lock, cflags);
> > +
> > + if (with_link && !np->multi_channel)
> > + break;
>
> The line needs to change to "goto found". If not, all channels with link will be added
> even if the multi-channel is not enabled for that package. The ncsi1 below is enabled.
> There is no netlink command sent to enable multi-package or multi-channel.
>
> IFIDX IFNAME NAME PID CID RX TX MP MC WP WC PC CS PS LS RU CR NQ HA
> =====================================================================
> 2 eth2 ncsi0 000 000 1 1 0 0 1 1 0 2 1 1 1 1 0 1
> 2 eth2 ncsi1 000 001 1 0 0 0 1 1 0 2 1 1 1 1 0 1
> 2 eth2 ncsi2 001 000 0 0 0 0 1 1 0 1 0 1 1 1 0 1
> 2 eth2 ncsi3 001 001 0 0 0 0 1 1 0 1 0 1 1 1 0 1
> =====================================================================
> MP: Multi-mode Package WP: Whitelist Package
> MC: Multi-mode Channel WC: Whitelist Channel
> PC: Primary Channel CS: Channel State IA/A/IV 1/2/3
> PS: Poll Status LS: Link Status
> RU: Running CR: Carrier OK
> NQ: Queue Stopped HA: Hardware Arbitration
>
> > }
> > + if (with_link && !ndp->multi_package)
> > + break;
> > }
>
> found:

This *may* be part of the above issue, I don't see this in normal
operation. The combination of (with_link && !np->multi_channel) and
(with_link && !ndp->multi_package) should prevent additional channels
being added without the need for 'goto found'. Please let me know if you
still see it with the extra patch.

>
> After applying this change, I notice that if there is no link available to BMC when BMC
> starts, NC-SI can't properly configure channel once I plug in the Ethernet cable.
>
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - pkg 0 ch 0 state up
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_aen_handler_lsc() - had_link 0, has_link 1, chained 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_stop_channel_monitor() - pkg 0 ch 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - pkg 0 ch 0 INVISIBLE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - suspending pkg 0 ch 0
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0400 select
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0403 dc
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0404 deselect
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0405 done
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_rsp_handler_dp() - pkg 0 ch 0 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_rsp_handler_dp() - pkg 0 ch 1 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_dev_work()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 state 0406 deselect
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_suspend_channel() - pkg 0 ch 0 INACTIVE
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel()
> npcm7xx-emc f0825000.eth eth2: NCSI: ncsi_process_next_channel() - No more channels to process
> npcm7xx-emc f0825000.eth eth2: NCSI interface down

Good find, there was a corner case in the LSC AEN handler changes that
led to this, I've fixed this in the patch as well. Thanks for testing!