Re: [PATCH net] net/smc: use the correct ndev to find pnetid by pnetid table

From: Guangguan Wang
Date: Fri Jan 10 2025 - 00:44:14 EST




On 2025/1/9 11:04, Halil Pasic wrote:
> On Wed, 8 Jan 2025 12:57:00 +0800
> Guangguan Wang <guangguan.wang@xxxxxxxxxxxxxxxxx> wrote:
>
>>> sorry for chiming in late. Wenjia is on vacation and Jan is out sick!
>>> After some reading and thinking I could not figure out how 890a2cb4a966
>>> ("net/smc: rework pnet table") is broken.
>>
>> Before commit 890a2cb4a966:
>> smc_pnet_find_roce_resource
>> smc_pnet_find_roce_by_pnetid(ndev, ...) /* lookup via hardware-defined pnetid */
>> smc_pnetid_by_dev_port(base_ndev, ...)
>> smc_pnet_find_roce_by_table(ndev, ...) /* lookup via SMC PNET table */
>> {
>> ...
>> list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) {
>> if (ndev == pnetelem->ndev) { /* notice here, it was ndev to matching pnetid element in pnet table */
>> ...
>> }
>>
>> After commit 890a2cb4a966:
>> smc_pnet_find_roce_resource
>> smc_pnet_find_roce_by_pnetid
>> {
>> ...
>> base_ndev = pnet_find_base_ndev(ndev); /* rename the variable name to base_ndev for better understanding */
>> smc_pnetid_by_dev_port(base_ndev, ...)
>> smc_pnet_find_ndev_pnetid_by_table(base_ndev, ...)
>> {
>> ...
>> list_for_each_entry(pnetelem, &smc_pnettable.pnetlist, list) {
>> if (base_ndev == pnetelem->ndev) { /* notice here, it is base_ndev to matching pnetid element in pnet table */
>> ...
>> }
>>
>> }
>>
>> The commit 890a2cb4a966 has changed ndev to base_ndev when matching pnetid element in pnet table.
>> But in the function smc_pnet_add_eth, the pnetid is attached to the ndev itself, not the base_ndev.
>> smc_pnet_add_eth(...)
>> {
>> ...
>> ndev = dev_get_by_name(net, eth_name);
>> ...
>> if (new_netdev) {
>> if (ndev) {
>> new_pe->ndev = ndev;
>> netdev_tracker_alloc(ndev, &new_pe->dev_tracker,
>> GFP_ATOMIC);
>> }
>> list_add_tail(&new_pe->list, &pnettable->pnetlist);
>> mutex_unlock(&pnettable->lock);
>> } else {
>> ...
>> }
>
> I still not understand why do you think that 890a2cb4a966~1 is better
> than 890a2cb4a966 even if things changed with 890a2cb4a966 which
> I did not verify for myself but am willing to assume.
>
> Is there some particular setup that you think would benefit from
> you patch? I.e. going back to the 890a2cb4a966~1 behavior I suppose.
>

We want to use SMC in container on cloud environment, and encounter problem
when using smc_pnet with commit 890a2cb4a966. In container, there have choices
of different container network, such as directly using host network, virtual
network IPVLAN, veth, etc. Different choices of container network have different
netdev hierarchy. Examples of netdev hierarchy show below. (eth0 and eth1 in host
below is the netdev directly related to the physical device).
_______________________________ ________________________________
| _________________ | | _________________ |
| |POD | | | |POD __________ | |
| | | | | | |upper_ndev| | |
| | eth0_________ | | | |eth0|__________| | |
| |____| |__| | | |_______|_________| |
| | | | | |lower netdev |
| | | | | __|______ |
| eth1|base_ndev| eth0_______ | | eth1| | eth0_______ |
| | | | RDMA || | |base_ndev| | RDMA ||
| host |_________| |_______|| | host |_________| |_______||
———————————————————————————————- ———————————————————————————————-
netdev hierarchy if directly netdev hierarchy if using IPVLAN
using host network
_______________________________
| _____________________ |
| |POD _________ | |
| | |base_ndev|| |
| |eth0(veth)|_________|| |
| |____________|________| |
| |pairs |
| _______|_ |
| | | eth0_______ |
| veth|base_ndev| | RDMA ||
| |_________| |_______||
| _________ |
| eth1|base_ndev| |
| host |_________| |
———————————————————————————————
netdev hierarchy if using veth

Due to some reasons, the eth1 in host is not RDMA attached netdevice, pnetid
is needed to map the eth1(in host) with RDMA device so that POD can do SMC-R.
Because the eth1(in host) is managed by CNI plugin(such as Terway, network
management plugin in container environment), and in cloud environment the
eth(in host) can dynamically be inserted by CNI when POD create and dynamically
be removed by CNI when POD destroy and no POD related to the eth(in host)
anymore. It is hard for us to config the pnetid to the eth1(in host). So we
config the pnetid to the netdevice which can be seen in POD. When do SMC-R, both
the container directly using host network and the container using veth network
can successfully match the RDMA device, because the configured pnetid netdev is a
base_ndev. But the container using IPVLAN can not successfully match the RDMA
device and 0x03030000 fallback happens, because the configured pnetid netdev is
not a base_ndev. Additionally, if config pnetid to the eth1(in host) also can not
work for matching RDMA device when using veth network and doing SMC-R in POD.

My patch can resolve the problem we encountered and also can unify the pnetid setup
of different network choices list above, assuming the pnetid is not limited to
config to the base_ndev directly related to the physical device(indeed, the current
implementation has not limited it yet).

> I think I showed a valid and practical setup that would break with your
> patch as is. Do you agree with that statement?
Did you mean
"
Now for something like a bond of two OSA
interfaces, I would expect the two legs of the bond to probably have a
"HW PNETID", but the netdev representing the bond itself won't have one
unless the Linux admin defines a software PNETID, which is work, and
can't have a HW PNETID because it is a software construct within Linux.
Breaking for example an active-backup bond setup where the legs have
HW PNETIDs and the admin did not bother to specify a PNETID for the bond
is not acceptable.
" ?
If the legs have HW pnetids, add pnetid to bond netdev will fail as
smc_pnet_add_eth will check whether the base_ndev already have HW pnetid.

If the legs without HW pnetids, and admin add pnetids to legs through smc_pnet.
Yes, my patch will break the setup. What Paolo suggests(both checking ndev and
base_ndev, and replace || by && )can help compatible with the setup.


Thanks,
Guangguan Wang
>
> Regards,
> Halil