Re: [PATCH net-next] netdevsim: Register and unregister devlink traps on probe/remove device

From: Leon Romanovsky
Date: Tue Oct 26 2021 - 12:15:06 EST


On Tue, Oct 26, 2021 at 05:09:47PM +0300, Ido Schimmel wrote:
> On Tue, Oct 26, 2021 at 10:18:12AM +0300, Leon Romanovsky wrote:
> > On Tue, Oct 26, 2021 at 09:51:13AM +0300, Ido Schimmel wrote:
> >
> > <...>
> >
> > > >
> > > > Can you please explain why is it so important to touch devlink SW
> > > > objects, reallocate them again and again on every reload in mlxsw?
> > >
> > > Because that's how reload was defined and implemented. A complete
> > > reload. We are not changing the semantics 4 years later.
> >
> > Please put your emotions aside and explain me technically why are you
> > must to do it?
>
> Already did. The current semantics are "devlink-reload provides
> mechanism to reinit driver entities, applying devlink-params and
> devlink-resources new values. It also provides mechanism to activate
> firmware."

Right, it doesn't mean that devlink should reregister itself.

>
> And this is exactly what netdevsim and mlxsw are doing. Driver entities
> are re-initialized. Your patch breaks that as entities are not
> re-initialized, which results in user space breakage. You simply cannot
> introduce such regressions.

Again, and again. I don't want to introduce regression, and I'll fix it.
However, let's try to reach a conclusion on how to fix the current regression
properly.

>
> >
> > The proposed semantics was broken for last 4 years, it can even seen as
> > dead on arrival,
>
> Again with the bombastic statements. It was "dead on arrival" like the
> notifications were "impossible"?

No, it was dead-on-arrival, because of deadlocks and kernel panics.

My search in internal bug tracker shows more than 500 bugs opened
by various teams where devlink reload is involved. It is hard to tell
which are pure devlink reload related, but when I started to work on
this series, close to 20 such bugs were assigned to me.

>
> > because it never worked for us in real production.
>
> Who is "us"? mlx5 that apparently decided to do its own thing?

Yes, mlx5. I don't see why you think that us not calling devlink
recursively means "own thing".

>
> We are using reload in mlxsw on a daily basis and users are using it to
> re-partition ASIC resources and activate firmware.

Are you really compare mlx5 deployment scale and broad of use with mlxsw?

> There are tests over netdevsim implementation that anyone can run for
> testing purposes. We also made sure to integrate it into syzkaller:
>
> https://github.com/google/syzkaller/commit/5b49e1f605a770e8f8fcdcbd1a8ff85591fc0c8e
> https://github.com/google/syzkaller/commit/04ca72cd45348daab9d896bbec8ea4c2d13455ac
> https://github.com/google/syzkaller/commit/6930bbef3b671ae21f74007f9e59efb9b236b93f
> https://github.com/google/syzkaller/commit/d45a4d69d83f40579e74fb561e1583db1be0e294
> https://github.com/google/syzkaller/commit/510951950dc0ee69cfdaf746061d3dbe31b49fd8
>
> Which is why the regressions you introduced were discovered so quickly.

No, it was caught because I explicitly added assertions to find misuse.

>
> >
> > So I'm fixing bugs without relation to when they were introduced.
>
> We all do

Great

>
> >
> > For example, this fix from Jiri [1] for basic design flow was merged almost
> > two years later after devlink reload was introduced [2], or this patch from
> > Parav [3] that fixed an issue introduced year before [4].
>
> What is your point? That code has bugs?

My point is that devlink should be decoupled from mlxsw, so mlxsw won't
call recursively to devlink when executes devlink API.

This is incorrect and for me it is a bug.

>
> By now I have spent more time arguing with you than you spent testing
> your patches and it's clear this discussion is not going anywhere.
>
> Are you going to send a revert or I will? This is the fourth time I'm
> asking you.

I understand your temptation to send revert, at the end it is the
easiest solution. However, I prefer to finish this discussion with
decision on how the end result in mlxsw will look like.

Let's hear Jiri and Jakub before we are rushing to revert something that
is correct in my opinion. We have whole week till merge window, and
revert takes less than 5 minutes, so no need to rush and do it before
direction is clear.

Thanks