Re: [PATCH v1] perf evlist: Force adding default events only to core PMUs

From: Ian Rogers
Date: Tue May 28 2024 - 17:37:37 EST


On Tue, May 28, 2024 at 1:33 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Tue, 28 May 2024 at 13:03, Ian Rogers <irogers@xxxxxxxxxx> wrote:
> >
> > But you've traded a fix for one set of users with a fix for another.
>
> No.
>
> You don't seem to understand what "regression" means.
>
> The "no regressions" rule is that we do not introduce NEW bugs.
>
> It *literally* came about because we had an endless dance of "fix two
> bugs, introduce one new one", and that then resulted in a system that
> you cannot TRUST.
>
> We had years of this with suspend/resume in particular, where the
> argument was always exactly "this fixed many more systems" when
> something else broke.
>
> But because random other things kept breaking, it meant that people
> couldn't upgrade the kernel and feel safe about it.
>
> Your old laptop might no longer work, because somebody had deemed that
> all those *new* laptops were more important.
>
> So I introduced that "no regressions" rule something like two decades
> ago, because people need to be able to update their kernel without
> fear of something they relied on suddenly stopping to work.
>
> The fact that the "suddenly stopped working" may be a minority DOES
> NOT MATTER ONE WHIT.
>
> Stability matters.

I think stability in the context of this problem is an abstract term
especially given the low amount of ARM support for PMUs. Does it
matter more that between Linux 6.9 and 6.10 `perf record -e cycles:pp
.` which is objectively worse than just `perf record ..` (which is
fixed by the patch) work on a Neoverse N1, or someone who is running
Linux on an ARM Apple machine is able to run `perf record -e cycles:pp
.` without triggering a PMU driver bug? Which user is making the
strange request and whose stability matters more?

A phrase I think I coined was the Beyonce principle, she said in a
song, "if you like it, then you should've put a ring on it." The
principle in software is that if you want something then you better
put a test on it - this made it into Titus Winter's book unattributed
so maybe it wasn't me :-). I've been helping drive testing in the perf
tool. If we don't follow the Beyonce principle we're held hostage to
Hyrum's law, "With a sufficient number of users of an API, it does not
matter what you promise in the contract: all observable behaviors of
your system will be depended on by somebody."

Something we could do is rename the ARM dsu PMU's event from cycles to
clockticks (as is conventional in existing PMUs) and then have a test
to enforce that no PMU has events named cycles.

Making this a driver bug doesn't satisfy the desire to shout at the
person whose change exposed the latent issue, but maybe that's the
right fix.

> It's MUCH MUCH better to have legacy bad behavior that you never dealt
> with correctly, than to introduce *new* bugs that hit something that
> used to work.
>
> So something that "perf" has never done correctly is simply not an issue.
>
> You deal with that by saying "that has never worked properly before either".
>
> You might even document it along with (hopefully) possible workarounds.
>
> The whole "one step forward, two steps back" is absolutely fine if you
> are line dancing.
>
> But we're not line dancing.
>
> We take it slow and steady, and if you can't fix something without
> breaking something else, then that thing simply does not get fixed.
>
> And there are always exceptions. Sometimes something may need to be
> broken because it's an acute security issue.
>
> And if it takes a year for somebody to find a regression, and in the
> meantime others have started relying on the new behavior, then at some
> point it's a "yes, that's a regression, but it wasn't reported in a
> timely enough manner".
>
> And sometimes the use-case is basically a museum machine and we retire
> support for it because such a machine should use museum software too.
>
> So it's not like it's some long-term absolute guarantee. But it
> absolutely *is* the #1 rule in the kernel.
>
> A fix that breaks something else is not a fix at all.
>
> And in this case, when the regression was noted within days, and
> within the merge window, there's simply no discussion about it.

We moved development to being in perf-next automatically merging to
linux-next because (as I understand it) there was a complaint of new
possibly untested features coming in during the merge window. If IBM
can test s390 on linux-next and post about breakages in the perf tool,
why is this beyond the people who support and use Neoverse N1s? Why
use linux-next if nobody is testing there?

Thanks,
Ian

> Linus
>