Re: [Patch v5 0/6] Bug fixes on topdown events reordering

From: Ian Rogers
Date: Thu Oct 03 2024 - 11:55:45 EST

Next message: Haiyang Zhang: "RE: [PATCH net] hv_netvsc: Fix VF namespace also in netvsc_open"
Previous message: syzbot: "Re: [syzbot] [bluetooth?] KASAN: slab-use-after-free Write in sco_sock_timeout"
In reply to: Liang, Kan: "Re: [Patch v5 0/6] Bug fixes on topdown events reordering"
Next in thread: Namhyung Kim: "Re: [Patch v5 0/6] Bug fixes on topdown events reordering"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Thu, Oct 3, 2024 at 7:57 AM Liang, Kan <kan.liang@xxxxxxxxxxxxxxx> wrote:
>
>
>
> On 2024-10-02 8:57 p.m., Ian Rogers wrote:
> > On Wed, Oct 2, 2024 at 5:00 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >>
> >> On Tue, Oct 01, 2024 at 03:32:04PM -0700, Ian Rogers wrote:
> >>> On Tue, Oct 1, 2024 at 2:02 PM Namhyung Kim <namhyung@xxxxxxxxxx> wrote:
> >>>>
> >>>> On Fri, 13 Sep 2024 08:47:06 +0000, Dapeng Mi wrote:
> >>>>
> >>>>> Changes:
> >>>>> v5 -> v6:
> >>>>> * no function change.
> >>>>> * rebase patchset to latest code of perf-tool-next tree.
> >>>>> * Add Kan's reviewed-by tag.
> >>>>>
> >>>>> History:
> >>>>> v4: https://lore.kernel.org/all/20240816122938.32228-1-dapeng1.mi@xxxxxxxxxxxxxxx/
> >>>>> v3: https://lore.kernel.org/all/20240712170339.185824-1-dapeng1.mi@xxxxxxxxxxxxxxx/
> >>>>> v2: https://lore.kernel.org/all/20240708144204.839486-1-dapeng1.mi@xxxxxxxxxxxxxxx/
> >>>>> v1: https://lore.kernel.org/all/20240702224037.343958-1-dapeng1.mi@xxxxxxxxxxxxxxx/
> >>>>>
> >>>>> [...]
> >>>>
> >>>> Applied to perf-tools-next, thanks!
> >>>
> >>> I disagreed with an early patch set and the issue wasn't resolved. Specifically:
> >>>
> >>> https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/commit/?h=perf-tools-next&id=3b5edc0421e2598a0ae7f0adcd592017f37e3cdf
> >>> ```
> >>> /* Followed by topdown events. */
> >>> if (arch_is_topdown_metrics(lhs) && !arch_is_topdown_metrics(rhs))
> >>> return -1;
> >>> - if (!arch_is_topdown_metrics(lhs) && arch_is_topdown_metrics(rhs))
> >>> + /*
> >>> + * Move topdown events forward only when topdown events
> >>> + * are not in same group with previous event.
> >>> + */
> >>> + if (!arch_is_topdown_metrics(lhs) && arch_is_topdown_metrics(rhs) &&
> >>> + lhs->core.leader != rhs->core.leader)
> >>> return 1;
> >>> ```
> >>> Is a broken comparator as the lhs then rhs behavior varies from the
> >>> rhs then lhs behavior. The qsort implementation can randomly order the
> >>> events.
> >>> Please drop/revert.
> >>
> >> Can you please provide an example when it's broken? I'm not sure how it
> >> can produce new errors, but it seems to fix a specific problem. Do you
> >> have a new test failure after this change?
> >
> > It may work with a particular sort routine in a particular library
> > today, the issue is that if the sort routine were say changed from:
> >
> > if (cmp(a, b) < 0)
> >
> > to:
> >
> > if (cmp(b, a) > 0)
> >
> > the sort would vary with this change even though such a change in the
> > sorting code is a no-op. It is fundamental algorithms that this code
> > is broken, I'm not going to spend the time finding a list of
> > instructions and a sort routine to demonstrate this.
>
>
> I'm not sure I fully understand. But just want to mention that the
> change only impacts the Topdown perf metric group, which is only
> available for the ICL and later p-core. Nothing will change if no perf
> metrics topdown events are used. Very limited impact.

How is breaking every ICL and later Intel model (excluding atoms)
limited impact?

> I like the patch set is because it provides examples and tests to cover
> the possible combination of the slots event and topdown metrics events.
> So we will have a clear expectation for different combinations caused by
> the perf metrics mess.

Tests are good. I have no problem with the change, possibly it is
inefficient, except that hacking a sort comparator to not be symmetric
is broken.

> If the algorithms cannot be changed, can you please give some
> suggestions, especially for the sample read failure?

So this is symmetric:
```
if (arch_is_topdown_metrics(lhs) && !arch_is_topdown_metrics(rhs))
return -1;
if (!arch_is_topdown_metrics(lhs) && arch_is_topdown_metrics(rhs))
return 1;
```
That is were lhs and rhs swapped then you'd get the expected comparison order.
```
if (arch_is_topdown_metrics(lhs) && !arch_is_topdown_metrics(rhs) &&
lhs->core.leader != rhs->core.leader)
return -1;
if (!arch_is_topdown_metrics(lhs) && arch_is_topdown_metrics(rhs) &&
lhs->core.leader != rhs->core.leader)
return 1;
```
Is symmetric as well.
```
if (arch_is_topdown_metrics(lhs) && !arch_is_topdown_metrics(rhs))
return -1;
if (!arch_is_topdown_metrics(lhs) && arch_is_topdown_metrics(rhs) &&
lhs->core.leader != rhs->core.leader)
return 1;
```
(what this patch does) is not symmetric as the group leader impacts
the greater-than case but not the less-than case.

It is not uncommon to see in a sort function:
```
if (cmp(a, b) <= 0) {
assert(cmp(b,a) >= 0 && "check for unstable/broken compare functions");
```
and we could add this here:
https://git.kernel.org/pub/scm/linux/kernel/git/perf/perf-tools-next.git/tree/tools/lib/list_sort.c?h=perf-tools-next#n22
Try searching up "Comparison method violates its general contract"
which is what Java throws when its TimSort implementation detects
cases like this. If you break the comparator you can get the sort into
an infinite loop - maybe that's enough of an indication that the code
is broken and you don't need the exception. As is common in kernel
code we're missing guard rails in list_sort, were the comparator
passed to qsort then expectations on breakage are a roll of the dice.

I believe when I originally gave this feedback I didn't think the fix
was to do it in the comparator and maybe an additional pass over the
list was going to be necessary. Basically a sort needs to be a sort
and it can't kind of be a sort depending on the order of the
comparison, this is just incurring tech debt and when a sort tweak
happens everything will break. As the usual chump cleaning up this
tech debt I'd prefer it wasn't introduced.

Fwiw, I refuse to carry this change in:
https://github.com/googleprodkernel/linux-perf/commits/google_tools_master/
and if the tests break as a consequence then the appropriate thing is
to revert those too. Linus/upstream maintainers follow a different
plan and that's their business, I can't build off of this code.

It is unclear to me why we get patches that have been pointed out to
be broken rebased/resent repeatedly without the broken-ness corrected.
For example, the abuse of aggregation for metrics in perf script. At
least point to the disagreement in the commit message, call me an
idiot, and let the maintainer make an informed or uninformed choice. I
am frequently an idiot and I apologize for all those cases, to the
best of my knowledge this isn't one of them.

Thanks,
Ian

Next message: Haiyang Zhang: "RE: [PATCH net] hv_netvsc: Fix VF namespace also in netvsc_open"
Previous message: syzbot: "Re: [syzbot] [bluetooth?] KASAN: slab-use-after-free Write in sco_sock_timeout"
In reply to: Liang, Kan: "Re: [Patch v5 0/6] Bug fixes on topdown events reordering"
Next in thread: Namhyung Kim: "Re: [Patch v5 0/6] Bug fixes on topdown events reordering"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]