Re: next/master bisection: baseline.login on rk3288-rock2-square
From: Ard Biesheuvel
Date: Thu Feb 04 2021 - 10:59:59 EST
On Thu, 4 Feb 2021 at 15:36, Russell King - ARM Linux admin
<linux@xxxxxxxxxxxxxxx> wrote:
>
> On Thu, Feb 04, 2021 at 03:25:20PM +0100, Ard Biesheuvel wrote:
> > Pushing contents of the cache hierarchy to main memory is *not* a
> > valid use of set/way ops, and so there is no point in pretending that
> > set/way ops will produce the same results as by-VA ops. Only the by-VA
> > ops give the architectural guarantees that we rely on for correctness.
>
> ... yet we /were/ doing that, and it worked fine for 13 years - from
> 1st June 2007 until the by-VA merge into mainline on the 3rd April
> 2020.
>
> You may be right that it wasn't the most correct way, but it worked
> for those 13 years without issue, and it's only recently that it's
> become a problem, and trying to "fix" that introduced a regression,
> and fixing that regression has caused another regression... and I
> what I'm wondering is how many more regression fixing cycles it's
> going to take - how many regression fixes on top of other regression
> fixes are we going to end up seeing here.
>
> The fact is, we never properly understood why your patch caused the
> regression I was seeing. If we don't understand it, then we can never
> say that we've fixed the problem properly. That is highlighted by the
> fact that fixing the regression I was seeing has caused another
> regression.
>
I agree with all these points.
But as I pointed out, reverting the original by-VA change, which has
been there for almost a year now, has some risks of its own. If we
don't understand the details of how this is broken, how can we be sure
we don't break something else if we revert it again?
So I'm not arguing that reverting the original patch is unreasonable,
just that doing so at this point in the cycle is risky, and that it
would be better to queue the revert for v5.12, and only backport it
after some soak time in -next. And in a sense, reinstating the
cache_clean() before cache_off() already amounts to a partial revert
of the queued fix.