On 24.05.22 20:32, Chuck Zmudzinski wrote:
On 5/21/22 6:47 AM, Thorsten Leemhuis wrote:Yeah, I'm aware of it. But it seems...
On 20.05.22 16:48, Chuck Zmudzinski wrote:You are mostly right, I think. Reverting bdd8b6c98239 fixes
On 5/20/2022 10:06 AM, Jan Beulich wrote:Sorry, for you maybe, but I'm pretty sure for Linus it's not when it
On 20.05.2022 15:33, Chuck Zmudzinski wrote:
On 5/20/2022 5:41 AM, Jan Beulich wrote:The driver ought to work fine without PAT (and hence without being
On 20.05.2022 10:30, Chuck Zmudzinski wrote:I wouldn't call it a driver bug, but an incorrect configuration of the
On 5/20/2022 2:59 AM, Chuck Zmudzinski wrote:Because that's, in my understanding, is the purpose of "nopat"
On 5/20/2022 2:05 AM, Jan Beulich wrote:Actually I just did a test with the last official Debian kernel
On 20.05.2022 06:43, Chuck Zmudzinski wrote:
On 5/4/22 5:14 AM, Juergen Gross wrote:Hmm, why would the kernel not be allowed to override that? Such
On 04.05.22 10:31, Jan Beulich wrote:That is why your patch would not fix my Haswell unless
On 03.05.2022 15:22, Juergen Gross wrote:Oh, I missed that one, sorry.
... these uses there are several more. You say nothing on why
those want
leaving unaltered. When preparing my earlier patch I did
inspect them
and came to the conclusion that these all would also better
observe the
adjusted behavior (or else I couldn't have left pat_enabled()
as the
only predicate). In fact, as said in the description of my
earlier
patch, in
my debugging I did find the use in i915_gem_object_pin_map()
to be
the
problematic one, which you leave alone.
it also touches i915_gem_object_pin_map() in
drivers/gpu/drm/i915/gem/i915_gem_pages.c
I wanted to be rather defensive in my changes, but I agree atI think your approach needs to be more aggressive so it will fix
least
the
case in arch_phys_wc_add() might want to be changed, too.
all the known false negatives introduced by bdd8b6c98239
such as the one in i915_gem_object_pin_map().
I looked at Jan's approach and I think it would fix the issue
with my Haswell as long as I don't use the nopat option. I
really don't have a strong opinion on that question, but I
think the nopat option as a Linux kernel option, as opposed
to a hypervisor option, should only affect the kernel, and
if the hypervisor provides the pat feature, then the kernel
should not override that,
an override would affect only the single domain where the
kernel runs; other domains could take their own decisions.
Also, for the sake of completeness: "nopat" used when running on
bare metal has the same bad effect on system boot, so there
pretty clearly is an error cleanup issue in the i915 driver. But
that's orthogonal, and I expect the maintainers may not even care
(but tell us "don't do that then").
build of Linux 5.16, that is, a kernel before bdd8b6c98239 was
applied. In fact, the nopat option does *not* break the i915 driver
in 5.16. That is, with the nopat option, the i915 driver loads
normally on both the bare metal and on the Xen hypervisor.
That means your presumption (and the presumption of
the author of bdd8b6c98239) that the "nopat" option was
being observed by the i915 driver is incorrect. Setting "nopat"
had no effect on my system with Linux 5.16. So after doing these
tests, I am against the aggressive approach of breaking the i915
driver with the "nopat" option because prior to bdd8b6c98239,
nopat did not break the i915 driver. Why break it now?
(not breaking the driver of course - that's a driver bug -, but
having an effect on the driver).
kernel by the user. I presume X86_FEATURE_PAT is required by the
i915 driver
able to make WC mappings). It would use UC instead and be slow, but
it ought to work.
and therefore the driver should refuse to disableNo, I'm not saying that. It was wrong for this construct to be used
it if the user requests to disable it and instead warn the user that
the driver did not disable the feature, contrary to what the user
requested with the nopat option.
In any case, my test did not verify that when nopat is set in Linux
5.16,
the thread takes the same code path as when nopat is not set,
so I am not totally sure that the reason nopat does not break the
i915 driver in 5.16 is that static_cpu_has(X86_FEATURE_PAT)
returns true even when nopat is set. I could test it with a custom
log message in 5.16 if that is necessary.
Are you saying it was wrong for static_cpu_has(X86_FEATURE_PAT)
to return true in 5.16 when the user requests nopat?
in the driver, which was fixed for 5.17 (and which had caused the
regression I did observe, leading to the patch as a hopefully least
bad option).
I think that isAgain no - my view is that there's a separate, pre-existing issue
just permitting a bad configuration to break the driver that a
well-written operating system should not allow. The i915 driver
was, in my opinion, correctly ignoring the nopat option in 5.16
because that option is not compatible with the hardware the
i915 driver is trying to initialize and setup at boot time. At least
that is my understanding now, but I will need to test it on 5.16
to be sure I understand it correctly.
Also, AFAICT, your patch would break the driver when the nopat
option is set and only fix the regression introduced by bdd8b6c98239
when nopat is not set on my box, so your patch would
introduce a regression relative to Linux 5.16 and earlier for the
case when nopat is set on my box. I think your point would
be that it is not a regression if it is an incorrect user
configuration.
in the driver which was uncovered by the change. This may be a
perceived regression, but is imo different from a real one.
comes to the "no regressions rule". Just took a quick look at quotes
from Linus
https://www.kernel.org/doc/html/latest/process/handling-regressions.html
and found this statement from Linus to back this up:
```
One _particularly_ last-minute revert is the top-most commit (ignoring
the version change itself) done just before the release, and while
it's very annoying, it's perhaps also instructive.
What's instructive about it is that I reverted a commit that wasn't
actually buggy. In fact, it was doing exactly what it set out to do,
and did it very well. In fact it did it _so_ well that the much
improved IO patterns it caused then ended up revealing a user-visible
regression due to a real bug in a completely unrelated area.
```
He said that here:
https://www.kernel.org/doc/html/latest/process/handling-regressions.html
The situation is of course different here, but similar enough.
Since it is a regression, I think for now bdd8b6c98239 shouldI'm not a developer and I'm don't known the details of this thread and
be reverted and the fix backported to Linux 5.17 stable until
the underlying memory subsystem can provide the i915 driver
with an updated test for the PAT feature that also meets the
requirements of the author of bdd8b6c98239 without breaking
the i915 driver.
the backstory of the regression, but it sounds like that's the approach
that is needed here until someone comes up with a fix for the regression
exposed by bdd8b6c98239.
But if I'm wrong, please tell me.
it. There is another way to fix it, though.
The patch proposed...that approach is not making any progress either?
by Jan Beulich also fixes the regression on my system, so as
the person reporting this is a regression, I would also be satisfied
with Jan's patch instead of reverting bdd8b6c98239 as a fix. Jan
posted his proposed patch here:
https://lore.kernel.org/lkml/9385fa60-fa5d-f559-a137-6608408f88b0@xxxxxxxx/
Jan, can could provide a short status update here? I'd really like to
get this regression fixed one way or another rather sooner than later,
as this is taken way to long already IMHO.
The only reservation I have about Jan's patch is that the commitSure.
message does not clearly explain how the patch changes what
the nopat kernel boot option does. It doesn't affect me because
I don't use nopat, but it should probably be mentioned in the
commit message, as pointed out here:
https://lore.kernel.org/lkml/bd9ed2c2-1337-27bb-c9da-dfc7b31d492c@xxxxxxxxxxxx/
Whatever fix for the regression exposed by bdd8b6c98239 also
needs to be backported to the stable versions 5.17 and 5.18.
BTW, as you seem to be familiar with the issue: there is another report
about a regression WRT to Xen and i915 (that is also not making really
progress):
https://lore.kernel.org/lkml/Yn%2FTgj1Ehs%2FBdpHp@itl-email/
It's just a wild guess, but bould this somehow be related?