Re: [PATCH v2] Subject: x86/PAT: Report PAT on CPUs that support PAT without MTRR

From: Chuck Zmudzinski
Date: Wed Jul 13 2022 - 15:38:25 EST


On 7/13/2022 3:22 PM, Chuck Zmudzinski wrote:
> On 7/13/2022 3:07 PM, Chuck Zmudzinski wrote:
> > On 7/13/2022 9:45 AM, Juergen Gross wrote:
> > > >> On 7/13/2022 6:36 AM, Chuck Zmudzinski wrote:
> > > >> And in addition, if we are going to backport this patch to
> > > >> all current stable branches, we better have a really, really,
> > > >> good reason for changing the behavior of "nopat" on Xen.
> > > >>
> > > >> Does such a reason exist?
> > > >
> > > > Well, the simple reason is: It doesn't work the same way under Xen
> > > > and non-Xen (in turn because, before my patch or whatever equivalent
> > > > work, things don't work properly anyway, PAT-wise). Yet it definitely
> > > > ought to behave the same everywhere, imo.
> > >
> > > There is Documentation/x86/pat.rst which rather clearly states, how
> > > "nopat" is meant to work. It should not change the contents of the
> > > PAT MSR and keep it just as it was set at boot time (the doc talks
> > > about the "BIOS" setting of the MSR, and I guess in the Xen case
> > > the hypervisor is kind of acting as the BIOS).
> > >
> > > The question is, whether "nopat" needs to be translated to
> > > pat_enabled() returning "false".
> >
> > When I started working on a re-factoring effort of the logic
> > surrounding pat_enabled(), I noticed there are five different
> > reasons in the current code for setting pat_disabled to true,
> > which IMO is what should be a redundant variable that should
> > always be equal !pat_enabled() and !pat_bp_enabled, but that
> > unfortunately is not the case. The five reasons for setting
> > pat_disabled to true are given as message strings:
> >
> > 1. "MTRRs disabled, skipping PAT initialization too."
> > 2. "PAT support disabled because CONFIG_MTRR is disabled in the kernel."
> > 3. "PAT support disabled via boot option."
> > 4. "PAT not supported by the CPU."
> > 5. "PAT support disabled by the firmware."
> >
> > The only effect of setting pat_disabled to true is to inhibit
> > the execution of pat_init(), but it does not inhibit the execution
> > of init_cache_modes(), which is for handling all these cases
> > when pat_init() was skipped. The Xen case is one of those
> > cases, so in the Xen case, pat_disabled will be true yet the
> > only way to fix the current regression and the five-year-old
> > commit is by setting pat_bp_enabled to true so pat_enabled()
> > will return true. So to fix the five-year-old commit, we must have
> >
> > pat_enabled() != pat_disabled
> >
> > Something is wrong with this logic, that is why I wanted to precede
> > my fix with some re-factoring that will change some variable
> > and function names and modify some comments before trying
> > to fix the five-year-old commit, so that we will never have a situation
> > when pat_enabled() != pat_disabled.
> >
> > Chuck
> Sorry, I meant to say,
>
> To fix the five-year-old commit, we must have
>
> pat_enabled() != !pat_disabled or pat_enabled() == pat_disabled,
>
> and there is something wrong with that logic.
>
> Chuck

So to summarize, I think this means that to be comfortable
fixing the five-year-old commit and the current regression
by artificially setting pat_bp_enabled and pat_enabled() to
true, something which both my patch and Jan's patch does,
we need to come to a new understanding of what the
static boolean variable pat_disabled in
arch/x86/mm/pat/memtype.c in the code really means.

The fact is, we have a regression and the only fix we
can find is to try to make pat_enabled() == pat_disabled

I need to stop thinking about this for a while. It is time
for those who have authority to fix this regression to
make some comments about how they think this should
be fixed.

Chuck