Re: [PATCH Part2 RFC v4 09/40] x86/fault: Add support to dump RMP entry on fault
From: Dave Hansen
Date: Thu Jul 08 2021 - 12:58:57 EST

On 7/8/21 9:48 AM, Brijesh Singh wrote:
> On 7/8/21 10:30 AM, Dave Hansen wrote:
>>> The reason for iterating through the 2MB region is this: if the faulting
>>> address is not assigned in the RMP table, and the page table walk level
>>> is 2MB, then one of the entries within the large page is the root cause
>>> of the fault. Since we don't know which entry, I dump all the non-zero
>>> entries.
>>
>> Logically you can figure this out though, right? Why throw 511 entries
>> at the console when we *know* they're useless?
>
> Logically it's going to be tricky to figure out which exact entry caused
> the fault, hence I dump any non-zero entry. I understand it may dump
> some useless ones.

What's tricky about it?

Sure, there's a possibility that more than one entry could contribute to
a fault. But, you always know *IF* an entry could contribute to a fault.
I'm fine if you run through the logic, don't find a known reason
(specific RMP entry) for the fault, and dump the whole table in that
case. But, unconditionally polluting the kernel log with noise isn't
very nice for debugging.

>>> There are two cases which we need to consider:
>>>
>>> 1) the faulting page is a guest private page (aka assigned)
>>> 2) the faulting page is a hypervisor page (aka shared)
>>>
>>> We will be primarily seeing #1. In this case, we know it's an assigned
>>> page, and we can decode the fields.
>>>
>>> The #2 will happen in rare conditions,
>>
>> What rare conditions?
>
> One such condition is when the RMP "in-use" bit is set; see patch 20/40.
> After applying the patch we should not see the "in-use" bit set. If we
> run into similar issues, a full RMP dump will greatly help debugging.

OK... so dump the "in-use" bit here if you see it.

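
For illustration, a per-entry formatter along these lines would decode the fields for an assigned page and, for a shared page, report the in-use bit explicitly next to the raw words. This is only a sketch: `struct demo_rmp_info`, its fields (`gpa`, `asid`) and `demo_format_rmp()` are hypothetical names, and the real bit positions of the RMP entry are not modeled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical decoded view of one RMP entry (illustration only). */
struct demo_rmp_info {
	bool assigned;		/* case 1: guest private page */
	bool in_use;		/* reported in both cases */
	uint64_t gpa;		/* assumed field: guest physical address */
	uint32_t asid;		/* assumed field: owning guest ASID */
	uint64_t raw[2];	/* raw words for the shared case */
};

/* Format one entry into buf; returns what snprintf() returns. */
static int demo_format_rmp(const struct demo_rmp_info *e, char *buf, size_t len)
{
	if (e->assigned)
		return snprintf(buf, len,
				"assigned: asid=%u gpa=0x%llx in_use=%d",
				e->asid, (unsigned long long)e->gpa, e->in_use);

	/* Shared page: name the in-use bit, then show the raw value. */
	return snprintf(buf, len, "shared: in_use=%d raw=%016llx %016llx",
			e->in_use,
			(unsigned long long)e->raw[1],
			(unsigned long long)e->raw[0]);
}
```
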
>>> if it happens, one of the undocumented bits in the RMP entry can
>>> provide us some useful information, hence we dump the raw values.
>> You're saying that there are things that can cause RMP faults that
>> aren't documented? That's rather nasty for your users, don't you think?
>
> The "in-use" bit in the RMP entry caught me off guard. The AMD APM does
> say that hardware sets the in-use bit, but it *never* explained in
> detail how to check if the fault was due to the in-use bit in the RMP
> table. As I said, the documentation folks will be updating the RMP entry
> description to document the in-use bit. I hope we will not see any other
> undocumented surprises; I am keeping my fingers crossed :)

Oh, ok. That sounds fine. Documentation is out of date all the time.