Re: [PATCH v9 00/21] Generic page walk and ptdump

From: Steven Price
Date: Thu Jul 25 2019 - 14:05:13 EST

On 25/07/2019 10:09, Anshuman Khandual wrote:
> On 07/24/2019 07:05 PM, Steven Price wrote:
>> On 23/07/2019 07:39, Anshuman Khandual wrote:
>>> Hello Steven,
>>> On 07/22/2019 09:11 PM, Steven Price wrote:
>>>> This is a slight reworking and extension of my previous patch set
>>>> (Convert x86 & arm64 to use generic page walk), but I've continued the
>>>> version numbering as most of the changes are the same. In particular
>>>> this series ends with a generic PTDUMP implemention for arm64 and x86.
>>>> Many architectures current have a debugfs file for dumping the kernel
>>>> page tables. Currently each architecture has to implement custom
>>>> functions for this because the details of walking the page tables used
>>>> by the kernel are different between architectures.
>>>> This series extends the capabilities of walk_page_range() so that it can
>>>> deal with the page tables of the kernel (which have no VMAs and can
>>>> contain larger huge pages than exist for user space). A generic PTDUMP
>>>> implementation is the implemented making use of the new functionality of
>>>> walk_page_range() and finally arm64 and x86 are switch to using it,
>>>> removing the custom table walkers.
>>> Could other architectures just enable this new generic PTDUMP feature if
>>> required without much problem ?
>> The generic PTDUMP is implemented as a library - so the architectures
>> would have to provide the call into ptdump_walk_pgd() and provide the
>> necessary callback note_page() which formats the lines in the output.
> Though I understand that the leaf flag (any given level) details are very much
> arch specific would there be any possibility for note_page() call back to be
> unified as well. This is extracted from current PTDUMP output on arm64.
> 0xffffffc000000000-0xffffffc000080000 512K PTE RW NX SHD AF UXN MEM/NORMAL
> The first three columns are generic
> 1. Kernel virtual range span
> 2. Kernel virtual range size
> 3. Kernel virtual range mapping level
> Where as rest of the output are architecture specific page table entry flags.
> Just wondering if we could print the first three columns in ptdump_walk_pgd()
> itself before calling arch specific callback to fetch a print buffer for rest
> of the line bounded with some character limit so that line does not overflow.
> Its not something which must be done but I guess it's worth giving it a try.
> This will help consolidate ptdump_walk_pgd() further.

It's not quite as simple as it seems. One of the things note_page() does
is work out whether a contiguous set of pages are "the same" (i.e.
should appear as one range). This is ultimately an architecture specific
decision: we need to look at the flags to do this.

I'm of course happy to be proved wrong if you can see a neat way of
making this work.

>> Hopefully the implementation is generic enough that it should be
>> flexible enough to work for most architectures.
>> arm, powerpc and s390 are the obvious architectures to convert next as
>> they already have note_page() functions which shouldn't be too difficult
>> to convert to match the callback prototype.
> Which can be done independently later on, fair enough.
>>>> To enable a generic page table walker to walk the unusual mappings of
>>>> the kernel we need to implement a set of functions which let us know
>>>> when the walker has reached the leaf entry. After a suggestion from Will
>>>> Deacon I've chosen the name p?d_leaf() as this (hopefully) describes
>>>> the purpose (and is a new name so has no historic baggage). Some
>>>> architectures have p?d_large macros but this is easily confused with
>>>> "large pages".
>>> I have not been following the previous version of the series closely, hence
>>> might be missing something here. But p?d_large() which identifies large
>>> mappings on a given level can only signify a leaf entry. Large pages on the
>>> table exist only as leaf entries. So what is the problem for it being used
>>> directly instead. Is there any possibility in the kernel mapping when these
>>> large pages are not leaf entries ?
>> There isn't any problem as such with using p?d_large macros. However the
>> name "large" has caused confusion in the past. In particular there are
>> two types of "large" page:
>> 1. leaf entries at high levels than normal ('sections' on Arm, for 4K
>> pages this gives you 2MB and 1GB pages).
>> 2. sets of contiguous entries that can share a TLB entry (the
>> 'Contiguous bit' on Arm - which for 4K pages gives you 16 entries = 64
>> KB 'pages').
> This is arm64 specific and AFAIK there are no other architectures where there
> will be any confusion wrt p?d_large() not meaning a single entry.

This isn't arm64 specific (or even Arm specific) - only the examples I
gave are. There are several architectures with software walks where the
TLB can be populated with arbitrary sized entries. I have to admit I
don't fully understand the page table layouts of many of the other
architectures that Linux supports.

> As you have noted before if we are printing individual entries with PTE_CONT
> then they need not be identified as p??d_large(). In which case p?d_large()
> can just safely point to p?d_sect() identifying regular huge leaf entries.

The printing is largely irrelevant here (it's handled by arch code), so
PTE_CONT isn't a problem. However to walk the page tables we need to
know precisely "is this the leaf of the tree", we don't really care what
size page is being mapped, just whether we should continue the walk or not.

>> In many cases both give the same effect (reduce pressure on TLBs and
>> requires contiguous and aligned physical addresses). But for this case
>> we only care about the 'leaf' case (because the contiguous bit makes no
>> difference to walking the page tables).
> Right and we can just safely identify section entries with it. What will be
> the problem with that ? Again this is only arm64 specific.

It's not arm64 specific.

>> As far as I'm aware p?d_large() currently implements the first and
>> p?d_(trans_)huge() implements either 1 or 2 depending on the architecture.
> AFAIK option 2 exists only on arm6 platform. IIUC generic MM requires two
> different huge page dentition from platform. HugeTLB identifies large entries
> at PGD|PUD|PMD after converting it's content into PTE first. So there is no
> need for direct large page definitions for other levels.
> 1. THP - pmd_trans_huge()
> 2. HugeTLB - pte_huge() CONFIG_ARCH_WANT_GENERAL_HUGETLB is set
> A simple check for p?d_large() on mm/ and include/linux shows that there are
> no existing usage for these in generic MM. Hence it is available.

As Will has already replied - this is probably a good opportunity to
pick a better name - arch code can then be tidied up to use the new name.

> Currently that is a difference between x86 and arm64 ptdump output. Whether to
> show the gaps or not could not be achieved by defining a note_page() callback
> function which does nothing but just return ? But if the single line output is
> split between generic and callback as I had proposed earlier this will not be
> possible any more as half the line would have been already printed.

I think the proposal at the moment is for arm64 to match x86 as it seems
like it would be useful to know at what level the gaps are. But I also
like giving each arch the flexibility to display what information is
relevant for that architecture. It's the custom page walkers I'm trying
to remove as really there isn't much difference between architectures
there (as lots of generic code has to deal with page tables in one way
or another).