> On Mon, Oct 19, 2015 at 04:09:09PM -0400, David Woods wrote:
> > The arm64 MMU supports a Contiguous bit which is a hint that the TTE
> > is one of a set of contiguous entries which can be cached in a single
> > TLB entry.  Supporting this bit adds new intermediate huge page sizes.
> >
> > The set of huge page sizes available depends on the base page size.
> > Without using contiguous pages the huge page sizes are as follows.
> >
> >  4KB: 2MB 1GB
> > 64KB: 512MB
> >
> > With a 4KB granule, the contiguous bit groups together sets of 16 pages
> > and with a 64KB granule it groups sets of 32 pages.  This enables two new
> > huge page sizes in each case, so that the full set of available sizes
> > is as follows.
> >
> >  4KB: 64KB 2MB 32MB 1GB
> > 64KB: 2MB 512MB 16GB
> >
> > If a 16KB granule is used then the contiguous bit groups 128 pages
> > at the PTE level and 32 pages at the PMD level.
> >
> > If the base page size is set to 64KB then 2MB pages are enabled by
> > default.  It is possible in the future to make 2MB the default huge
> > page size for both 4KB and 64KB granules.
>
> Thank you for the V2 David,
> I have some comments below.
>
> I would recommend running the next version of this series through
> the libhugetlbfs test suite, as that may pick up a few things too.
>
> Cheers,
> --
> Steve
>
> > +static inline pte_t pte_mkcont(pte_t pte)
> > +{
> > +	pte = set_pte_bit(pte, __pgprot(PTE_CONT));
> > +	return set_pte_bit(pte, __pgprot(PTE_TYPE_PAGE));
> > +	return pte;
>
> The second return should be removed.
>
> > 	/*
> > 	 * Hugetlb definitions.
> > 	 */
> > -#define HUGE_MAX_HSTATE		2
> > +#define HUGE_MAX_HSTATE		((2 * CONFIG_PGTABLE_LEVELS) - 1)
>
> Not sure about this definition.  I would just go with the maximum
> possible, which is for a 4KB granule:
>
>  1 x 1GB  pud
>  1 x 2MB  pmd
> 16 x 2MB  pmds (one 32MB huge page)
> 16 x 4KB  ptes (one 64KB huge page)
>
> So 4 for now?

Right, this is not needed anymore.

> > #define HPAGE_SHIFT		PMD_SHIFT
> > #define HPAGE_SIZE		(_AC(1, UL) << HPAGE_SHIFT)
> > #define HPAGE_MASK		(~(HPAGE_SIZE - 1))
> > @@ -496,7 +509,7 @@ static inline pud_t *pud_offset(pgd_t *pgd, unsigned long addr)
> > static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> > {
> > 	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> > -			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE;
> > +			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_CONT;
>
> Why has PTE_CONT been added to the pte_modify mask?  This will allow
> functions such as mprotect to remove the PTE_CONT bit.
Removed.

> > +static inline pte_t pte_modify_pfn(pte_t pte, unsigned long newpfn)
> > +{
> > +	const pteval_t mask = PHYS_MASK & PAGE_MASK;
> > +
> > +	pte_val(pte) = pfn_pte(newpfn, (pte_val(pte) & ~mask));
> > +	return pte;
> > +}
> > +
> > +static inline pmd_t pmd_modify_pfn(pmd_t pmd, unsigned long newpfn)
> > +{
> > +	const pmdval_t mask = PHYS_MASK & PAGE_MASK;
> > +
> > +	pmd = pfn_pmd(newpfn, (pmd_val(pmd) & ~mask));
> > +	return pmd;
> > +}
>
> pte_modify_pfn and pmd_modify_pfn aren't referenced anywhere in the
> patch, so they should be removed.
Ok.

> > +static int find_num_contig(struct mm_struct *mm, unsigned long addr,
> > +			   pte_t *ptep, pte_t pte, size_t *pgsize)
> > +{
> > +	pgd_t *pgd = pgd_offset(mm, addr);
> > +	pud_t *pud;
> > +	pmd_t *pmd;
> > +
> > +	if (!pte_cont(pte))
> > +		return 1;
> > +
> > +	pud = pud_offset(pgd, addr);
> > +	pmd = pmd_offset(pud, addr);
>
> We need to check for pgd_present and pud_present as we walk.  I would
> be tempted to VM_BUG_ON if they are in an unexpected state.
>
> > +	if ((pte_t *)pmd == ptep) {
> > +		*pgsize = PMD_SIZE;
> > +		return CONT_PMDS;
> > +	}
>
> I would check for pmd_present and VM_BUG_ON if it wasn't in an
> expected state.
>
> > +	*pgsize = PAGE_SIZE;
> > +	return CONT_PTES;
> > +}
>
> Another approach would be something like:
>
> 	struct vm_area_struct *vma = find_vma(mm, addr);
> 	struct hstate *h = hstate_vma(vma);
> 	size_t size = huge_page_size(h);
>
> But I think looking at the page table entries like you've done (with
> some checking) may be a little better, as it can supply some more
> robust debugging with DEBUG_VM selected (and it doesn't need to
> find_vma).
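
Something like this, then?  (Untested sketch; where exactly to put the
VM_BUG_ONs is my reading of the suggestion.)

	static int find_num_contig(struct mm_struct *mm, unsigned long addr,
				   pte_t *ptep, pte_t pte, size_t *pgsize)
	{
		pgd_t *pgd = pgd_offset(mm, addr);
		pud_t *pud;
		pmd_t *pmd;

		*pgsize = PAGE_SIZE;
		if (!pte_cont(pte))
			return 1;

		/*
		 * Check each level as we walk; none of these states is
		 * expected for a mapped huge page.
		 */
		VM_BUG_ON(!pgd_present(*pgd));
		pud = pud_offset(pgd, addr);
		VM_BUG_ON(!pud_present(*pud));
		pmd = pmd_offset(pud, addr);
		VM_BUG_ON(!pmd_present(*pmd));

		/* A contiguous pmd is passed in via the pte pointer. */
		if ((pte_t *)pmd == ptep) {
			*pgsize = PMD_SIZE;
			return CONT_PMDS;
		}
		return CONT_PTES;
	}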

> > +extern void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
> > +			    pte_t *ptep, pte_t pte)
>
> We don't need this extern.

Ok.

> > +{
> > +	size_t pgsize;
> > +	int ncontig = find_num_contig(mm, addr, ptep, pte, &pgsize);
> > +
> > +	if (ncontig == 1) {
> > +		set_pte_at(mm, addr, ptep, pte);
> > +	} else {
>
> We can return early here and avoid a level of indentation below.
>
> > +		int i;
> > +		unsigned long pfn = pte_pfn(pte);
> > +		pgprot_t hugeprot =
> > +			__pgprot(pte_val(pfn_pte(pfn, 0) ^ pte_val(pte)));
> > +		for (i = 0; i < ncontig; i++) {
> > +			pr_debug("%s: set pte %p to 0x%llx\n", __func__, ptep,
> > +				 pfn_pte(pfn, hugeprot));
> > +			set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
> > +			ptep++;
> > +			pfn += pgsize / PAGE_SIZE;
>
> nit: pgsize >> PAGE_SHIFT
>
> > +			addr += pgsize;
> > +		}
> > +	}
> > +}
>
> I see... so the contiguous pte and pmd cases are folded together.
>
> The pgsize variable name could be changed, perhaps something like
> blocksize?  (I am terrible at picking names though :-)).
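
With the early return and the shift, the function would look roughly
like this (untested; pr_debug dropped for brevity):

	void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
			     pte_t *ptep, pte_t pte)
	{
		size_t pgsize;
		int i, ncontig = find_num_contig(mm, addr, ptep, pte, &pgsize);
		unsigned long pfn;
		pgprot_t hugeprot;

		if (ncontig == 1) {
			set_pte_at(mm, addr, ptep, pte);
			return;
		}

		/* Split the pte into its pfn and its other attributes. */
		pfn = pte_pfn(pte);
		hugeprot = __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^
				    pte_val(pte));

		for (i = 0; i < ncontig; i++) {
			set_pte_at(mm, addr, ptep, pfn_pte(pfn, hugeprot));
			ptep++;
			pfn += pgsize >> PAGE_SHIFT;
			addr += pgsize;
		}
	}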

> > +pte_t *huge_pte_alloc(struct mm_struct *mm,
> > +		      unsigned long addr, unsigned long sz)
> > +{
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pte_t *pte = NULL;
> > +
> > +	pr_debug("%s: addr:0x%lx sz:0x%lx\n", __func__, addr, sz);
> > +	pgd = pgd_offset(mm, addr);
> > +	pud = pud_alloc(mm, pgd, addr);
>
> Probably better to simplify the levels of indentation with:
>
> 	if (!pud)
> 		return NULL;
>
> (or goto out before your pr_debug)
>
> > +	if (pud) {
>
> Perhaps better to do something with switch(sz) below?

Ok.

> > +		if (sz == PUD_SIZE) {
> > +			pte = (pte_t *)pud;
> > +		} else if (sz == (PAGE_SIZE * CONT_PTES)) {
> > +			pmd_t *pmd = pmd_alloc(mm, pud, addr);
> > +
> > +			WARN_ON(addr & (sz - 1));
> > +			pte = pte_alloc_map(mm, NULL, pmd, addr);
> > +		} else if (sz == PMD_SIZE) {
> > +#ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
> > +			if (pud_none(*pud))
> > +				pte = huge_pmd_share(mm, addr, pud);
> > +			else
> > +#endif
>
> This can be simplified to something like:
>
> 	if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
> 	    pud_none(*pud))
> 		pte = huge_pmd_share(mm, addr, pud);
> 	else
> 		...
>
> So we can remove the preprocessor macros.

Ok.
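
Combining the early return, the switch and IS_ENABLED, the whole thing
could end up looking something like this (untested sketch; the
pmd_alloc fallback in the PMD_SIZE case is my assumption for the else
branch that is elided from the hunk above, and any CONT_PMD case is
omitted because it is not quoted here):

	pte_t *huge_pte_alloc(struct mm_struct *mm,
			      unsigned long addr, unsigned long sz)
	{
		pgd_t *pgd;
		pud_t *pud;
		pmd_t *pmd;

		pgd = pgd_offset(mm, addr);
		pud = pud_alloc(mm, pgd, addr);
		if (!pud)
			return NULL;

		switch (sz) {
		case PUD_SIZE:
			return (pte_t *)pud;
		case PAGE_SIZE * CONT_PTES:
			WARN_ON(addr & (sz - 1));
			pmd = pmd_alloc(mm, pud, addr);
			if (!pmd)
				return NULL;
			return pte_alloc_map(mm, NULL, pmd, addr);
		case PMD_SIZE:
			if (IS_ENABLED(CONFIG_ARCH_WANT_HUGE_PMD_SHARE) &&
			    pud_none(*pud))
				return huge_pmd_share(mm, addr, pud);
			return (pte_t *)pmd_alloc(mm, pud, addr);
		default:
			return NULL;
		}
	}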

> > +pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
> > +{
> > +	pgd_t *pgd;
> > +	pud_t *pud;
> > +	pmd_t *pmd = NULL;
> > +	pte_t *pte = NULL;
> > +
> > +	pgd = pgd_offset(mm, addr);
> > +	pr_debug("%s: addr:0x%lx pgd:%p\n", __func__, addr, pgd);
> > +	if (pgd_present(*pgd)) {
>
> Again, drop a level of indentation with:
>
> 	if (!pgd_present(*pgd))
> 		return NULL;
>
> Similarly for pud_present and pmd_present.

Same as above.

> > +}
> > +
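
i.e. the walk becomes something like (sketch; the body of the lookup
between the checks isn't quoted above, so only the shape is shown):

	pgd = pgd_offset(mm, addr);
	if (!pgd_present(*pgd))
		return NULL;
	pud = pud_offset(pgd, addr);
	if (!pud_present(*pud))
		return NULL;
	pmd = pmd_offset(pud, addr);
	if (!pmd_present(*pmd))
		return NULL;
	/* ... the pte/pmd lookup itself is unchanged ... */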

> > +pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
> > +			 struct page *page, int writable)
> > +{
> > +	size_t pagesize = huge_page_size(hstate_vma(vma));
> > +
>
> I would go for switch(pagesize) here.
>
> > +	if (pagesize == CONT_PTE_SIZE) {
> > +		entry = pte_mkcont(entry);
> > +	} else if (pagesize == CONT_PMD_SIZE) {
> > +		entry = pmd_pte(pmd_mkcont(pte_pmd(entry)));
> > +	} else if (pagesize != PUD_SIZE && pagesize != PMD_SIZE) {
> > +		pr_warn("%s: unrecognized huge page size 0x%lx\n",
> > +			__func__, pagesize);
> > +	}
> > +	return entry;
> > +}
> > +
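
For reference, the switch version would be roughly (untested; this
assumes CONT_PTE_SIZE, PMD_SIZE, CONT_PMD_SIZE and PUD_SIZE are all
distinct for the configured granule, otherwise the case labels would
collide):

	pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
				 struct page *page, int writable)
	{
		size_t pagesize = huge_page_size(hstate_vma(vma));

		switch (pagesize) {
		case CONT_PTE_SIZE:
			return pte_mkcont(entry);
		case CONT_PMD_SIZE:
			return pmd_pte(pmd_mkcont(pte_pmd(entry)));
		case PUD_SIZE:
		case PMD_SIZE:
			/* No contiguous bit needed at these sizes. */
			return entry;
		default:
			pr_warn("%s: unrecognized huge page size 0x%lx\n",
				__func__, pagesize);
			return entry;
		}
	}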

> > +extern pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
> > +				     unsigned long addr, pte_t *ptep)
> > +{
> > +	pte_t pte = {0};
>
> nit: Do we need an initial value for pte?
> This is the bug I mentioned above, which was caught by the test suite.
>
> > +
> > +	if (pte_cont(*ptep)) {
> > +		int ncontig, i;
> > +		size_t pgsize;
> > +		pte_t *cpte;
> > +		bool is_dirty = false;
> > +
> > +		cpte = huge_pte_offset(mm, addr);
> > +		ncontig = find_num_contig(mm, addr, cpte,
> > +					  pte_val(*cpte), &pgsize);
> > +		/* save the 1st pte to return */
> > +		pte = ptep_get_and_clear(mm, addr, cpte);
> > +		for (i = 1; i < ncontig; ++i) {
> > +			if (pte_dirty(ptep_get_and_clear(mm, addr, ++cpte)))
> > +				is_dirty = true;
> > +		}
>
> Nice, we are keeping track of the dirty state.  This looks to me like
> it *should* work well with the dirty bit management patch that Catalin
> introduced:
>
>   2f4b829 arm64: Add support for hardware updates of the access and dirty pte bits
>
> Because ptep_get_and_clear will atomically get and clear the pte with
> respect to the hardware dirty bit management, we don't lose any dirty
> information.  huge_pte_dirty is then called on the extracted pte by
> core code.
>
> For a contiguous set of ptes/pmds the individual entry will be dirtied
> by DBM rather than the complete set, so it's good to check them all
> for dirty when going through a get and clear.
>
> Technically we don't need to track dirty if CONFIG_ARM64_HW_AFDBM is
> not defined, as the core code will fault and modify the entire set of
> ptes otherwise.
>
> I would be tempted to keep this code as is, but add a comment that
> tracking the dirty variable here helps for when we switch on
> CONFIG_ARM64_HW_AFDBM.

I added a comment to try to make all this more clear.
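
Something along these lines above the loop (illustrative wording only):

	/*
	 * Save the dirty state of each entry as it is cleared.  With
	 * CONFIG_ARM64_HW_AFDBM, hardware DBM dirties the individual
	 * pte it writes through rather than the whole contiguous set,
	 * so any entry in the set may hold the dirty bit.
	 */
	if (pte_dirty(ptep_get_and_clear(mm, addr, ++cpte)))
		is_dirty = true;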

> > +#ifdef CONFIG_ARM64_64K_PAGES
> > +static __init int add_default_hugepagesz(void)
> > +{
> > +	if (size_to_hstate(CONT_PTES * PAGE_SIZE) == NULL)
> > +		hugetlb_add_hstate(CONT_PMD_SHIFT);
> > +	return 0;
> > +}
> > +arch_initcall(add_default_hugepagesz);
> > +#endif
>
> Why is this initcall defined?  Was it for testing?

This is intentional and, in a way, the motivation for these changes.  We're [...]

> I think we are missing a few functions:
>
>  huge_ptep_set_access_flags
>  huge_ptep_set_wrprotect
>  huge_ptep_clear_flush
>
> These functions need to loop through the contiguous set of ptes or
> pmds.  They should call into the ptep_ equivalents, as they will then
> work with the DBM patch.

huge_ptep_set_access_flags() was there already, but I've added [...]
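
For the wrprotect case the loop would be roughly (untested sketch; the
clear_flush variant would take the vma and call ptep_clear_flush
instead):

	void huge_ptep_set_wrprotect(struct mm_struct *mm,
				     unsigned long addr, pte_t *ptep)
	{
		int ncontig, i;
		size_t pgsize;

		ncontig = find_num_contig(mm, addr, ptep, *ptep, &pgsize);
		for (i = 0; i < ncontig; i++, ptep++, addr += pgsize)
			ptep_set_wrprotect(mm, addr, ptep);
	}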