Re: [PATCH 2/2] mm: mincore: use folio_pte_batch() to batch process large folios

From: Baolin Wang
Date: Thu Mar 27 2025 - 08:00:27 EST




On 2025/3/27 18:49, Oscar Salvador wrote:
> On Wed, Mar 26, 2025 at 11:38:11AM +0800, Baolin Wang wrote:
> > @@ -118,16 +120,31 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> >  		walk->action = ACTION_AGAIN;
> >  		return 0;
> >  	}
> > -	for (; addr != end; ptep++, addr += PAGE_SIZE) {
> > +	for (; addr != end; ptep += step, addr += step * PAGE_SIZE) {
> >  		pte_t pte = ptep_get(ptep);
> > +		step = 1;
> >  		/* We need to do cache lookup too for pte markers */
> >  		if (pte_none_mostly(pte))
> >  			__mincore_unmapped_range(addr, addr + PAGE_SIZE,
> >  						 vma, vec);
> > -		else if (pte_present(pte))
> > -			*vec = 1;
> > -		else { /* pte is a swap entry */
> > +		else if (pte_present(pte)) {
> > +			if (pte_batch_hint(ptep, pte) > 1) {
>
> AFAIU, you will only batch if the CONT_PTE is set, but that is only true for arm64,
> and so we lose the ability to batch in e.g: x86 when we have contiguous
> entries, right?
>
> So why not have folio_pte_batch take care of it directly without involving
> pte_batch_hint here?

Good question, this was the first approach I tried.

However, I found an obvious performance regression with small folios (where CONT_PTE is not set). I think the overhead introduced by vm_normal_folio() and folio_pte_batch() outweighs the gain from batch-processing small folios.

For large folios where CONT_PTE is set, ptep_get()--->contpte_ptep_get() wastes a significant amount of CPU time, so using folio_pte_batch() noticeably improves performance.
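To illustrate the trade-off being discussed, here is a minimal user-space sketch (not kernel code) of the pattern in the patch: a cheap per-entry hint gates the expensive batch path, so entries without the hint (the small-folio case) never pay the batching overhead. batch_hint() and mark_present() are hypothetical stand-ins for pte_batch_hint() and the mincore loop; the real kernel helpers behave differently in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for pte_batch_hint(): returns the length of a
 * contiguity-hinted run starting at index i, or 1 when no hint is set.
 * The point is that this check is cheap, while the full batch scan
 * (folio_pte_batch() in the kernel) costs a folio lookup per call. */
static size_t batch_hint(const int *cont, size_t i, size_t n)
{
	size_t len = 1;

	if (!cont[i])		/* no hint: treat as a single entry */
		return 1;
	while (i + len < n && cont[i + len])
		len++;
	return len;
}

/* Walk the range, stepping by the batch size only when the hint
 * reports a run longer than 1 -- mirroring the
 * "if (pte_batch_hint(ptep, pte) > 1)" gate in the patch. */
static void mark_present(const int *present, const int *cont, size_t n,
			 unsigned char *vec)
{
	size_t step;

	for (size_t i = 0; i < n; i += step) {
		step = 1;
		if (present[i]) {
			size_t hint = batch_hint(cont, i, n);

			if (hint > 1)
				step = hint;	/* batch the whole run */
			for (size_t j = 0; j < step; j++)
				vec[i + j] = 1;
		} else {
			vec[i] = 0;
		}
	}
}
```

With this structure, entries lacking the hint take exactly the same single-step path as before, which is why the regression Baolin saw with small folios disappears: the per-entry cost added for them is a single cheap check rather than a folio lookup plus a scan.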