Re: [RFC PATCH v0 1/3] sched/numa: Process based autonuma scan period framework

From: Raghavendra K T
Date: Wed Jun 21 2023 - 01:52:18 EST


+linux-mm
On 2/1/2022 7:45 PM, Mel Gorman wrote:

On Tue, Feb 01, 2022 at 05:52:55PM +0530, Bharata B Rao wrote:

On 1/31/2022 5:47 PM, Mel Gorman wrote:

On Fri, Jan 28, 2022 at 10:58:49AM +0530, Bharata B Rao wrote:

From: Disha Talreja <dishaa.talreja@xxxxxxx>

Add a new framework that calculates autonuma scan period
based on per-process NUMA fault stats.

NUMA faults can be classified into different categories, such
as local vs. remote, or private vs. shared. It is also important
to understand such behavior from the perspective of a process.
The per-process fault stats added here will be used for
calculating the scan period in the adaptive NUMA algorithm.

Be more specific on how the local vs remote, private vs shared states
are reflections of per-task activity of the same.

Sure, will document the algorithm better. However the overall thinking
here is that the address-space scanning is a per-process activity and
hence the scan period value derived from the accumulated per-process
faults is more appropriate than calculating per-task (per-thread) scan
periods. Participating threads may have their local/remote and private/shared
behaviors, but when aggregated at the process level, it gives a better
input for eventual scan period variation. The understanding is that individual
thread fault rates will start altering the overall process metrics in
such a manner that we respond by changing the scan rate to do more aggressive
or less aggressive scanning.

I don't have anything to add on your other responses as it would mostly
be an acknowledgment of your response.

However, the major concern I have is that address-space wide decisions
on scan rates have no sensible means of adapting to thread-specific
requirements. I completely agree that it will result in more stable scan
rates, particularly the adjustments. It also side-steps a problem where
new threads may start with a scan rate that is completely inappropriate.

However, I worry that it would be limited overall because each thread
potentially has unique behaviour which is not obvious in a workload like
NAS where threads are all executing similar instructions on different
data. For other applications, threads may operate on thread-local areas
only (low scan rate), others could operate on shared only regions (high
scan rate until back off and interleave), threads can have phase behaviour
(manager thread collecting data from worker threads) and threads can have
different lifetimes and phase behaviour. Each thread would have a different
optimal scan rate to decide if memory needs to be migrated to a local node
or not. I don't see how address-space wide statistics could ever be mapped
back to threads to adapt scan rates based on thread-specific behaviour.

Thread scanning on the other hand can be improved in multiple ways. If
nothing else, they can do redundant scanning of regions that are
not relevant to a task which gets increasingly problematic when VSZ
increases. The obvious problems are

1. Scan based on page table updates, not address ranges to mitigate
problems with THP vs base page updates

Hello Mel,
Sorry for digging up a very old email to seek direction on NUMA scanning.

From the list, we have handled (2) and (3) below, and we are looking forward to continuing with (1) above.

My understanding is that when the 256MB limit was introduced, it was mainly
to limit the total number of PTEs we scan (= 64k PTEs for 4kB pages).

Considering that we can cover more memory when THP or hugepages are in use,
do we want to cover more huge PTEs here?

I mean, can we say we want to scan 64k page table entries even when each entry maps a 2MB page (or the corresponding hugepage size)?

I started with a simple patch that just handles the 4k/hugepage cases, but
it does not handle the THP case properly yet, as it is not trivial to track
how many page table entries we handled in a VMA that has THP.
(The patch may have whitespace errors because of copying.)

The idea is to bound each scan by 64k PTEs rather than by 256MB of address space.
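
To make the arithmetic concrete, here is a minimal userspace sketch (not kernel code). The helper names are hypothetical, and the 4kB/2MB page shifts assume x86-64. It only shows how a 256MB scan size maps to an entry budget, and how much memory that same budget would cover if every entry mapped a 2MB page.

```c
#include <assert.h>
#include <stdint.h>

#define BASE_PAGE_SHIFT 12u /* 4kB base pages (assumption: x86-64) */
#define HUGE_PAGE_SHIFT 21u /* 2MB huge pages (assumption: x86-64) */

/* Number of entries of size (1 << page_shift) needed to map [start, end). */
static inline uint64_t entries_in_range(uint64_t start, uint64_t end,
					unsigned int page_shift)
{
	assert(end >= start);
	return (end - start) >> page_shift;
}

/* Entry budget implied by a scan size given in MB, counted in base pages. */
static inline uint64_t scan_budget_entries(uint64_t scan_size_mb)
{
	return scan_size_mb << (20 - BASE_PAGE_SHIFT);
}

/* Bytes covered if the whole budget is spent on entries of one size. */
static inline uint64_t budget_coverage_bytes(uint64_t budget,
					     unsigned int page_shift)
{
	return budget << page_shift;
}
```

Under these assumptions, scan_budget_entries(256) is 65536 entries. Spent on 4kB entries that budget covers 256MB, and spent entirely on 2MB entries it covers 128GB, which is why counting entries instead of bytes lets a single pass reach much more hugepage-backed memory.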

Secondly, and unrelated to this, I was also wondering whether information
about how recently a VMA was accessed could be helpful.

Please let me know your suggestions/comments on the direction and approach.

2. Move scan delay to be a per-vma structure that is kmalloced if
necessary instead of being address space wide.

3. Track what threads access a VMA. The suggestion was to use an unsigned
long pid_mask and use the lower bits to tag approximately what
threads access a VMA. Skip VMAs that did not trap a fault. This would
be approximate because of PID collisions but would reduce scanning
of areas the thread is not interested in

4. Track active regions within VMAs. Very coarse tracking, use unsigned
long to trap what ranges are active

In different ways, this would reduce the amount of scanning work threads
do and focuses them on regions of relevance to reduce overhead overall
without losing thread-specific details.
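
As an illustration of suggestion (3) only, a minimal userspace sketch of the pid_mask idea could look like the following. The struct and function names are hypothetical and do not come from any posted patch; the bucket is simply the PID modulo the number of bits in an unsigned long, which is one reading of "use the lower bits".

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MASK_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical per-VMA state; the name is illustrative only. */
struct vma_access_state {
	unsigned long pid_mask; /* one bit per PID bucket */
};

/* Map a PID to a single bit using its low-order bits. */
static inline unsigned long pid_bit(int pid)
{
	assert(pid >= 0);
	return 1UL << ((unsigned int)pid % MASK_BITS);
}

/* Called when a thread takes a NUMA hinting fault in this VMA. */
static inline void record_fault(struct vma_access_state *s, int pid)
{
	s->pid_mask |= pid_bit(pid);
}

/* A thread scans the VMA only if its bucket has faulted here before. */
static inline bool should_scan(const struct vma_access_state *s, int pid)
{
	return (s->pid_mask & pid_bit(pid)) != 0;
}
```

Two PIDs that differ by a multiple of MASK_BITS share a bit, so the filter is approximate in the way described above: a collision can cause some unnecessary scanning, but a thread that has faulted in the VMA is never skipped.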

Unfortunately, I have not had the time yet to prototype anything.

Comments about the patch:
- May need to scale the virtpages check as well
- Needs checking of the exact THP PTEs covered in a scan
- Does not touch task_scan_min() etc., which influence scan_period (is that required?)
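
For readers who want to reason about the budget accounting outside the kernel, here is a simplified userspace model of the loop that the patch below modifies. Everything in it is an assumption made for illustration: struct model_vma and model_scan() are hypothetical, each VMA is consumed whole (the real loop clips the range per iteration), and THP is not modelled at all.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA. */
struct model_vma {
	unsigned long start, end; /* byte range [start, end) */
	unsigned int entry_shift; /* log2 of mapping size: 12 for 4kB, 21 for 2MB */
	int had_updates;          /* nonzero if the scan changed any entry */
};

/* Returns how many VMAs are visited before either budget runs out. */
static inline size_t model_scan(const struct model_vma *vmas, size_t n,
				long ptes_to_scan, long virtpages)
{
	size_t visited = 0;

	for (size_t i = 0; i < n; i++) {
		long entries = (long)((vmas[i].end - vmas[i].start) >>
				      vmas[i].entry_shift);

		/* The PTE budget is charged only when entries were updated. */
		if (vmas[i].had_updates)
			ptes_to_scan -= entries;
		/* The virtual-space budget is always charged. */
		virtpages -= entries;
		visited++;

		if (ptes_to_scan <= 0 || virtpages <= 0)
			break;
	}
	return visited;
}
```

With a budget of 65536 entries, two 128MB VMAs mapped with 4kB pages exhaust the budget, whereas a 1GB hugetlb VMA with 2MB pages costs only 512 entries. Because virtpages is charged in the same entry units, this model also shows why the first comment above suggests that the virtpages check may need scaling.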

---8<---

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 6d041aa9f0fe..066e9bee1187 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -260,7 +260,8 @@ int pud_huge(pud_t pud);
long hugetlb_change_protection(struct vm_area_struct *vma,
unsigned long address, unsigned long end, pgprot_t newprot,
unsigned long cp_flags);
-
+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end);
bool is_hugetlb_entry_migration(pte_t pte);
void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..e64430863f9e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2441,6 +2441,8 @@ bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
extern long change_protection(struct mmu_gather *tlb,
struct vm_area_struct *vma, unsigned long start,
unsigned long end, unsigned long cp_flags);
+extern long effective_scanned_ptes(struct vm_area_struct *vma,
+ unsigned long start, unsigned long end);
extern int mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
struct vm_area_struct *vma, struct vm_area_struct **pprev,
unsigned long start, unsigned long end, unsigned long newflags);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 373ff5f55884..a8280f589cbf 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2959,7 +2959,7 @@ static void task_numa_work(struct callback_head *work)
struct vm_area_struct *vma;
unsigned long start, end;
unsigned long nr_pte_updates = 0;
- long pages, virtpages;
+ long pages, virtpages, ptes_to_scan;
struct vma_iterator vmi;

SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
@@ -3006,6 +3006,8 @@ static void task_numa_work(struct callback_head *work)
start = mm->numa_scan_offset;
pages = sysctl_numa_balancing_scan_size;
pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+ /* Consider total number of PTEs to scan rather than sticking to 256MB */
+ ptes_to_scan = pages;
virtpages = pages * 8; /* Scan up to this much virtual space */
if (!pages)
return;
@@ -3099,11 +3101,11 @@ static void task_numa_work(struct callback_head *work)
* areas faster.
*/
if (nr_pte_updates)
- pages -= (end - start) >> PAGE_SHIFT;
- virtpages -= (end - start) >> PAGE_SHIFT;
+ ptes_to_scan -= effective_scanned_ptes(vma, start, end);

+ virtpages -= effective_scanned_ptes(vma, start, end);
start = end;
- if (pages <= 0 || virtpages <= 0)
+ if (ptes_to_scan <= 0 || virtpages <= 0)
goto out;

cond_resched();
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f154019e6b84..9935b462c479 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6841,6 +6841,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
return pages > 0 ? (pages << h->order) : pages;
}

+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ struct hstate *h = hstate_vma(vma);
+
+ return (end - start) >> (PAGE_SHIFT + h->order);
+}
+
+
/* Return true if reservation was successful, false otherwise. */
bool hugetlb_reserve_pages(struct inode *inode,
long from, long to,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92d3d3ca390a..8022cb09b47b 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -586,6 +586,16 @@ long change_protection(struct mmu_gather *tlb,
return pages;
}

+long effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+ unsigned long end)
+{
+ if (is_vm_hugetlb_page(vma))
+ return hugetllb_effective_scanned_ptes(vma, start, end);
+
+ return (end - start) >> PAGE_SHIFT;
+}
+
+
static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
unsigned long next, struct mm_walk *walk)
{