Re: [PATCH] sched/numa: Stagger NUMA balancing scan periods for new threads

From: Mel Gorman
Date: Fri Apr 27 2018 - 03:37:40 EST


On Fri, Apr 27, 2018 at 08:50:57AM +0200, Ingo Molnar wrote:
> >                              4.17.0-rc1             4.17.0-rc1
> >                                 vanilla           stagger-v1r1
> > sys-time-bt.D          48.78 (   0.00%)       48.22 (   1.15%)
> > sys-time-cg.D          25.31 (   0.00%)       26.63 (  -5.22%)
> > sys-time-ep.D           1.65 (   0.00%)        0.62 (  62.42%)
> > sys-time-is.D          40.05 (   0.00%)       24.45 (  38.95%)
> > sys-time-lu.D          37.55 (   0.00%)       29.02 (  22.72%)
> > sys-time-mg.D          47.52 (   0.00%)       34.92 (  26.52%)
> > sys-time-sp.D         119.01 (   0.00%)      109.05 (   8.37%)
> > sys-time-ua.D          51.52 (   0.00%)       45.13 (  12.40%)
> >
> > NUMA scan activity is reduced as well as other balancing activity.
> >
> > NUMA alloc local                 1042828         1342670
> > NUMA base PTE updates          140481138        93577468
> > NUMA huge PMD updates             272171          180766
> > NUMA page range updates        279832690       186129660
> > NUMA hint faults                 1395972         1193897
> > NUMA hint local faults            877925          855053
> > NUMA hint local percent               62              71
> > NUMA pages migrated             12057909         9158023
>
> Looks like a nice reduction in scanning overhead - which was always the main worry
> with the fault-based active NUMA balancing technique.
>
> I have a couple of minor code cleanliness nits:
>

No problem.

> > +void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
> > +{
> > +	int mm_users = 0;
> > +
> > +	if (p->mm) {
> > +		mm_users = atomic_read(&p->mm->mm_users);
> > +		if (mm_users == 1) {
> > +			p->mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
> > +			p->mm->numa_scan_seq = 0;
> > +		}
> > +	}
> > +	p->node_stamp = 0ULL;
> > +	p->numa_scan_seq = p->mm ? p->mm->numa_scan_seq : 0;
> > +	p->numa_scan_period = sysctl_numa_balancing_scan_delay;
> > +	p->numa_work.next = &p->numa_work;
> > +	p->numa_faults = NULL;
> > +	p->last_task_numa_placement = 0;
> > +	p->last_sum_exec_runtime = 0;
> > +	p->numa_group = NULL;
>
> While this is pre-existing code that you moved, could we please use a bit more
> organization to make this more readable:
>
> 	p->node_stamp			= 0ULL;
> 	p->numa_scan_seq		= p->mm ? p->mm->numa_scan_seq : 0;
> 	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
> 	p->numa_work.next		= &p->numa_work;
> 	p->numa_faults			= NULL;
> 	p->last_task_numa_placement	= 0;
> 	p->last_sum_exec_runtime	= 0;
> 	p->numa_group			= NULL;
>
> ?
>
> This form made me notice a detail: the 0ULL asymmetry looks weird, this integer
> literal type specification is entirely superfluous here, we can just write '0'.
>

Can do. I'll revise it and should send a new version on Monday. You're
right that the type is superfluous; the hunk was a straight code movement,
so I kept the existing structure.

> > +	/* New address space */
> > +	if (!(clone_flags & CLONE_VM)) {
> > +		p->numa_preferred_nid = -1;
> > +		return;
> > +	}
> > +
> > +	/* New thread, use existing preferred nid but stagger scans */
> > +	if (p->mm) {
> > +		unsigned int delay;
> > +
> > +		delay = min_t(unsigned int, task_scan_max(current),
> > +			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
> > +		delay += 2 * TICK_NSEC;
> > +		p->numa_preferred_nid = current->numa_preferred_nid;
> > +		p->node_stamp = delay;
> > +	}
>
> So this is a fork time function, shouldn't p->numa_preferred_nid be equal to
> current->numa_preferred_nid already?
>

It should, but I see no harm in being explicit.

> This is what happens in the !p->mm && CLONE_VM case anyway, right?
>

A !p->mm task should never have changed numa_preferred_nid, and it's
useless information for such tasks anyway. Kernel threads do not have a
user address space to sample and migrate.

> So we could leave out the superfluous assignment and make it clear in a comment
> that we inherit the parent's ->numa_preferred_nid intentionally.
>

I can do that.

> Also, there's a lot of p->mm use, could we add this helper local variable to
> simplify the code some more:
>
> struct mm_struct *mm = p->mm;
>
> ?

That should be harmless.
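For reference, folding all of the above together, the revised function could
look something like this. This is an untested sketch against the quoted hunks,
not the actual v2 patch:

```c
void init_numa_balancing(unsigned long clone_flags, struct task_struct *p)
{
	struct mm_struct *mm = p->mm;
	int mm_users = 0;

	if (mm) {
		mm_users = atomic_read(&mm->mm_users);
		if (mm_users == 1) {
			mm->numa_next_scan = jiffies + msecs_to_jiffies(sysctl_numa_balancing_scan_delay);
			mm->numa_scan_seq = 0;
		}
	}
	p->node_stamp			= 0;
	p->numa_scan_seq		= mm ? mm->numa_scan_seq : 0;
	p->numa_scan_period		= sysctl_numa_balancing_scan_delay;
	p->numa_work.next		= &p->numa_work;
	p->numa_faults			= NULL;
	p->last_task_numa_placement	= 0;
	p->last_sum_exec_runtime	= 0;
	p->numa_group			= NULL;

	/* New address space */
	if (!(clone_flags & CLONE_VM)) {
		p->numa_preferred_nid = -1;
		return;
	}

	/*
	 * New thread: numa_preferred_nid is inherited from the parent at
	 * fork, so only stagger the first scan against the siblings.
	 */
	if (mm) {
		unsigned int delay;

		delay = min_t(unsigned int, task_scan_max(current),
			current->numa_scan_period * mm_users * NSEC_PER_MSEC);
		delay += 2 * TICK_NSEC;
		p->node_stamp = delay;
	}
}
```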

Thanks!

--
Mel Gorman
SUSE Labs