Re: [PATCH v2 2/4] sched:Consider imbalance_pct when comparing loads in numa_has_capacity

From: Ingo Molnar
Date: Tue Jun 23 2015 - 04:10:56 EST



* Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx> wrote:

> * Rik van Riel <riel@xxxxxxxxxx> [2015-06-16 10:39:13]:
>
> > On 06/16/2015 07:56 AM, Srikar Dronamraju wrote:
> > > This is consistent with all other load balancing instances where we
> > > absorb unfairness up to env->imbalance_pct. Absorbing unfairness up to
> > > env->imbalance_pct allows tasks to be pulled to and retained on their
> > > preferred nodes.
> > >
> > > Signed-off-by: Srikar Dronamraju <srikar@xxxxxxxxxxxxxxxxxx>
> >
> > How does this work with other workloads, eg.
> > single instance SPECjbb2005, or two SPECjbb2005
> > instances on a four node system?
> >
> > Is the load still balanced evenly between nodes
> > with this patch?
> >
>
> Yes, I have looked at mpstat logs while running SPECjbb2005 for 1 JVM per
> System, 2 JVMs per System and 4 JVMs per System and observed that the
> load spreading was similar with and without this patch.
>
> Also I have visualized using htop when running 0.5X (i.e. 48 threads on
> a 96-CPU system) cpu stress workloads to see that the spread is similar
> before and after the patch.
>
> Please let me know if there are any better ways to observe the
> spread. [...]

There are. I see you are using prehistoric tooling, but see the various NUMA
convergence latency measurement utilities in 'perf bench numa':

triton:~/tip> perf bench numa mem -h
# Running 'numa/mem' benchmark:

# Running main, "perf bench numa numa-mem -h"

usage: perf bench numa <options>

    -p, --nr_proc <n>     number of processes
    -t, --nr_threads <n>  number of threads per process
    -G, --mb_global <MB>  global memory (MBs)
    -P, --mb_proc <MB>    process memory (MBs)
    -L, --mb_proc_locked <MB>
                          process serialized/locked memory access (MBs), <= process_memory
    -T, --mb_thread <MB>  thread memory (MBs)
    -l, --nr_loops <n>    max number of loops to run
    -s, --nr_secs <n>     max number of seconds to run
    -u, --usleep <n>      usecs to sleep per loop iteration
    -R, --data_reads      access the data via writes (can be mixed with -W)
    -W, --data_writes     access the data via writes (can be mixed with -R)
    -B, --data_backwards  access the data backwards as well
    -Z, --data_zero_memset
                          access the data via glibc bzero only
    -r, --data_rand_walk  access the data with random (32bit LFSR) walk
    -z, --init_zero       bzero the initial allocations
    -I, --init_random     randomize the contents of the initial allocations
    -0, --init_cpu0       do the initial allocations on CPU#0
    -x, --perturb_secs <n>
                          perturb thread 0/0 every X secs, to test convergence stability
    -d, --show_details    Show details
    -a, --all             Run all tests in the suite
    -H, --thp <n>         MADV_NOHUGEPAGE < 0 < MADV_HUGEPAGE
    -c, --show_convergence
                          show convergence details
    -m, --measure_convergence
                          measure convergence latency
    -q, --quiet           quiet mode
    -S, --serialize-startup
                          serialize thread startup
    -C, --cpus <cpu[,cpu2,...cpuN]>
                          bind the first N tasks to these specific cpus (the rest is unbound)
    -M, --memnodes <node[,node2,...nodeN]>
                          bind the first N tasks to these specific memory nodes (the rest is unbound)

'-m' will measure convergence.
'-c' will visualize it.
'--thp' can be used to turn hugepages on/off
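
As a rough sketch of how those three options combine, the helper below only
builds and prints command lines for a THP on/off comparison. The function name
numa_conv_cmd and the workload parameters (-p 2 -t 16 -P 1024 -l 50) are
made up for illustration; only the option names come from the help text above.

```shell
#!/bin/bash
# Hypothetical helper (not part of perf): print a 'perf bench numa mem'
# command line that measures (-m) and shows (-c) convergence, with THP
# selected via --thp (per the help text: < 0 is MADV_NOHUGEPAGE,
# > 0 is MADV_HUGEPAGE).
numa_conv_cmd() {
    local thp="$1"; shift
    echo perf bench numa mem -m -c --thp="$thp" "$@"
}

# Print one command line with hugepages on and one with hugepages off.
# The workload parameters are illustrative assumptions.
for thp in 1 -1; do
    numa_conv_cmd "$thp" -p 2 -t 16 -P 1024 -l 50
done
```

Piping the printed lines to 'sh' would actually run them, assuming perf is
installed and was built with NUMA support.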

For example you can create a 'numa02' work-alike by doing:

vega:~> cat numa02
#!/bin/bash

perf bench numa mem --no-data_rand_walk -p 1 -t 32 -G 0 -P 0 -T 32 -l 800 -zZ0c "$@"

This 'perf bench numa' command mimics numa02 almost exactly on a 32-CPU system.

This will run it in a loop:

vega:~> cat numa02-loop

while :; do
./numa02 2>&1 | grep runtime-max/thread
sleep 1
done

Or here are various numa01 work-alikes:

vega:~> cat numa01
perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 3072 -T 0 -l 50 -zZ0c "$@"

vega:~> cat numa01-hard-bind
./numa01 --cpus=0-16_16x16#16 --memnodes=0x16,2x16

or numa01-thread-alloc:

vega:~> cat numa01-THREAD_ALLOC

perf bench numa mem --no-data_rand_walk -p 2 -t 16 -G 0 -P 0 -T 192 -l 1000 -zZ0c "$@"

You can generate very flexible setups of NUMA access patterns, and measure their
behavior accurately.
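
When sizing such a setup it can help to estimate the total memory it will
allocate. The sketch below assumes that -G is allocated once, -P once per
process and -T once per thread; that is my reading of the help text, not
something checked against the perf source. The function name
numa_footprint_mb is hypothetical.

```shell
#!/bin/bash
# Estimated total allocation in MB for a 'perf bench numa mem' setup.
# Assumption: -G is allocated once, -P once per process, -T once per thread.
numa_footprint_mb() {
    local nr_proc="$1" nr_threads="$2" mb_global="$3" mb_proc="$4" mb_thread="$5"
    echo $(( mb_global + nr_proc * mb_proc + nr_proc * nr_threads * mb_thread ))
}

# Arguments: -p -t -G -P -T, taken from the work-alike scripts above.
numa_footprint_mb 1 32 0 0 32      # numa02:              prints 1024
numa_footprint_mb 2 16 0 3072 0    # numa01:              prints 6144
numa_footprint_mb 2 16 0 0 192     # numa01-THREAD_ALLOC: prints 6144
```

Under that assumption, numa01 and numa01-THREAD_ALLOC touch the same amount of
memory and differ only in whether it is owned per process or per thread.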

It's all so much more capable and more flexible than autonumabench ...

Also, when you are trying to report numbers for multiple runs, please use
something like:

perf stat --null --repeat 3 ...

This will run the workload 3 times (doing only time measurement) and report the
stddev in a human readable form.
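
To illustrate the kind of summary meant here, this is a minimal sketch that
reduces a list of per-run times to a mean and a sample standard deviation. It
is not how perf stat computes its figure, which may use a different estimator;
the function name summarize_runs and the run times are made up.

```shell
#!/bin/bash
# Illustrative only: read one run time (in seconds) per line on stdin and
# print "mean +- stddev", using the sample standard deviation (n - 1).
summarize_runs() {
    awk '{ n++; sum += $1; sumsq += $1 * $1 }
         END {
             if (n < 2) { print "need at least 2 runs"; exit 1 }
             mean = sum / n
             var = (sumsq - n * mean * mean) / (n - 1)
             if (var < 0) var = 0    # guard against rounding error
             printf "%.3f +- %.3f secs (%d runs)\n", mean, sqrt(var), n
         }'
}

# Made-up run times, for illustration only:
printf '%s\n' 20.0 21.0 22.0 | summarize_runs
# prints: 21.000 +- 1.000 secs (3 runs)
```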

Thanks,

Ingo