NUMA regression(?) on 32core shanghai

From: Martin Vogt
Date: Thu Jun 04 2009 - 09:27:58 EST



Hello,

I am seeing strange/unexpected benchmark results on my NUMA machine,
a 32-core Shanghai system with 512GB RAM.


My benchmark shows runtimes that vary by up to a factor of 12(!) for identical
tests, and I think there is a bug somewhere.

I have tested the following kernels:

- 2.6.30-rc8, 2.6.29.4 and the SLES10-SP1 kernel

All show the same problem with 16/32 threads in the first run
(but not always!).
For example, on 2.6.30-rc8:

16-1: 33.403038s 28.906326s <<-- strange values
16-2: 5.444921s 5.072422s
16-3: 6.266797s 6.152743s


This is why I think this is a bug:
----------------------------------

My understanding of the NUMA memory bandwidth test is:

- if I attach 8 threads, one to each numa node
- and allocate 512MB of local memory for each thread

THEN:
- the runtime should be near constant over all nodes for all runs
(for example: every thread runs 3 seconds)


If I now double the threads (16 threads, 2 on each numa node)
then:
- the runtime should double too.
(for example: 6 seconds instead of three)

and so on, for 32 threads 12 seconds etc...
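
To make this expectation explicit, here is a minimal sketch of the arithmetic.
The function names are made up for illustration, and the 8 nodes and 3 seconds
are just the example values from above, not measurements:

```cpp
#include <cassert>

// Expected runtime under the simple model described above: the threads
// pinned to one node share that node's memory bandwidth, so the runtime
// grows linearly with the number of threads per node.
// 'nodes' and 'base_seconds' are illustrative parameters.
double expected_runtime(int threads, int nodes, double base_seconds) {
    assert(threads > 0 && nodes > 0);
    // Round up, so that fewer threads than nodes still counts as one per node.
    int threads_per_node = (threads + nodes - 1) / nodes;
    return base_seconds * threads_per_node;
}

// Ratio of a measured runtime to the runtime the model predicts.
double slowdown_factor(double measured_seconds, double expected_seconds) {
    assert(expected_seconds > 0.0);
    return measured_seconds / expected_seconds;
}
```

With 8 nodes and a 3 second base this gives 3, 6 and 12 seconds for 8, 16
and 32 threads. A first run of 33.4 seconds with 16 threads is then roughly
5.6 times what the model predicts.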

The machine sometimes behaves as expected, but in the
16/32-thread case it usually shows these strange runtimes in the first run.
(This can happen in the 8-thread test too.)

What is wrong here?
(The strange runs are a factor of 12 slower on the old kernel, and a factor of ~4 on the newer ones.)

There must be a problem somewhere.
How can I debug it?


regards,

Martin

PS: On a smaller Opteron NUMA system (4 nodes with 2 cores each and
8GB on each node) the test program works as expected.

PPS: The "bug" does not always happen, but it happens very often with 16/32 threads.
Also, the behaviour is the same if I replace numa_alloc_onnode with malloc.
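
For readers without the attachment, the general shape of such a test looks
roughly like the sketch below. It is not the attached mbind2.cpp: run_once and
RunTimes are hypothetical names, it uses plain heap allocation as in the
malloc variant, and it does no NUMA pinning at all. It only illustrates how
per-thread read/write times are collected and how the slowest thread is reported:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Read and write times of one run, taken from the slowest thread.
struct RunTimes {
    double read_secs = 0.0;
    double write_secs = 0.0;
};

// Each thread allocates its own buffer, writes it once, reads it once and
// records how long each pass took. NOTE: no thread or memory binding is done
// here; a real NUMA test would bind both to a node first.
RunTimes run_once(int threads, std::size_t bytes_per_thread) {
    assert(threads > 0);
    std::vector<RunTimes> per_thread(threads);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([t, bytes_per_thread, &per_thread] {
            using clock = std::chrono::steady_clock;
            // The vector constructor zero-fills the buffer, so its pages
            // are already touched before the timed passes start.
            std::vector<std::uint64_t> buf(bytes_per_thread / sizeof(std::uint64_t));

            auto t0 = clock::now();
            for (std::size_t i = 0; i < buf.size(); ++i) buf[i] = i;  // write pass
            auto t1 = clock::now();
            std::uint64_t sum = 0;
            for (std::uint64_t v : buf) sum += v;                     // read pass
            auto t2 = clock::now();

            volatile std::uint64_t sink = sum;  // keep the read loop alive
            (void)sink;
            per_thread[t].write_secs = std::chrono::duration<double>(t1 - t0).count();
            per_thread[t].read_secs = std::chrono::duration<double>(t2 - t1).count();
        });
    }
    for (auto& th : pool) th.join();

    // Report the slowest thread for each pass.
    RunTimes slowest;
    for (const RunTimes& r : per_thread) {
        slowest.read_secs = std::max(slowest.read_secs, r.read_secs);
        slowest.write_secs = std::max(slowest.write_secs, r.write_secs);
    }
    return slowest;
}
```

Calling run_once(16, std::size_t{512} << 20) three times in a row would
correspond to three consecutive 16-thread runs, assuming enough free memory.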

Benchmark:
- cron is off, HZ is 100, libc is 2.4-31.43.7 from SLES10
- Format example:
08-1: 3.405676 3.023264
(8 threads, first run: read took 3.4 seconds and write took 3.0 seconds.)
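
For anyone who wants to post-process the numbers below, a line in this format
can be read with a few lines of C++. This is only a sketch: Result,
parse_result and read_ratio are hypothetical names and not part of mbind2.cpp.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// One line of benchmark output, e.g. "16-1: 33.403038 28.906326".
struct Result {
    int threads;
    int run;
    double read_secs;
    double write_secs;
};

// Parses a result line; returns false if the line does not match the format.
// Trailing text such as "<<-- strange values" is ignored.
bool parse_result(const std::string& line, Result& out) {
    return std::sscanf(line.c_str(), "%d-%d: %lf %lf",
                       &out.threads, &out.run,
                       &out.read_secs, &out.write_secs) == 4;
}

// How many times slower the read pass of 'first' was compared to 'later'.
double read_ratio(const Result& first, const Result& later) {
    assert(later.read_secs > 0.0);
    return first.read_secs / later.read_secs;
}
```

Kernel names and separator lines do not match the pattern, so parse_result
returns false for them and they can simply be skipped.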


2.6.30-rc8
=====================
04-1: 3.591044 3.295444
04-2: 3.588437 3.280143
04-3: 3.448116 2.995627

08-1: 4.122432 3.566830
08-2: 4.119241 3.548015
08-3: 3.819517 3.349197

16-1: 33.403038 28.906326 <<-- strange values
16-2: 5.444921 5.072422
16-3: 6.266797 6.152743

32-1: 49.885150 76.500259 <<-- strange values
32-2: 19.114738 12.170802
32-3: 14.807441 11.064564


2.6.29.4
==================
04-1: 3.375012 3.057332
04-2: 3.401835 3.039497
04-3: 3.359395 2.980974

08-1: 3.405676 3.023264
08-2: 3.257743 3.000751
08-3: 3.129684 2.886261

16-1: 22.417126 11.807065 <<-- strange values
16-2: 6.031583 5.098305
16-3: 5.088144 5.457238

32-1: 45.829553 24.225427 <<-- strange values
32-2: 13.165044 12.290732
32-3: 8.908012 11.622502

2.6.16 (SuSE SLES10-SP1)+perfctr
================================
(Seconds: the time of the slowest thread is taken)


#Thread-run read in secs write in secs
04-1: 3.375012 3.057332
04-2: 3.401835 3.039497
04-3: 3.359395 2.980974

08-1: 3.405676 3.023264
08-2: 3.257743 3.000751
08-3: 3.129684 2.886261

16-1: 74.399871 12.747340 <<-- strange values
16-2: 7.449596 4.401576
16-3: 6.123250 5.518968

32-1: 150.927981 55.032012 <<-- strange values
32-2: 12.119996 12.203303
32-3: 11.601377 12.485716


Attachment: mbind2.cpp.gz
Description: GNU Zip compressed data