Re: [4.2, Regression] Queued spinlocks cause major XFS performance regression

From: Peter Zijlstra
Date: Fri Sep 04 2015 - 07:32:47 EST

On Fri, Sep 04, 2015 at 06:12:34PM +1000, Dave Chinner wrote:
> You probably don't even need a VM to reproduce it - that would
> certainly be an interesting counterpoint if it didn't....

Even though you managed to restore your DEBUG_SPINLOCK performance by
changing virt_queued_spin_lock() to use __delay(1), I ran the thing on
actual hardware just to test.

[ Note: In any case, I would recommend you use (or at least try)
PARAVIRT_SPINLOCKS if you use VMs, as that is where we were looking for
performance; the test-and-set fallback really wasn't meant as a
performance option (although it clearly sucks worse than expected).

Pre qspinlock, your setup would have used regular ticket locks on
vCPUs, which mostly works as long as there is almost no vCPU
preemption. If you overload your machine such that the vCPU threads
get preempted, that will implode into silly-land. ]

So on to native performance:

- IVB-EX, 4-socket, 15 core, hyperthreaded, for a total of 120 CPUs
- 1.1T of md-stripe (5x200GB) SSDs
- Linux v4.2 (distro style .config)
- Debian "testing" base system
- xfsprogs v3.2.1

# mkfs.xfs -f -m "crc=1,finobt=1" /dev/md0
log stripe unit (524288 bytes) is too large (maximum is 256KiB)
log stripe unit adjusted to 32KiB
meta-data=/dev/md0               isize=512    agcount=32, agsize=9157504 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=293038720, imaxpct=5
         =                       sunit=128    swidth=640 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal log           bsize=4096   blocks=143088, version=2
         =                       sectsz=512   sunit=8 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

# mount -o logbsize=262144,nobarrier /dev/md0 /mnt/scratch

# ./fs_mark -D 10000 -S0 -n 50000 -s 0 -L 32 \
	-d /mnt/scratch/0 -d /mnt/scratch/1 \
	-d /mnt/scratch/2 -d /mnt/scratch/3 \
	-d /mnt/scratch/4 -d /mnt/scratch/5 \
	-d /mnt/scratch/6 -d /mnt/scratch/7 \
	-d /mnt/scratch/8 -d /mnt/scratch/9 \
	-d /mnt/scratch/10 -d /mnt/scratch/11 \
	-d /mnt/scratch/12 -d /mnt/scratch/13 \
	-d /mnt/scratch/14 -d /mnt/scratch/15

Regular v4.2 (qspinlock) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     286491.9          3500179
     0      7200000            0     293229.5          3963140
     0      8000000            0     271182.4          3708212
     0      8800000            0     300592.0          3595722

Modified v4.2 (ticket) does:

FSUse%        Count         Size    Files/sec     App Overhead
     0      6400000            0     310419.6          3343821
     0      7200000            0     348346.5          4721133
     0      8000000            0     328098.2          3235753
     0      8800000            0     316765.3          3238971

Which shows that qspinlock is clearly slower, even for these large-ish
NUMA boxes where it was supposed to be better.
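
To put a rough number on "clearly slower", the sketch below averages the
fourth column of each run above (fs_mark's files/sec figure) and compares
the two kernels. The variable names are made up for illustration, and four
samples per kernel is a small set, so treat the result as an indication
rather than a measurement with error bars.

```python
# Files/sec values copied from the two fs_mark runs quoted above
# (fourth column of each output line).
qspinlock = [286491.9, 293229.5, 271182.4, 300592.0]
ticket = [310419.6, 348346.5, 328098.2, 316765.3]


def mean(samples):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(samples) / len(samples)


qspin_mean = mean(qspinlock)
ticket_mean = mean(ticket)

# How far the qspinlock kernel falls short of the ticket-lock kernel,
# expressed as a percentage of the ticket-lock throughput.
slowdown_pct = 100.0 * (ticket_mean - qspin_mean) / ticket_mean

print(f"qspinlock mean: {qspin_mean:.1f} files/sec")
print(f"ticket mean:    {ticket_mean:.1f} files/sec")
print(f"qspinlock is {slowdown_pct:.1f}% slower on average")
```

On these samples the qspinlock kernel averages roughly 12% fewer files/sec
than the ticket-lock kernel.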

Clearly our benchmarks used before this were not sufficient, and more
work needs to be done.

Also, I note that after running to completion, there is only 14G of
actual data on the device, so you don't need silly large storage to run
this -- I expect your previous 275G quote was due to XFS populating the
sparse file with meta-data or something along those lines.

Further note, rm -rf /mnt/scratch0/*, takes for bloody ever :-)