Re: [PATCH v3 3/3] vfs: Use per-cpu list for superblock's inode list

From: Waiman Long
Date: Thu Feb 25 2016 - 09:44:07 EST


On 02/25/2016 03:06 AM, Ingo Molnar wrote:
* Jan Kara<jack@xxxxxxx> wrote:

With an exit microbenchmark that creates a large number of threads,
attachs many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a 4-socket
Intel E7-4820 v3 system (40 cores, 80 threads) were as follows:

Kernel Elapsed Time System Time
------ ------------ -----------
Vanilla 4.5-rc4 65.29s 82m14s
Patched 4.5-rc4 22.81s 23m03s

Before the patch, spinlock contention at the inode_sb_list_add() function
at the startup phase and the inode_sb_list_del() function at the exit
phase were about 79% and 93% of total CPU time respectively (as measured
by perf). After the patch, the percpu_list_add() function consumed only
about 0.04% of CPU time at startup phase. The percpu_list_del() function
consumed about 0.4% of CPU time at exit phase. There were still some
spinlock contention, but they happened elsewhere.
While looking through this patch, I have noticed that the
list_for_each_entry_safe() iterations in evict_inodes() and
invalidate_inodes() are actually unnecessary. So if you first apply the
attached patch, you don't have to implement safe iteration variants at all.

As a second comment, I'd note that this patch grows struct inode by 1
pointer. It is probably acceptable for large machines given the speedup but
it should be noted in the changelog. Furthermore for UP or even small SMP
systems this is IMHO undesired bloat since the speedup won't be noticeable.

So for these small systems it would be good if per-cpu list magic would just
fall back to single linked list with a spinlock. Do you think that is
reasonably doable?
Even many 'small' systems tend to be SMP these days.
Yes, I know. But my tablet with 4 ARM cores is unlikely to benefit from this
change either. [...]
I'm not sure about that at all, the above numbers are showing a 3x-4x speedup in
system time, which ought to be noticeable on smaller SMP systems as well.

Waiman, could you please post the microbenchmark?

Thanks,

Ingo

The microbenchmark that I used is attached.

I do agree that performance benefit will decrease as the number of CPUs get smaller. The system that I used for testing have 4 sockets with 40 cores (80 threads). Dave Chinner had run his fstests on a 16-core system (probably 2-socket) which showed modest improvement in performance (~4m40s vs 4m30s in runtime).

This patch enables parallel insertion and deletion to/from the inode list which used to be a serialized operation. So if that list operation is a bottleneck, you will see significant improvement. If it is not, we may not notice that much of a difference. For a single-socket 4-core system, I agree that the performance benefit, if any, will be limited.

Cheers,
Longman

/*
* This program is free software; you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation; either version 2 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* Authors: Waiman Long <waiman.long@xxxxxx>
*/
/*
* This is an exit test
*/
#include <ctype.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/syscall.h>


#define do_exit() syscall(SYS_exit)
#define gettid() syscall(SYS_gettid)
#define MAX_THREADS 2048

static inline void cpu_relax(void)
{
__asm__ __volatile__("rep;nop": : :"memory");
}

static inline void atomic_inc(volatile int *v)
{
__asm__ __volatile__("lock incl %0": "+m" (*v));
}

static volatile int exit_now = 0;
static volatile int threadcnt = 0;

/*
* Walk the /proc/<pid> filesystem to make them fill the dentry cache
*/
static void walk_procfs(void)
{
char cmdbuf[256];
pid_t tid = gettid();

snprintf(cmdbuf, sizeof(cmdbuf), "find /proc/%d > /dev/null 2>&1", tid);
if (system(cmdbuf) < 0)
perror("system() failed!");
}

static void *exit_thread(void *dummy)
{
long tid = (long)dummy;

walk_procfs();
atomic_inc(&threadcnt);
/*
* Busy wait until the do_exit flag is set and then call exit
*/
while (!exit_now)
sleep(1);
do_exit();
}

static void exit_test(int threads)
{
pthread_t thread[threads];
long i = 0, finish;
time_t start = time(NULL);

while (i++ < threads) {
if (pthread_create(thread + i - 1, NULL, exit_thread,
(void *)i)) {
perror("pthread_create");
exit(1);
}
#if 0
/*
* Pipelining to reduce contention & improve speed
*/
if ((i & 0xf) == 0)
while (i - threadcnt > 12)
usleep(1);
#endif
}
while (threadcnt != threads)
usleep(1);
walk_procfs();
printf("Setup time = %lus\n", time(NULL) - start);
printf("Process ready to exit!\n");
kill(0, SIGKILL);
exit(0);
}

int main(int argc, char *argv[])
{
int tcnt; /* Thread counts */
char *cmd = argv[0];

if ((argc != 2) || !isdigit(argv[1][0])) {
fprintf(stderr, "Usage: %s <thread count>\n", cmd);
exit(1);
}
tcnt = strtoul(argv[1], NULL, 10);
if (tcnt > MAX_THREADS) {
fprintf(stderr, "Error: thread count should be <= %d\n",
MAX_THREADS);
exit(1);
}
exit_test(tcnt);
return 0; /* Not reaachable */
}