On 02/24/2016 03:28 AM, Jan Kara wrote:
On Tue 23-02-16 14:04:32, Waiman Long wrote:
When many threads are trying to add or delete inode to or fromWhile looking through this patch, I have noticed that the
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.
This patch changes the s_inodes field to become a per-cpu list with
per-cpu spinlocks. As a result, the following superblock inode list
(sb->s_inodes) iteration functions in vfs are also being modified:
1. iterate_bdevs()
2. drop_pagecache_sb()
3. wait_sb_inodes()
4. evict_inodes()
5. invalidate_inodes()
6. fsnotify_unmount_inodes()
7. add_dquot_ref()
8. remove_dquot_ref()
With an exit microbenchmark that creates a large number of threads,
attachs many inodes to them and then exits. The runtimes of that
microbenchmark with 1000 threads before and after the patch on a
4-socket Intel E7-4820 v3 system (40 cores, 80 threads) were as
follows:
Kernel Elapsed Time System Time
------ ------------ -----------
Vanilla 4.5-rc4 65.29s 82m14s
Patched 4.5-rc4 22.81s 23m03s
Before the patch, spinlock contention at the inode_sb_list_add()
function at the startup phase and the inode_sb_list_del() function at
the exit phase were about 79% and 93% of total CPU time respectively
(as measured by perf). After the patch, the percpu_list_add()
function consumed only about 0.04% of CPU time at startup phase. The
percpu_list_del() function consumed about 0.4% of CPU time at exit
phase. There were still some spinlock contention, but they happened
elsewhere.
list_for_each_entry_safe() iterations in evict_inodes() and
invalidate_inodes() are actually unnecessary. So if you first apply the
attached patch, you don't have to implement safe iteration variants at all.
Thank for the patch. I will apply that in my next update. As for the safe iteration variant, I think I will keep it since I had implemented that already just in case it may be needed in some other places.
As a second comment, I'd note that this patch grows struct inode by 1
pointer. It is probably acceptable for large machines given the speedup but
it should be noted in the changelog. Furthermore for UP or even small SMP
systems this is IMHO undesired bloat since the speedup won't be noticeable.
So for these small systems it would be good if per-cpu list magic would just
fall back to single linked list with a spinlock. Do you think that is
reasonably doable?
I already have a somewhat separate code path for UP. So I can remove the lock pointer for that. For small SMP system, however, the only way to avoid the extra pointer is to add a config parameter to turn this feature off. That can be added as a separate patch, if necessary.