Re: [PATCH] xfs: optimise xfs_mod_icount/ifree when delta < 0

From: Shaokun Zhang
Date: Wed Nov 06 2019 - 01:01:11 EST


Hi Dave,

On 2019/11/5 12:03, Dave Chinner wrote:
> On Tue, Nov 05, 2019 at 11:26:32AM +0800, Shaokun Zhang wrote:
>> Hi Dave,
>>
>> On 2019/11/5 4:49, Dave Chinner wrote:
>>> On Mon, Nov 04, 2019 at 07:29:40PM +0800, Shaokun Zhang wrote:
>>>> From: Yang Guo <guoyang2@xxxxxxxxxx>
>>>>
>>>> percpu_counter_compare will be called by xfs_mod_icount/ifree to check
>>>> whether the counter is less than 0, and it is an expensive function.
>>>> Let's check it only when delta < 0, which will be good for xfs's performance.
>>>
>>> Hmmm. I don't recall this as being expensive.
>>>
>>
>> Sorry about the misleading information in the commit message.
>>
>>> How did you find this? Can you please always document how you found
>>
>> If a user creates millions of files and then deletes them, we found that
>> __percpu_counter_compare cost 5.78% CPU usage. You are right that it is
>> not expensive itself, but it calls __percpu_counter_sum, which takes a
>> spin_lock and reads the other CPUs' counts. perf record -g was used to
>> profile it:
>>
>> - 5.88% 0.02% rm [kernel.vmlinux] [k] xfs_mod_ifree
>>    - 5.86% xfs_mod_ifree
>>       - 5.78% __percpu_counter_compare
>>            5.61% __percpu_counter_sum
>
> Interesting. Your workload is hitting the slow path, which I most
> certainly do not see when creating lots of files. What's your
> workload?
>

The hardware has 128 CPU cores, and the xfs filesystem was created with the
default format config. The test is single-threaded, as follows:
./mdtest -I 10 -z 6 -b 8 -d /mnt/ -t -c 2

xfs info:
meta-data=/dev/bcache2           isize=512    agcount=4, agsize=244188661 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1 spinodes=1 rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=976754644, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=476930, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

disk info:
Disk /dev/bcache2: 4000.8 GB, 4000787021824 bytes, 7814037152 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes

>>> IOWs, we typically measure the overhead of such functions by kernel
>>> profile. Creating ~200,000 inodes a second, so hammering the icount
>>> and ifree counters, I see:
>>>
>>> 0.16% [kernel] [k] percpu_counter_add_batch
>>> 0.03% [kernel] [k] __percpu_counter_compare
>>>
>>
>> 0.03% is just __percpu_counter_compare's usage.
>
> No, that's the total of _all_ the percpu counter functions captured
> by the profile - it was the list of all samples filtered by
> "percpu". I just re-ran the profile again, and got:
>
>
> 0.23% [kernel] [k] percpu_counter_add_batch
> 0.04% [kernel] [k] __percpu_counter_compare
> 0.00% [kernel] [k] collect_percpu_times
> 0.00% [kernel] [k] __handle_irq_event_percpu
> 0.00% [kernel] [k] __percpu_counter_sum
> 0.00% [kernel] [k] handle_irq_event_percpu
> 0.00% [kernel] [k] fprop_reflect_period_percpu.isra.0
> 0.00% [kernel] [k] percpu_ref_switch_to_atomic_rcu
> 0.00% [kernel] [k] free_percpu
> 0.00% [kernel] [k] percpu_ref_exit
>
> So you can see that there are essentially no samples in
> __percpu_counter_sum at all - my tests are not hitting the slow path
> at all, despite allocating inodes continuously.

Got it.

>
> IOWs, your workload is hitting the slow path repeatedly, and so the
> question that needs to be answered is "why is the slow path actually
> being exercised?". IOWs, we need to know what your workload is, what
> the filesystem config is, what hardware (cpus, storage, etc) you are
> running on, etc. There must be some reason for the slow path being
> used, and that's what we need to understand first before deciding
> what the best fix might be...
>
> I suspect that you are only running one or two threads creating

Yeah, we ran just a single-threaded test.

> files and you have lots of idle CPU and hence the inode allocation
> is not clearing the fast path batch threshold on the ifree counter.
> And because you have lots of CPUs, the cost of a sum is very
> expensive compared to running single threaded creates. That's my
> current hypothesis based on what I see on my workloads that
> xfs_mod_ifree overhead goes down as concurrency goes up....
>

Agreed. We added some debug output to xfs_mod_ifree and found that most of
the time m_ifree.count < batch * num_online_cpus(), because we have 128
online CPUs and m_ifree.count is around 999.


> FWIW, the profiles I took came from running this on 16 and 32p
> machines:
>
> --
> dirs=""
> for i in `seq 1 $THREADS`; do
> dirs="$dirs -d /mnt/scratch/$i"
> done
>
> cycles=$((512 / $THREADS))
>
> time ./fs_mark $XATTR -D 10000 -S0 -n $NFILES -s 0 -L $cycles $dirs
> --
>
> With THREADS=16 or 32 and NFILES=100000 on a big sparse filesystem
> image:
>
> meta-data=/dev/vdc               isize=512    agcount=500, agsize=268435455 blks
>          =                       sectsz=512   attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=1, rmapbt=0
>          =                       reflink=1
> data     =                       bsize=4096   blocks=134217727500, imaxpct=1
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
> log      =internal log           bsize=4096   blocks=521728, version=2
>          =                       sectsz=512   sunit=0 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> That's allocating enough inodes to keep the free inode counter
> entirely out of the slow path...

percpu_counter_read, which reads the count, incurs a cache synchronization
cost if another CPU has changed the count, so maybe it's better not to call
percpu_counter_compare at all where possible.
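
As a sketch of the idea only, here is a small userspace model of checking
the counter just when delta < 0. It is not the actual patch: struct
model_counter, model_compare and model_mod_ifree are hypothetical
stand-ins, and a plain integer replaces the percpu counter so that the
number of compare calls can be observed:

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the percpu counter. */
struct model_counter {
	int64_t value;		/* current count */
	unsigned int compares;	/* how often the compare was invoked */
};

/* Returns 1, 0 or -1 for value >, == or < rhs, and counts each call. */
int model_compare(struct model_counter *c, int64_t rhs)
{
	c->compares++;
	if (c->value > rhs)
		return 1;
	if (c->value < rhs)
		return -1;
	return 0;
}

/*
 * Apply delta, then check for underflow only when delta < 0. A
 * non-negative delta cannot take a non-negative count below zero, so
 * the compare is skipped in that case.
 */
int model_mod_ifree(struct model_counter *c, int64_t delta)
{
	c->value += delta;
	if (delta < 0 && model_compare(c, 0) < 0) {
		c->value -= delta;	/* undo the change */
		return -EINVAL;
	}
	return 0;
}
```

In this model, adding inodes never calls the compare, while removing them
still detects and rolls back an underflow.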

Thanks,
Shaokun

>
> Cheers,
>
> Dave.
>