Re: [RFC PATCH 00/20] Introduce the famfs shared-memory file system

From: John Groves
Date: Thu Feb 29 2024 - 09:53:05 EST



Hi Dave!

On 24/02/29 01:15PM, Dave Chinner wrote:
> On Mon, Feb 26, 2024 at 08:05:58PM -0600, John Groves wrote:
> > On 24/02/26 04:58PM, Luis Chamberlain wrote:
> > > On Mon, Feb 26, 2024 at 1:16 PM John Groves <John@xxxxxxxxxx> wrote:
> > > >
> > > > On 24/02/26 07:53AM, Luis Chamberlain wrote:
> > > > > On Mon, Feb 26, 2024 at 07:27:18AM -0600, John Groves wrote:
> > > > > > Run status group 0 (all jobs):
> > > > > > WRITE: bw=29.6GiB/s (31.8GB/s), 29.6GiB/s-29.6GiB/s (31.8GB/s-31.8GB/s), io=44.7GiB (48.0GB), run=1511-1511msec
> > > > >
> > > > > > This is run on an xfs file system on a SATA ssd.
> > > > >
> > > > > To compare more closer apples to apples, wouldn't it make more sense
> > > > > to try this with XFS on pmem (with fio -direct=1)?
> > > > >
> > > > > Luis
> > > >
> > > > Makes sense. Here is the same command line I used with xfs before, but
> > > > now it's on /dev/pmem0 (the same 128G, but converted from devdax to pmem
> > > > because xfs requires that).
> > > >
> > > > fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=none --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
> > >
> > > Could you try with mkfs.xfs -d agcount=1024
>
> Won't change anything for the better, may make things worse.

I dropped that arg, though performance looked about the same either way.

>
> > bw ( MiB/s): min= 5085, max=27367, per=100.00%, avg=14361.95, stdev=165.61, samples=719
> > iops : min= 2516, max=13670, avg=7160.17, stdev=82.88, samples=719
> > lat (usec) : 4=0.05%, 10=0.72%, 20=2.23%, 50=2.48%, 100=3.02%
> > lat (usec) : 250=1.54%, 500=2.37%, 750=1.34%, 1000=0.75%
> > lat (msec) : 2=3.20%, 4=43.10%, 10=23.05%, 20=14.81%, 50=1.25%
>
> Most of the IO latencies are up round the 4-20ms marks. That seems
> kinda high for a 2MB IO. With a memcpy speed of 10GB/s, the 2MB
> should only take a couple of hundred microseconds. For Famfs, the
> latencies appear to be around 1-4ms.
>
> So where's all that extra time coming from?

Below you will see two runs with performance and latency distribution
about the same as famfs (the answer to that turned out to be --fallocate=native).

>
>
> > lat (msec) : 100=0.08%
> > cpu : usr=10.18%, sys=0.79%, ctx=67227, majf=0, minf=38511
>
> And why is system time reporting at almost zero instead of almost
> all the remaining cpu time (i.e. up at 80-90%)?

Something weird is going on with the cpu reporting. Sometimes sys is ~0, but other
times it's about what you would expect. I suspect some sort of measurement error;
maybe the method doesn't work with my cpu model? (I'm grasping, but with a
somewhat rational basis...)
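
One way I could cross-check is to watch system-wide CPU from outside fio while
the job runs, e.g. something along the lines of:

  mpstat 1    # per-CPU %usr/%sys once per second, from the sysstat package

If %sys shows up there but not in fio's summary, that would point at fio's
accounting rather than the kernel.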

I pasted two xfs runs below. The first has the wonky cpu sys value, and
the second looks about like what one would expect.

>
> Can you run call-graph kernel profiles for XFS and famfs whilst
> running this workload so we have some insight into what is behaving
> differently here?

Can you point me to an example of how to do that?
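
(I'm guessing something along the lines of

  perf record -a -g -- fio <same args as below>
  perf report

would capture what you're after, but if you have a preferred recipe I'll use that.)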

>
> -Dave.
> --
> Dave Chinner
> david@xxxxxxxxxxxxx


I'd been thinking about the ~2x gap for a few days, and the most obvious
difference is that famfs files must be preallocated (similar to fallocate, but
it works a bit differently since allocation happens in user space). I just
checked one of the xfs files, and it had around 80 extents (whereas the famfs
files always have 1 extent here).
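
(In case anyone wants to reproduce that check: the extent layout of an xfs file
can be listed with something like

  xfs_bmap -v <file>

or, filesystem-agnostically, with filefrag -v <file>; <file> is a placeholder
for one of the fio output files.)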

FWIW I ran xfs with and without io_uring, and there was no apparent
difference (which makes sense to me because it's not block I/O).

The prior ~2x gap still seems like a lot of overhead just for mapping the
extent list into memory, but adding --fallocate=native to the xfs test brought
it into line with famfs:


+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --ioengine=io_uring --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=io_uring, iodepth=1
..
fio-3.33
Starting 48 processes
Jobs: 38 (f=380): [W(5),_(1),W(12),_(1),W(3),_(1),W(2),_(1),W(2),_(1),W(1),_(1),W(1),_(1),W(6),_(1),W(6),_(2)][57.1%][w=28.0GiB/s][w=14.3k IOPS][eta 00m:03s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1452590: Thu Feb 29 07:46:06 2024
write: IOPS=15.3k, BW=29.8GiB/s (32.0GB/s)(114GiB/3838msec); 0 zone resets
slat (usec): min=17, max=55364, avg=668.20, stdev=1120.41
clat (nsec): min=1368, max=99619k, avg=1982477.32, stdev=2198309.32
lat (usec): min=179, max=99813, avg=2650.68, stdev=2485.15
clat percentiles (usec):
| 1.00th=[ 4], 5.00th=[ 14], 10.00th=[ 172], 20.00th=[ 420],
| 30.00th=[ 644], 40.00th=[ 1057], 50.00th=[ 1582], 60.00th=[ 2008],
| 70.00th=[ 2343], 80.00th=[ 3097], 90.00th=[ 4555], 95.00th=[ 5473],
| 99.00th=[ 8717], 99.50th=[11863], 99.90th=[20055], 99.95th=[27657],
| 99.99th=[49546]
bw ( MiB/s): min=20095, max=59216, per=100.00%, avg=35985.47, stdev=318.61, samples=280
iops : min=10031, max=29587, avg=17970.76, stdev=159.29, samples=280
lat (usec) : 2=0.06%, 4=1.02%, 10=2.33%, 20=4.29%, 50=1.85%
lat (usec) : 100=0.20%, 250=3.26%, 500=11.23%, 750=8.87%, 1000=5.82%
lat (msec) : 2=20.95%, 4=26.74%, 10=12.60%, 20=0.66%, 50=0.09%
lat (msec) : 100=0.01%
cpu : usr=15.48%, sys=1.17%, ctx=62654, majf=0, minf=22801
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=29.8GiB/s (32.0GB/s), 29.8GiB/s-29.8GiB/s (32.0GB/s-32.0GB/s), io=114GiB (123GB), run=3838-3838msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


## Here is a run where the cpu looks "normal"

+ fio -name=ten-256m-per-thread --nrfiles=10 -bs=2M --group_reporting=1 --alloc-size=1048576 --filesize=256MiB --readwrite=write --fallocate=native --numjobs=48 --create_on_open=0 --direct=1 --directory=/mnt/xfs
ten-256m-per-thread: (g=0): rw=write, bs=(R) 2048KiB-2048KiB, (W) 2048KiB-2048KiB, (T) 2048KiB-2048KiB, ioengine=psync, iodepth=1
..
fio-3.33
Starting 48 processes
Jobs: 19 (f=190): [W(2),_(1),W(2),_(8),W(1),_(3),W(1),_(1),W(2),_(2),W(1),_(1),W(3),_(2),W(1),_(1),W(1),_(2),W(2),_(7),W(3),_(1)][55.6%][w=26.7GiB/s][w=13.6k IOPS][eta 00m:04s]
ten-256m-per-thread: (groupid=0, jobs=48): err= 0: pid=1463615: Thu Feb 29 08:19:53 2024
write: IOPS=12.4k, BW=24.1GiB/s (25.9GB/s)(114GiB/4736msec); 0 zone resets
clat (usec): min=138, max=117903, avg=2581.99, stdev=2704.61
lat (usec): min=152, max=120405, avg=3019.04, stdev=2964.47
clat percentiles (usec):
| 1.00th=[ 161], 5.00th=[ 249], 10.00th=[ 627], 20.00th=[ 1270],
| 30.00th=[ 1631], 40.00th=[ 1942], 50.00th=[ 2089], 60.00th=[ 2212],
| 70.00th=[ 2343], 80.00th=[ 2704], 90.00th=[ 5866], 95.00th=[ 6849],
| 99.00th=[12387], 99.50th=[14353], 99.90th=[26084], 99.95th=[38536],
| 99.99th=[78119]
bw ( MiB/s): min=21204, max=47040, per=100.00%, avg=29005.40, stdev=237.31, samples=329
iops : min=10577, max=23497, avg=14479.74, stdev=118.65, samples=329
lat (usec) : 250=5.04%, 500=4.03%, 750=2.37%, 1000=3.13%
lat (msec) : 2=29.39%, 4=41.05%, 10=13.37%, 20=1.45%, 50=0.15%
lat (msec) : 100=0.03%, 250=0.01%
cpu : usr=14.43%, sys=78.18%, ctx=5272, majf=0, minf=15708
IO depths : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
complete : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
issued rwts: total=0,58560,0,0 short=0,0,0,0 dropped=0,0,0,0
latency : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
WRITE: bw=24.1GiB/s (25.9GB/s), 24.1GiB/s-24.1GiB/s (25.9GB/s-25.9GB/s), io=114GiB (123GB), run=4736-4736msec

Disk stats (read/write):
pmem0: ios=0/0, merge=0/0, ticks=0/0, in_queue=0, util=0.00%


Cheers,
John