[RFC 00/11] multiblock allocator improvements

From: Ojaswin Mujoo
Date: Fri Jan 27 2023 - 07:38:04 EST


This patchset addresses some shortcomings of the mb allocator
that we noticed while running various tests and workloads on a
POWERPC machine with 64k block size.

** Problems **

More specifically, we were seeing a sharp drop in performance when the
FS was highly fragmented (64K bs). We noticed that:

Problem 1: The prefetch logic seemed to be skipping BLOCK_UNINIT groups,
which resulted in the buddy and CR0/1 caches not being initialized for
these groups even though that could be done without any IO. (Not sure if
there is any history behind this design, do let me know if so).

Problem 2: With a 64K bs FS, we were commonly seeing cases where CR1
would correctly identify a good group but due to very high
fragmentation, complex scan would exit early due to ac->ac_found >
s_mb_max_to_scan, resulting in trimming of the allocated len.

Problem 3: Even though our avg free extent was, say, 4MB and the original
request was merely 1 block of data, mballoc normalization kept adding
PAs and requesting 8MB chunks. This led to almost all the requests
falling into the slower CR2, and with more threads we started seeing
lots of CR3 requests as well.

** How did we address them **

Problem 1 (Patch 8,9): Make ext4_mb_prefetch() also call
ext4_read_block_bitmap_nowait() in case of BLOCK_UNINIT, so it can init
the BG and exit early without any IO. Next, fix the calls to
ext4_mb_prefetch_fini() so these newly initialized BGs can have their
buddy initialised.
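
As a rough illustration of the behaviour change, here is a userspace toy
model (not the actual patch code). The names toy_group, toy_prefetch and
the enum values are made up for this sketch; only the BLOCK_UNINIT idea
comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What a prefetch pass decides to do for one block group (toy model). */
enum toy_prefetch_action {
	TOY_PREFETCH_SKIP,       /* leave the group alone */
	TOY_PREFETCH_INIT_NO_IO, /* build the bitmap in memory, no disk read */
	TOY_PREFETCH_SUBMIT_IO,  /* bitmap has to be read from disk */
};

/* Hypothetical stand-in for the per-group state that matters here. */
struct toy_group {
	bool block_uninit;  /* models the BLOCK_UNINIT flag */
	bool bitmap_cached; /* bitmap is already in memory */
};

/*
 * skip_uninit == true models the old behaviour described in Problem 1,
 * skip_uninit == false models the behaviour proposed by this patchset.
 */
enum toy_prefetch_action toy_prefetch(const struct toy_group *g,
				      bool skip_uninit)
{
	assert(g != NULL);

	if (g->bitmap_cached)
		return TOY_PREFETCH_SKIP;
	if (g->block_uninit)
		return skip_uninit ? TOY_PREFETCH_SKIP
				   : TOY_PREFETCH_INIT_NO_IO;
	return TOY_PREFETCH_SUBMIT_IO;
}
```

In this model an uncached BLOCK_UNINIT group is skipped under the old
behaviour and initialized without IO under the new one, while every
other case is unchanged.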

Problem 2 (Patch 7): When we come to complex_scan after CR1, my
understanding is that due to free/frag > goal_len, we can be sure that
there is at least one chunk big enough to accommodate the goal request.
Hence, we can skip the overhead of mb_find_extent() and other accounting
for each free extent and just process extents that are big enough.
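
To make the effect of the scan limit concrete, here is a simplified
userspace sketch (not the mballoc code). toy_scan_group and its
parameters are hypothetical; max_to_scan loosely stands in for
s_mb_max_to_scan and the considered counter for ac_found.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_scan_result {
	unsigned int best_len;   /* longest extent seen before the scan stopped */
	unsigned int considered; /* extents charged against the scan limit */
};

/*
 * Walk the free extents of one group. With skip_small == false every
 * extent counts against max_to_scan; with skip_small == true extents
 * shorter than goal_len are ignored and cost nothing.
 */
struct toy_scan_result toy_scan_group(const unsigned int *extent_len, size_t n,
				      unsigned int goal_len,
				      unsigned int max_to_scan, bool skip_small)
{
	struct toy_scan_result r = { 0, 0 };

	assert(extent_len != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		/* Proposed behaviour: ignore extents that cannot hold the goal. */
		if (skip_small && extent_len[i] < goal_len)
			continue;

		r.considered++;
		if (extent_len[i] > r.best_len)
			r.best_len = extent_len[i];

		if (r.best_len >= goal_len)
			break; /* goal satisfied */
		if (r.considered > max_to_scan)
			break; /* limit hit: caller ends up with a trimmed length */
	}
	return r;
}
```

With extent lengths { 1, 1, 1, 1, 8 }, a goal of 8 and a limit of 2, the
unfiltered scan gives up with a best length of 1, while the filtered
scan reaches the 8-block extent after considering a single candidate.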

Problem 3 (Patch 11): To solve this problem, this patchset implements a
new allocation criteria (CR1.5 or CR1_5 in code). The idea is that if
CR1 fails to find a BG, it will jump to CR1.5. Here the flow is as
follows:

* We make an assumption that if CR1 has failed, none of the
currently cached BGs have a big enough contiguous extent to satisfy
our request. In this case we fall to CR1.5.

* In CR 1.5, we find the highest available free/frag BGs (from CR1
lists) and trim the PAs to this order so that we can find
a BG without IO overhead of CR2.

* In parallel, prefetch will bring more groups into memory, and as more
and more groups are cached, CR1.5 becomes a better replacement for
CR2. This is because, for example, if all BGs are cached and we
couldn't find anything in CR0/1, we can assume that no BG has a big
enough contiguous free extent and hence CR1.5 can directly trim and
find the next biggest extent we could get. In this scenario, without
CR1.5, we would have continued scanning in CR2, which would have
most probably trimmed the request after scanning ~200 extents.
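
The trimming step in the flow above can be sketched roughly as follows.
This is a userspace toy, not the patch itself: toy_cr1_5_trim_goal is a
made-up name, and indexing the lists by power-of-two order of the
average fragment size (in blocks) is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/*
 * groups_per_order[o] holds the number of cached groups whose average
 * fragment size is about 2^o blocks (assumed list layout).
 *
 * Returns the goal length trimmed to the highest order that has at
 * least one cached group, or 0 if no group is cached at all.
 */
unsigned int toy_cr1_5_trim_goal(const unsigned int *groups_per_order,
				 int nr_orders, unsigned int goal_len)
{
	assert(groups_per_order != NULL);
	assert(nr_orders > 0 && nr_orders <= 31);

	for (int order = nr_orders - 1; order >= 0; order--) {
		unsigned int len;

		if (groups_per_order[order] == 0)
			continue; /* no cached group in this list */

		len = 1u << order;
		/* Never grow the request, only trim it. */
		return len < goal_len ? len : goal_len;
	}
	return 0; /* nothing cached: a slower criterion has to take over */
}
```

With 64K blocks, an 8MB goal is 128 blocks and a 4MB average fragment is
64 blocks (order 6). If the best populated list is order 6, the goal is
trimmed from 128 to 64 blocks, which matches the 8MB vs 4MB situation
described in Problem 3.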

CR1.5 results in improved allocation speed at the cost of slightly increased
trimming of the len of blocks allocated.

** Performance Numbers **

Unless stated otherwise, these numbers are from fsmark and fio tests with 64k
BS, 64K pagesize on 100Gi nvme0n1 with nodelalloc. These tests were performed
after the FS was fragmented until Avg Fragment Size == 4MB.

* Test 1: Writing ~40000 files of 64K each in a single directory (64 threads, fsmark)
* Test 2: Same as Test 1 on a 500GiB pmem device with dax
* Test 3: 5Gi write with mix of random and seq writes (fio)
* Test 4: 5Gi sequential writes (fio)

Here:
e = extents scanned
c = cr0 / cr1 / cr1.5 / cr2 / cr3 hits

+─────────+────────────────────────────────────+──────────────────────────────────────+
|         | Unpatched                          | Patched                              |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 1  | 6866 files/s                       | 13527 files/s                        |
|         | e: 8,188,644                       | e: 1,719,725                         |
|         | c: 381 / 330 / - / 4779 / 35534    | c: 381 / 280 / 33299 / 1000 / 6064   |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 2  | 6927 files/s                       | 8422 files/s                         |
|         | e: 8,055,911                       | e: 261,268                           |
|         | c: 1011 / 999 / - / 6153 / 32861   | c: 1721 / 1210 / 38093 / 0 / 0       |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 3  | 387 MiB/s                          | 443 MiB/s                            |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 4  | 3139 MiB/s                         | 3180 MiB/s                           |
+─────────+────────────────────────────────────+──────────────────────────────────────+
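
For reading the table, the relative change between the two columns can
be worked out with a small helper. percent_change is a hypothetical name
used only for this illustration.

```c
#include <assert.h>

/* Relative change from before to after, in percent. */
double percent_change(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

For Test 1, percent_change(6866, 13527) is roughly +97% in files/s, and
percent_change(8188644, 1719725) is roughly -79% in extents scanned.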

The numbers for the same tests with 4k bs and 64k pagesize are:

+─────────+────────────────────────────────────+──────────────────────────────────────+
|         | Unpatched                          | Patched                              |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 1  | 21618 files/s                      | 23528 files/s                        |
|         | e: 8,149,272                       | e: 223,013                           |
|         | c: 34 / 1380 / - / 5624 / 34710    | c: 34 / 1341 / 40387 / 0 / 0         |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 2  | 30739 files/s                      | 30946 files/s                        |
|         | e: 7,742,853                       | e: 2,176,475                         |
|         | c: 1131 / 2244 / - / 3914 / 34468  | c: 1596 / 1079 / 28425 / 1098 / 8547 |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 3  | 200 MiB/s                          | 186 MiB/s                            |
+─────────+────────────────────────────────────+──────────────────────────────────────+
| Test 4  | 621 MiB/s                          | 632 MiB/s                            |
+─────────+────────────────────────────────────+──────────────────────────────────────+

** Some Observations **

1. In the case of a highly fragmented 64k blocksize FS, most of the performance
is lost because we hold the BG lock while scanning a block group for the best
extent. As our goal len is 8MB and the free extents are only about 4MB, we take
a long time to scan, causing other threads to wait on the BG lock. This can be
seen in the perf diff of unpatched vs patched:

83.14% -24.89% [kernel.vmlinux] [k] do_raw_spin_lock

Using lockstat and the perf call graph I was able to confirm that this lock was the
BG lock taken in ext4_mb_regular_allocator(), contending with other processes trying
to take the same BG's lock in ext4_mb_regular_allocator() and __ext4_new_inode().

2. Currently, I do see some increase in fragmentation. Below are the
e2freefrag results after Test 1 with 64k BS:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Unpatched:

Min. free extent: 128 KB
Max. free extent: 8000 KB
Avg. free extent: 4096 KB
Num. free extent: 12630

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range : Free extents Free Blocks Percent
128K... 256K- : 1 2 0.00%
256K... 512K- : 1 6 0.00%
512K... 1024K- : 4 48 0.01%
1M... 2M- : 5 120 0.01%
2M... 4M- : 11947 725624 85.31%
4M... 8M- : 672 83796 9.85%

Patched:

Min. free extent: 64 KB
Max. free extent: 11648 KB
Avg. free extent: 2688 KB
Num. free extent: 18847

HISTOGRAM OF FREE EXTENT SIZES:
Extent Size Range : Free extents Free Blocks Percent
64K... 128K- : 1 1 0.00%
128K... 256K- : 2 5 0.00%
256K... 512K- : 1 5 0.00%
512K... 1024K- : 297 3909 0.48%
1M... 2M- : 11221 341065 42.13%
2M... 4M- : 4940 294260 36.35%
4M... 8M- : 2384 170169 21.02%
8M... 16M- : 1 182 0.02%

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

3. I was hoping to get some feedback on enabling prefetch of BLOCK_UNINIT
BGs and any history on why we disabled it.

-------------------------------------

Since these changes look good from my end, I am posting them for
feedback from the ext4 community.

(gcexfstests -c all quick went fine with no new failures reported)

Any thoughts/suggestions are welcome!!

Regards,
Ojaswin

Ojaswin Mujoo (8):
ext4: Convert mballoc cr (criteria) to enum
ext4: Add per CR extent scanned counter
ext4: Add counter to track successful allocation of goal length
ext4: Avoid scanning smaller extents in BG during CR1
ext4: Don't skip prefetching BLOCK_UNINIT groups
ext4: Ensure ext4_mb_prefetch_fini() is called for all prefetched BGs
ext4: Abstract out logic to search average fragment list
ext4: Add allocation criteria 1.5 (CR1_5)

Ritesh Harjani (IBM) (3):
ext4: mballoc: Remove useless setting of ac_criteria
ext4: Remove unused extern variables declaration
ext4: mballoc: Fix getting the right group desc in
ext4_mb_prefetch_fini

fs/ext4/ext4.h | 23 +++-
fs/ext4/mballoc.c | 284 +++++++++++++++++++++++++++++++++-------------
fs/ext4/mballoc.h | 27 ++++-
fs/ext4/super.c | 11 +-
fs/ext4/sysfs.c | 2 +
5 files changed, 255 insertions(+), 92 deletions(-)

--
2.31.1
