On Mon 30-06-25 14:50:30, Baokun Li wrote:
> On 2025/6/28 2:31, Jan Kara wrote:
> > On Mon 23-06-25 15:32:52, Baokun Li wrote:
> > > When allocating data blocks, if the first try (goal allocation) fails and
> > > stream allocation is on, it tries a global goal starting from the last
> > > group we used (s_mb_last_group). This helps cluster large files together
> > > to reduce free space fragmentation, and the data block contiguity also
> > > accelerates write-back to disk.
> > > 
> > > However, when multiple processes allocate blocks, having just one global
> > > goal means they all fight over the same group. This drastically lowers
> > > the chances of extents merging and leads to much worse file fragmentation.
> > > 
> > > To mitigate this multi-process contention, we now employ multiple global
> > > goals, with the number of goals being the CPU count rounded up to the
> > > nearest power of 2. To ensure a consistent goal for each inode, we select
> > > the corresponding goal by taking the inode number modulo the total number
> > > of goals.
> > > 
> > > Performance test data follows:
> > > 
> > > Test: Running will-it-scale/fallocate2 on CPU-bound containers.
> > > Observation: Average fallocate operations per container per second.
> > > 
> > >                     | Kunpeng 920 / 512GB -P80| AMD 9654 / 1536GB -P96  |
> > > Disk: 960GB SSD     |-------------------------|-------------------------|
> > >                     | base  |     patched     | base  |     patched     |
> > > --------------------|-------|-----------------|-------|-----------------|
> > > mb_optimize_scan=0  | 7612  | 19699 (+158%)   | 21647 | 53093 (+145%)   |
> > > mb_optimize_scan=1  | 7568  | 9862 (+30.3%)   | 9117  | 14401 (+57.9%)  |
> > > 
> > > Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
> > > +/*
> > > + * Number of mb last groups
> > > + */
> > > +#ifdef CONFIG_SMP
> > > +#define MB_LAST_GROUPS roundup_pow_of_two(nr_cpu_ids)
> > > +#else
> > > +#define MB_LAST_GROUPS 1
> > > +#endif
> > > +
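
In other words, the scheme in the commit message boils down to something
like this userspace sketch (just an illustration of the slot selection;
the CPU count and inode number are made-up examples, and roundup_p2()
stands in for the kernel helper):

	#include <stdint.h>
	#include <stdio.h>

	/* Round up to the next power of two, as the patch does for the slot count. */
	static unsigned int roundup_p2(unsigned int n)
	{
		unsigned int p = 1;

		while (p < n)
			p <<= 1;
		return p;
	}

	int main(void)
	{
		unsigned int nr_cpus = 96;			/* example CPU count */
		unsigned int nr_goals = roundup_p2(nr_cpus);	/* 128 slots */
		uint64_t ino = 1234567;				/* example inode number */

		/* Power-of-two slot count, so the modulo is a simple mask. */
		unsigned int slot = ino & (nr_goals - 1);

		printf("inode %llu -> streaming goal slot %u of %u\n",
		       (unsigned long long)ino, slot, nr_goals);
		return 0;
	}
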
> > I think this is too aggressive. nr_cpu_ids is easily 4096 or similar for
> > distribution kernels (it is just a theoretical maximum for the number of
> > CPUs the kernel can support)
> nr_cpu_ids is generally equal to num_possible_cpus(). Only when
> CONFIG_FORCE_NR_CPUS is enabled will nr_cpu_ids be set to NR_CPUS,
> which represents the maximum number of supported CPUs.

Indeed, CONFIG_FORCE_NR_CPUS confused me.

> > which seems like far too much for small
> > filesystems with say 100 block groups.
> > I'd rather pick the array size like:
> > 
> > min(num_possible_cpus(), sbi->s_groups_count/4)
> > 
> > to
> > a) don't have too many slots so we still concentrate big allocations in
> > somewhat limited area of the filesystem (a quarter of block groups here).
> > b) have at most one slot per CPU the machine hardware can in principle
> > support.
> > 
> > 								Honza
> It does make sense.
> You're right, we should consider the number of block groups when setting
> the number of global goals.
> 
> However, a server's rootfs can often be quite small, perhaps only tens of
> GBs, while having many CPUs. In such cases, sbi->s_groups_count / 4 might
> still limit the filesystem's scalability.

I would not expect such a root filesystem to be loaded by many big
allocations in parallel :). And with 4k blocksize a 32GB filesystem would
already have 64 goals, which doesn't seem *that* limiting?
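
For concreteness, the arithmetic behind that "64 goals" figure as a tiny
sketch (assuming the default 32768 blocks per group, i.e. 128MB groups at
4k blocksize, and a 96-CPU machine as an example):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long fs_bytes = 32ULL << 30;	/* 32GB filesystem */
		unsigned long long grp_bytes = 128ULL << 20;	/* 128MB per group */
		unsigned int groups = fs_bytes / grp_bytes;	/* 256 groups */
		unsigned int cpus = 96;				/* example CPU count */
		unsigned int slots = groups / 4;		/* quarter of the groups */

		if (slots > cpus)				/* min(cpus, groups / 4) */
			slots = cpus;

		printf("%u groups -> %u streaming goal slots\n", groups, slots);
		return 0;
	}
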
Also note that as the filesystem is filling up and the free space is getting
fragmented, the number of groups where large allocation can succeed will
reduce. Thus regardless of how many slots for the streaming goal you have, they
will all end up pointing only to those several groups where large
allocation still succeeds. So although a large number of slots looks good for
an empty filesystem, the benefit for an aged filesystem is diminishing and
a larger number of slots will make the fs fragment faster.

> Furthermore, after supporting LBS, the number of block groups will
> sharply decrease.

Right. This is going to reduce scalability of block allocation in general.
Also, as the groups grow larger with larger blocksize, the benefit of
streaming allocation, which just gives a hint about which block group to use,
is going to diminish when the free block search will always be starting from
0. We will maybe need to store an ext4_fsblk_t (effectively combining
group+offset in a single atomic unit) as the streaming goal to mitigate this.
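
Roughly what I mean, as a userspace sketch (C11 atomics stand in for
whatever the kernel would actually use, blocks_per_group is a made-up
constant, and only a single goal slot is shown):

	#include <stdatomic.h>
	#include <stdint.h>
	#include <stdio.h>

	typedef uint64_t ext4_fsblk_t;

	static _Atomic ext4_fsblk_t stream_goal;	/* group + offset in one unit */
	static const uint32_t blocks_per_group = 32768;	/* 4k blocksize default */

	/* Remember where the last big allocation ended, as a single block number. */
	static void save_goal(uint32_t group, uint32_t offset)
	{
		ext4_fsblk_t blk = (ext4_fsblk_t)group * blocks_per_group + offset;

		atomic_store_explicit(&stream_goal, blk, memory_order_relaxed);
	}

	/* Read the goal back and split it into group and in-group offset. */
	static void load_goal(uint32_t *group, uint32_t *offset)
	{
		ext4_fsblk_t blk = atomic_load_explicit(&stream_goal,
							memory_order_relaxed);

		*group = blk / blocks_per_group;
		*offset = blk % blocks_per_group;
	}

	int main(void)
	{
		uint32_t group, offset;

		save_goal(17, 4096);
		load_goal(&group, &offset);
		printf("next search starts in group %u at offset %u\n",
		       group, offset);
		return 0;
	}

The point is that two allocations picking the same slot then continue from
where the previous big allocation ended instead of always rescanning the
group from block 0.
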
> How about we directly use sbi->s_groups_count (which would effectively be
> min(num_possible_cpus(), sbi->s_groups_count)) instead? This would also
> avoid zero values.

Avoiding zero values is definitely a good point. My concern is that if we
have sb->s_groups_count streaming goals, then practically each group will
become a streaming goal group and thus we can just remove the streaming
allocation altogether; there's no benefit.

We could make the streaming goal an ext4_fsblk_t so that the offset of the
last big allocation in the group is also recorded, as I wrote above. That would
tend to pack big allocations in each group together, which is beneficial for
combating fragmentation even with a higher proportion of groups that are
streaming goals (and likely becomes more important as the blocksize and thus
group size grow). We can discuss the proper number of slots for streaming
allocation (I'm not hung up on it being a quarter of the group count) but I'm
convinced sb->s_groups_count is too much :)

								Honza