RE:(2) [PATCH v2 2/2] f2fs: pack same-inode blocks by inode during FG_GC

From: Daejun Park

Date: Tue Jun 30 2026 - 00:25:15 EST

On 6/29/26 20:08, Chao Yu wrote:
> On 6/5/26 17:13, Daejun Park wrote:

Hi Chao,

Thanks for the review. Replies inline.

>> Natural FG_GC under tight cold migration (mkfs.f2fs -s 32, 2 GiB disk
>> 90 % fill, 6 hot x 200 MiB + 6 cold x 100 MiB interleaved, 300 s rewrite):
>> legacy cold extents 350 -> 357 (delta +7, no improvement)
>> packed cold extents 350 -> 132 (delta -218, -63 % reduction)
>> per user iter:
>> move_blks legacy 42344 packed 34822 (-18 %)
>> skipped_gc_rwsem legacy 108 packed 44 (-59 %)
>> hot rewrite iters in fixed 300 s window: +45 %
>
> Can you please explain why we can get above benefits except per-file defragment
> during FGGC? Not sure, due to in paralell rewriting/removing races w/ fggc on
> the same section?

The primary effect is per-file defragmentation; the move_blks and
skipped_gc_rwsem deltas follow from it. Grouping by inode writes each file's
surviving blocks as one contiguous run, so a destination section fills with
whole-file runs that share a lifetime instead of an interleave of many inodes.
An all-cold section is rarely re-invalidated by the hot rewrites, so it drops
out of the FG_GC victim pool and its cold blocks stop being re-collected; in the
legacy order the section mixes several inodes, the hotter ones keep partially
invalidating it, it is GC'd again, and the cold blocks are dragged along -- that
repeated re-collection is the extra move_blks.

skipped_gc_rwsem (per-block i_gc_rwsem trylock failures against an inode a writer
holds) drops for the same reason: fewer migrations overall, and increasingly
all-cold victims hold fewer blocks of the inodes being actively rewritten. Lock
granularity is unchanged (still per block). So your intuition about the
interaction with the parallel rewrite is right, but the root cause is the
layout, not locking.

> May be you can provide scripts in somewhere? :)

Sure -- core reproducer below; run it once with arg 0 and once with 1:

#!/bin/bash
# natural-FG_GC packing A/B: -s 32, cold+hot interleaved ~90%, 300s rewrite. arg = packing 0|1
DEV=/dev/sdX; MNT=/mnt/test; PACK=$1
mkfs.f2fs -f -s 32 "$DEV" >/dev/null
mount -t f2fs -o background_gc=sync "$DEV" "$MNT"; mkdir -p "$MNT/d"
echo "$PACK" > /sys/fs/f2fs/$(basename $DEV)/gc_inode_local_packing
for c in $(seq 0 199); do # 6 hot x200M + 6 cold x100M, 1M at a time -> mixed sections
for i in 1 2 3 4 5 6; do dd if=/dev/urandom of=$MNT/d/hot_$i bs=1M count=1 seek=$c oflag=sync status=none; done
[ $c -lt 100 ] && for i in 1 2 3 4 5 6; do dd if=/dev/urandom of=$MNT/d/cold_$i bs=1M count=1 seek=$c oflag=sync status=none; done
done; sync
pre=$(filefrag $MNT/d/cold_* | grep -oE '[0-9]+ extent' | awk '{s+=$1}END{print s}')
end=$(( $(date +%s) + 300 )) # hot rewrite invalidates hot -> FG_GC picks mixed -> migrates cold
while [ $(date +%s) -lt $end ]; do
for i in 1 2 3 4 5 6; do dd if=/dev/urandom of=$MNT/d/hot_$i bs=1M count=200 oflag=sync status=none & done; wait
done; sync
post=$(filefrag $MNT/d/cold_* | grep -oE '[0-9]+ extent' | awk '{s+=$1}END{print s}')
echo "pack=$PACK cold_extents $pre -> $post move_blks=$(awk '/Try to move/{print $4}' /sys/kernel/debug/f2fs/status) skipped=$(cat /sys/fs/f2fs/$(basename $DEV)/total_skipped_gc_rwsem)"
umount "$MNT"

I can also add an A/B with the concurrent writers enabled vs serialized to
isolate the concurrency contribution you mentioned.

>> +#define MAX_GC_PACK_BLOCKS 4096
>
> What do you think of introducing a sysfs for this parameter?

Sure -- I'll expose it as an RW sysfs knob in v3 (default 4096), like
migration_granularity.

>> + if (pack_by_inode)
>> + submitted += pack_gc_section(sbi, gc_list, gc_type);
>> +
>> stop:
>
> Should we drop all items in gc_block list belong to current sections for
> anyone jumps to 'stop' label? so that in next round of section migration,
> we can avoid to touch block list in previous section.

Good point. f2fs_gc() reuses one gc_list across sections within a single call
and only calls put_gc_inode() at the very end. On the normal path that's fine
because pack_gc_section() drains every queued record and resets nr_gc_blocks
before do_garbage_collect() returns. But on the freezing 'goto stop' path the
drain is skipped, so the records survive on the inode_entries; if freezing
clears before f2fs_gc()'s next freeze check, the next section's pack pass would
walk those stale records and migrate blocks belonging to the previous section.

So v3 will drop the section's queued gc_blocks on that path (a small helper that
frees them and resets nr_gc_blocks) so each do_garbage_collect() leaves the list
clean.

>> + if (pack_by_inode) {
>> + unsigned int seg;
>> +
>> + for (seg = start_segno; seg < end_segno; seg++)
>> + if (get_valid_blocks(sbi, seg, false) == 0)
>> + seg_freed++;
>> + }
>
> Is it possible there is empty segment previously? we should not account it
> into seg_freed?

It can, but this matches legacy and is required for the count that matters. In
the original loop an already-empty segment also takes 'goto freed' and is
counted by the same seg_freed++; the recompute reproduces that after the
deferred migration. It has to include them, because sec_freed is bumped only
when seg_freed == f2fs_usable_segs_in_sec() (the whole section empty); excluding
already-empty segments would under-count a fully-reclaimable section and make a
sync F2FS_IOC_GC return -EAGAIN, the regression this v2 fixes.

You're right that skewing the tracepoint is itself a downside, though. The skew
is only on the freezing path: the recompute sits after the stop: label, so a
frozen GC -- which migrated nothing -- still scans the full range and can report
pre-existing-empty segments as freed in total_freed (the -EAGAIN decision is
unaffected; it goes through total_sec_freed, gated by the equality). v3 will run
the recompute only on the normal completion path (move it above stop:), so a
frozen GC no longer inflates the trace; on the normal path the count already
equals what legacy reports.

>> + sbi->gc_inode_local_packing = __is_large_section(sbi);
>
> I don't think we should enable a new feature by default, due to it brings
> bugs commonly.

Agreed. v3 will default it to 0 (disabled) and let users opt in via the sysfs
knob; I'll update the ABI doc wording accordingly.

Thanks,
Daejun