Re: [PATCH] ext4: add optional rotating block allocation policy
From: Mario Lohajner
Date: Sat Feb 07 2026 - 07:55:29 EST
On 2/7/26 06:31, Theodore Ts'o wrote:
On Fri, Feb 06, 2026 at 08:25:24PM +0100, Mario Lohajner wrote:
What is observable in practice, however, is persistent allocation locality
near the beginning of the LBA space under real workloads, and a
corresponding concentration of wear in that area; interestingly, it seems
to be vendor-agnostic. The force within is very strong :-)
This is simply not true. Data blocks are *not* confined to the
low-numbered LBA's in any kind of reasonable real-world situation. Why do
you think this is true, and what was your experiment that led you to
believe this?
Let me show you *my* experiment:
root@kvm-xfstests:~# /sbin/mkfs.ext4 -qF /dev/vdc 5g
root@kvm-xfstests:~# mount /dev/vdc /vdc
[ 171.091299] EXT4-fs (vdc): mounted filesystem 06dd464f-1c3a-4a2b-b3dd-e937c1e7624f r/w with ordered data mode. Quota mode: none.
root@kvm-xfstests:~# tar -C /vdc -xJf /vtmp/ext4-6.12.tar.xz
root@kvm-xfstests:~# ls -li /vdc
total 1080
31018 -rw-r--r-- 1 15806 15806 496 Dec 12 2024 COPYING
347 -rw-r--r-- 1 15806 15806 105095 Dec 12 2024 CREDITS
31240 drwxr-xr-x 75 15806 15806 4096 Dec 12 2024 Documentation
31034 -rw-r--r-- 1 15806 15806 2573 Dec 12 2024 Kbuild
31017 -rw-r--r-- 1 15806 15806 555 Dec 12 2024 Kconfig
30990 drwxr-xr-x 6 15806 15806 4096 Dec 12 2024 LICENSES
323 -rw-r--r-- 1 15806 15806 781906 Dec 1 21:34 MAINTAINERS
19735 -rw-r--r-- 1 15806 15806 68977 Dec 1 21:34 Makefile
14 -rw-r--r-- 1 15806 15806 726 Dec 12 2024 README
1392 drwxr-xr-x 23 15806 15806 4096 Dec 12 2024 arch
669 drwxr-xr-x 3 15806 15806 4096 Dec 1 21:34 block
131073 drwxr-xr-x 2 15806 15806 4096 Dec 12 2024 certs
31050 drwxr-xr-x 4 15806 15806 4096 Dec 1 21:34 crypto
143839 drwxr-xr-x 143 15806 15806 4096 Dec 12 2024 drivers
140662 drwxr-xr-x 81 15806 15806 4096 Dec 1 21:34 fs
134043 drwxr-xr-x 32 15806 15806 4096 Dec 12 2024 include
31035 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 init
140577 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 io_uring
140648 drwxr-xr-x 2 15806 15806 4096 Dec 1 21:34 ipc
771 drwxr-xr-x 22 15806 15806 4096 Dec 1 21:34 kernel
143244 drwxr-xr-x 20 15806 15806 12288 Dec 1 21:34 lib
11 drwx------ 2 root root 16384 Feb 6 16:34 lost+found
22149 drwxr-xr-x 6 15806 15806 4096 Dec 1 21:34 mm
19736 drwxr-xr-x 72 15806 15806 4096 Dec 12 2024 net
42649 drwxr-xr-x 7 15806 15806 4096 Dec 1 21:34 rust
349 drwxr-xr-x 42 15806 15806 4096 Dec 12 2024 samples
42062 drwxr-xr-x 19 15806 15806 12288 Dec 1 21:34 scripts
15 drwxr-xr-x 15 15806 15806 4096 Dec 1 21:34 security
131086 drwxr-xr-x 27 15806 15806 4096 Dec 12 2024 sound
22351 drwxr-xr-x 45 15806 15806 4096 Dec 12 2024 tools
31019 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 usr
324 drwxr-xr-x 4 15806 15806 4096 Dec 12 2024 virt
Note how different directories have different inode numbers, which are
in different block groups. This is how we naturally spread block
allocations across different block groups. This is *specifically* to
spread block allocations across the entire storage device. So for example:
root@kvm-xfstests:~# filefrag -v /vdc/arch/Kconfig
Filesystem type is: ef53
File size of /vdc/arch/Kconfig is 51709 (13 blocks of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 12: 67551.. 67563: 13: last,eof
/vdc/arch/Kconfig: 1 extent found
root@kvm-xfstests:~# filefrag -v /vdc/sound/Makefile
Filesystem type is: ef53
File size of /vdc/sound/Makefile is 562 (1 block of 4096 bytes)
ext: logical_offset: physical_offset: length: expected: flags:
0: 0.. 0: 574197.. 574197: 1: last,eof
/vdc/sound/Makefile: 1 extent found
See? They are not concentrated in low-numbered LBA's. Quod Erat Demonstrandum.
By the way, spreading block allocations across LBA's was not done
because of a concern about flash storage. The ext2, ext3, and ext4
file systems have had this support for over a quarter of a century,
because spreading the blocks across the file system avoids file
fragmentation. It's a technique that we took from BSD's Fast File
System, called the Orlov algorithm. For more information, see [1], or
the ext4 sources [2].
[1] https://en.wikipedia.org/wiki/Orlov_block_allocator
[2] https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/tree/fs/ext4/ialloc.c#n398
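The spreading behavior described above can be sketched with a toy model
(a deliberate simplification; the real logic lives in fs/ext4/ialloc.c
and is considerably more involved): new top-level directories are placed
in different block groups, and a file's data is allocated near its parent
directory's group, so data naturally scatters across the device. The
group count and stride below are arbitrary illustration values, not
ext4's actual policy:

```python
# Toy sketch of Orlov-style inode/block spreading (NOT the real ext4
# code): top-level directories land in different block groups, and
# regular files are placed near their parent directory's group.
GROUPS = 40              # hypothetical number of block groups
BLOCKS_PER_GROUP = 32768

def pick_group_for_dir(dir_index):
    # Spread new top-level directories across all groups.
    return (dir_index * 7) % GROUPS   # 7 is an arbitrary stride

def pick_group_for_file(parent_group):
    # Files are allocated in (or near) the parent directory's group.
    return parent_group

dirs = {name: pick_group_for_dir(i)
        for i, name in enumerate(["arch", "sound", "fs", "mm"])}
for name, grp in dirs.items():
    first_block = pick_group_for_file(grp) * BLOCKS_PER_GROUP
    print(f"{name:6s} -> group {grp:2d}, data blocks start near block {first_block}")
```

Even this crude model reproduces the effect visible in the filefrag
output above: files under different top-level directories get physical
blocks in widely separated regions of the device.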
My concern is a potential policy interaction: filesystem locality
policies tend to concentrate hot metadata and early allocations. During
deallocation, we naturally discard/trim those blocks ASAP to make them
ready for write, thus optimizing for speed, while at the same time signaling
them as free. Meanwhile, an underlying WL policy (if present) tries to
consume free blocks opportunistically.
If these two interact poorly, the result can be a sustained bias toward
low-LBA hot regions (as observable in practice).
The elephant in the room is called “wear” / hotspots at the LBA start.
First of all, most of the "sustained bias towards low-LBA regions" is
not because of where data blocks are located, but because of the
location of static metadata blocks: in particular, the superblock, the
block group descriptors, and the allocation bitmaps. Having static metadata is
not unique to ext2/ext3/ext4. The FAT file system has the File
Allocation Table in low numbered LBA's, which are constantly updated
whenever blocks are allocated. Even log-structured file systems, such
as btrfs, f2fs, and ZFS, have a superblock at a static location which
gets rewritten at every file system commit.
Secondly, *because* all file systems rewrite certain LBA's, and because
of how flash erase blocks work, pretty much all flash translation layers
for the past two decades are *designed* to be able to deal with it.
Because of digital cameras and the FAT file system, pretty much all
flash storage does *not* have a static mapping between a particular LBA
and a specific set of flash cells. The fact that you keep asserting
that "hotspots at the LBA start" are a problem indicates to me that you
don't understand how SSD's work in real life.
So I commend to you these two articles:
https://flashdba.com/2014/06/20/understanding-flash-blocks-pages-and-program-erases/
https://flashdba.com/2014/09/17/understanding-flash-the-flash-translation-layer/
These web pages date from 12 years ago; SSD technology is, in 2026,
very old technology in an industry where two years == infinity.
For a more academic perspective, there's the paper presented at the
2009 First International Conference on Advances in System Simulation
by researchers from Pennsylvania State University:
https://www.cse.psu.edu/~buu1/papers/ps/flashsim.pdf
FlashSim is available as open source, and has since been used by
many other researchers to explore improvements in flash translation
layers. And even the most basic FTL algorithms mean that your proposed
RotAlloc is ***pointless***. If you think otherwise, you're going to
need to provide convincing evidence.
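The dynamic-mapping property at issue here can be illustrated with a
minimal page-mapped FTL model (an illustrative sketch only, far simpler
than FlashSim and not any vendor's actual algorithm): every write to an
LBA is redirected to a fresh physical page and the logical-to-physical
map is updated, so hammering one logical block spreads wear across many
physical pages instead of one:

```python
# Minimal page-mapped FTL sketch (illustrative only, not a real FTL):
# every write to an LBA goes to the next free physical page and the
# logical-to-physical map is updated, so a "hot" LBA does not pin any
# particular set of flash cells.
class ToyFTL:
    def __init__(self, num_pages):
        self.l2p = {}                  # logical -> physical map
        self.writes = [0] * num_pages  # per-physical-page write counts
        self.next_free = 0
        self.num_pages = num_pages

    def write(self, lba):
        phys = self.next_free
        self.next_free = (self.next_free + 1) % self.num_pages
        self.l2p[lba] = phys           # remap; the old page becomes garbage
        self.writes[phys] += 1

ftl = ToyFTL(num_pages=64)
for _ in range(640):                   # hammer the same logical block
    ftl.write(lba=0)
print("max per-page writes:", max(ftl.writes))  # -> 10, not 640
```

Garbage collection and erase-block grouping are omitted entirely, but
the core point survives: the LBA a file system rewrites has no fixed
relationship to the flash cells that wear out.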
Hi Ted,
Let me try to clarify this in a way that avoids talking past each other.
I fully agree with the allocator theory, the Orlov algorithm, and with
your demonstration.
I am not disputing *anything*, nor have I ever intended to.
The pattern I keep referring to as “observable in practice” is about
repeated free -> reallocate cycles, allocator restart points, and reuse
bias - i.e., which regions of the address space are revisited most
frequently over time.
Again, we’re not focusing solely on wear leveling here, but since we
can’t influence the WL implementation itself, the only lever we have is
our own allocation policy.
You claim that you're not focusing on wear leveling, but every single
justification for your changes references "wear / hotspotting". I'm
trying to tell you that it's not an issue. If you think it *could* be
an issue, *ever*, you need to provide *proof* --- at the very least,
proof that you understand things like how flash erase blocks work, how
flash translation layers work, and how the orlov block allocation
algorithm works. Because with all due respect, it appears that you
are profoundly ignorant, and it's not clear why we should be
respecting your opinion and your arguments. If you think we should,
you really need to up your game.
Regards,
- Ted
Although I admitted being WL-inspired right from the start, I maintain that *this is not* wear leveling - WL deals with reallocations, translations, amplification history... This simply *is not* that.
Calling it "wear leveling" would be like an election promise - it might, but probably won’t, come true.
The question I’m raising is much narrower: whether allocator
policy choices can unintentionally reinforce reuse patterns under
certain workloads - and whether offering an *alternative policy* is
reasonable (I dare say, in some cases more optimal).
I was consciously avoiding turning this into a “your stats vs. my stats”
and/or “your methods vs. my methods” discussion.
However, to avoid arguing from theory alone, I will follow up with a
small set of real-world examples.
https://github.com/mlohajner/elephant-in-the-room
These are snapshots from different systems, illustrating the point I’m presenting here. Provided as-is, without annotations; while they do not show the allocation bitmap explicitly, they are statistically correlated with the most frequently used blocks/groups across the LBA space.
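To spell out the churn pattern in question, here is a toy free/reallocate
simulation (my own illustrative model with made-up parameters; it is not
the RotAlloc patch and not the real ext4 allocator): a first-fit policy
that always restarts its scan at block 0 concentrates reallocations in
the low block range, while a rotating (next-fit) policy sweeps the whole
range:

```python
import random

# Toy alloc/free churn model (illustrative only): compare how often each
# block is (re)allocated under a first-fit policy (always scan from 0)
# versus a rotating next-fit policy (resume scanning past the last
# allocation).
NBLOCKS = 256

def simulate(rotating, rounds=5000, working_set=64, seed=1):
    rng = random.Random(seed)
    free = [True] * NBLOCKS
    alloc_count = [0] * NBLOCKS
    held = []
    cursor = 0
    for _ in range(rounds):
        # Allocate the next free block under the given policy.
        start = cursor if rotating else 0
        for off in range(NBLOCKS):
            b = (start + off) % NBLOCKS
            if free[b]:
                free[b] = False
                alloc_count[b] += 1
                held.append(b)
                cursor = (b + 1) % NBLOCKS
                break
        # Keep a bounded working set: free a random held block (churn).
        while len(held) > working_set:
            free[held.pop(rng.randrange(len(held)))] = True
    return alloc_count

ff = simulate(rotating=False)
rot = simulate(rotating=True)
print("first-fit: low-64 reuse share =", sum(ff[:64]) / sum(ff))
print("rotating:  low-64 reuse share =", sum(rot[:64]) / sum(rot))
```

Under this toy model the first-fit policy places nearly all
reallocations in the lowest blocks, while the rotating policy spreads
them roughly evenly across the range; whether anything similar survives
the FTL's remapping underneath is exactly the open question in this
thread.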
Given that another maintainer has already expressed support for making
this an *optional policy, disabled by default*, I believe this discussion
is less about allocator theory correctness and more about whether
accommodating real-world workload diversity is desirable.
Regards,
Mario
P.S.
I'm so altruistic I dare say this out loud:
At this point, my other concern is this: if we reach common ground and make it optional, and it truly helps more than it hurts, who will actually ever use it? :-)
(Assuming end users even know it exists, so that adopting it feels like a natural progression/improvement.)