Re: [PATCH] ext4: add optional rotating block allocation policy
From: Mario Lohajner
Date: Fri Feb 06 2026 - 14:37:43 EST
On 06. 02. 2026. 02:42, Theodore Tso wrote:
> On Thu, Feb 05, 2026 at 01:23:18PM +0100, Mario Lohajner wrote:
>> Greetings Ted,
>> Let me briefly restate the intent, focusing on the fundamentals.
>> Rotalloc is not wear leveling (and is intentionally not named as such).
>> It is an allocation policy whose goal is to reduce allocation hotspots by
>> enforcing mount-wide sequential allocation. Wear leveling, if any,
>> remains a device/firmware concern and is explicitly out of scope.
>> While WL motivated part of this work,
> Yes, but *why* are you trying to reduce allocation hotspots? What
> problem are you trying to solve? And actually, you are making
> allocation hotspots *worse* since with a global cursor, by definition
> there is a single, super-hotspot. This will cause scalability issues
> on a system with multiple CPU's trying to write in parallel.
First off, apologies for the delayed reply; your emails somehow ended up in my spam folder! I hope this doesn’t happen again.
Also, sorry for the lengthy responses; I want to make my points clear.
I’m not proposing that ext4 should implement or control wear leveling.
WL may or may not exist below the FS layer; either way, it is opaque to us.
What is observable in practice, however, is persistent allocation locality near the beginning of the LBA space under real workloads, and a corresponding concentration of wear in that area; interestingly, this appears to be vendor-agnostic. The force within is very strong :-)
The elephant:
My concern is a potential policy interaction: filesystem locality
policies tend to concentrate hot metadata and early allocations. During
deallocation, we naturally discard/trim those blocks ASAP to make them
ready for writes (optimizing for speed), while at the same time signaling
them as free. Meanwhile, an underlying WL policy (if present) tries to
consume free blocks opportunistically.
If these two interact poorly, the result can be a sustained bias toward
low-LBA hot regions, as is observable in practice.
The elephant is in the room, and it is called “wear”/hotspots at the start of the LBA space.
The main added value of this patch is allocator separation.
The policy indirection (aka vectored allocator) allows allocation
strategies that are orthogonal to the regular allocator to operate
outside the hot path, preserving existing heuristics and improving
maintainability.
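As a concrete illustration, the indirection could look roughly like the sketch below. All identifiers here are invented for this example (this is not the patch's actual code): the mount state carries a function-pointer "vector", so a policy such as rotalloc plugs in beside the regular allocator instead of adding branches to its hot path.

```c
/* Hypothetical sketch of the "vectored allocator" indirection.
 * Names and structures are invented for illustration only. */

struct alloc_request {
        unsigned int goal_group;   /* block group where the search starts */
        unsigned int len;          /* number of blocks requested */
};

/* One entry point per policy, dispatched through a function pointer. */
typedef unsigned int (*alloc_fn)(struct alloc_request *req);

/* Stand-in for the existing allocator: honour the caller's goal. */
static unsigned int regular_alloc(struct alloc_request *req)
{
        return req->goal_group;
}

/* Stand-in for rotalloc: fix the goal from a mount-wide cursor,
 * then delegate to the unchanged regular path. */
static unsigned int rotating_alloc(struct alloc_request *req)
{
        static unsigned int cursor;        /* assumes 16 block groups */
        req->goal_group = cursor++ % 16;
        return regular_alloc(req);
}

struct mount_policy {
        alloc_fn allocate;                 /* the "vector" */
};

unsigned int do_alloc(struct mount_policy *p, struct alloc_request *req)
{
        return p->allocate(req);
}
```

The point of the shape above is that neither policy's code path conditions on the other's existence; selecting a policy is a one-time mount decision, not a per-allocation branch.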
> Allocator separation is not necessarily an unalloyed good thing.
> By having duplicated code, it means that if we need to make a change
> in infrastructure code, we might now need to make it in multiple code
> paths. It is also one more code path that we have to test and
> maintain. So there is a real cost from the perspective of upstream
> maintenance.
My goal was to keep the regular allocator intact and trivially clean.
Baokun noticed this well: I’m using all the existing heuristics; the only
tweak I make is to ‘fix the goal’ (i.e., set where the search starts), which then
advances sequentially toward the region most likely to contain empty,
unused space, at which point allocations become nearly instantaneous.
Being orthogonal in principle, the two allocators/policies are meant to live independently of each other.
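To make the ‘fix the goal’ step concrete, here is a minimal sketch of a mount-wide rotating cursor, with invented names and an assumed group count (again, not the patch's actual code): the policy only chooses where the search starts and then advances, sweeping the groups sequentially and wrapping at the end.

```c
#include <stdatomic.h>

/* Hypothetical sketch of a mount-wide rotating goal cursor.
 * NR_GROUPS and all names are assumptions for this demo. */

#define NR_GROUPS 8u            /* assumed number of block groups */

static atomic_uint rot_cursor;  /* mount-wide cursor, wraps around */

/* Return the goal group for this allocation and advance the cursor,
 * so the next allocation starts in the following group. */
unsigned int rotalloc_fix_goal(void)
{
        return atomic_fetch_add(&rot_cursor, 1) % NR_GROUPS;
}
```

Note that the single shared `rot_cursor` in this sketch is precisely the mount-wide serialization point that the scalability concern above refers to; any refinement of the policy would have to address contention on it.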
Alternatively, we could drop the separation entirely and add a few
conditional branches to the regular allocator to the same effect,
but that introduces overhead, potential branch mispredictions, and all the
associated shenanigans (minor, but not insignificant).
Separation avoids that, at the minimal cost of maintaining 20-odd extra
lines of code.
(memory we have; time is scarce)
> Also, because having a single global allocation point (your "cursor")
> is going to absolutely *trash* performance, especially for high speed
> NVMe devices connected to high count CPU's, it's not clear to me why
> performance is necessary for rotalloc.
The rotating allocator itself is a working prototype.
It was written with minimal diff and clarity in mind to make the policy
reviewable. Refinements and simplifications are expected and welcome.
> OK, so this sounds like it's not ready for prime time....
I don’t consider it “not ready for prime time.” It is a rather simple refinement of the existing allocator, producing clean, contiguous layouts with sequential allocation across the LBA space, without an increase in complexity and with equal or lower latency.
Further refinements are anticipated and welcome, not because the current approach is flawed, but because this seems like an area where we can reasonably ask whether it can be made even better.
Regarding discard/trim: while discard prepares blocks for reuse and
signals that a block is free, it does not implement wear leveling by
itself. Rotalloc operates at a higher layer; by promoting sequentiality,
it reduces block/group allocation hotspots regardless of underlying
device behavior.
Since it is not in line with the current allocator goals, it is
implemented as an optional policy.
> Again, what is the high level goal of rotalloc? What specific
> hardware and workload are you trying to optimize for? If you want to
> impose a maintenance overhead on upstream, you need to justify why the
> maintenance overhead is worth it. And so that means you need to be a
> bit more explicit about what specific real-world problem you are
> trying to solve....
> - Ted
Again, we’re not focusing solely on wear leveling here, but since we
can’t influence the WL implementation itself, the only lever we have is
our own allocation policy.
The question I’m trying to sanity-check is whether we can avoid
reinforcing this pattern, and instead aim for an allocation strategy
that minimizes the issue, or avoids it entirely if possible.
Even though the pattern is clear in practice, I’m not claiming it
applies universally, only that it appears often enough to be worth
discussing at the policy level. For that reason, it seems reasonable to
treat this as an optional policy choice, disabled by default.
Sincerely,
Mario