Re: [f2fs-dev] [PATCH v2] f2fs: another way to set large folio by remembering inode number
From: Theodore Tso
Date: Tue May 26 2026 - 09:46:53 EST
On Tue, May 26, 2026 at 01:10:55AM +0000, Jaegeuk Kim wrote:
> Background
> ----------
> The primary use case is accelerating AI model loading, which demands
> exceptionally high sequential read speeds. In our benchmarks on embedded
> systems:
> - Using high-order page allocations allows the system to saturate the
> Universal Flash Storage (UFS) bandwidth, reaching 4 GB/s even at
> medium-to-low CPU frequencies.
> - In contrast, standard small folios cap performance at 2 GB/s.
So you're interested in optimizing the I/O speeds. And apparently, on
your hardware, the UFS controller has limits on scatter-gather entries
--- UFS seems to call this Physical Region Description (PRD) table
entries. Per Gemini:
    1. PRD Segment & Length Limits
       Maximum PRD Entries: Hardware limits typically cap the number
       of PRD entries (or segments) to 255 or 256 per transfer
       request.
       Maximum Transfer Length: Each individual PRD entry typically
       allows a maximum transfer size of (65,535 bytes) per segment.
    2. Host Controller Hardware Limits (UFSHCI)
       Transfer Queue Depth: A UFS controller supports a predefined
       number of outstanding task request entries. This is often
       hard-capped at 32 concurrent transfer requests (slots) by the
       doorbell register array.
       Descriptor Pre-fetch: Some UFS host controllers are
       pre-configured to pre-fetch multiple PRD entries sequentially
       before requiring main memory reads.
Is this an accurate description of the limits that you are trying to
work with? How much data are you trying to read? Looking at Gemma 4
models, E2B is about 10GB or 3GB for the 4-bit quantized version. E4B
is 15GB, or 5GB for the 4-bit quantized version. Is that about right?
It seems... surprising that the additional I/O operations are actually
throttling UFS device bandwidth by 2x (4GB/s vs 2GB/s). Have you dug
into why this is happening, and whether there is anything that can be
optimized below the file system?
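For a rough sense of scale, here is a back-of-the-envelope sketch. It
assumes 4 KiB base pages, a 256-entry PRD limit per request (taken from
the Gemini text above, not verified for this hardware), and one PRD
entry per physically contiguous folio. It ignores any per-entry size cap
and any merging of adjacent pages by the block layer, so treat the
numbers as illustrative only.

```python
# Back-of-the-envelope: bytes one transfer request can cover when each
# PRD (scatter-gather) entry maps exactly one physically contiguous folio.
# All constants are assumptions for illustration, not measured values.

PAGE_SIZE = 4096          # assumed base page size in bytes
MAX_PRD_ENTRIES = 256     # assumed per-request PRD entry limit


def max_request_bytes(folio_order, max_entries=MAX_PRD_ENTRIES):
    """Upper bound on bytes per request for folios of the given order."""
    folio_bytes = PAGE_SIZE << folio_order
    return max_entries * folio_bytes


def requests_needed(total_bytes, folio_order, max_entries=MAX_PRD_ENTRIES):
    """Minimum number of requests needed to read total_bytes."""
    per_request = max_request_bytes(folio_order, max_entries)
    return -(-total_bytes // per_request)  # ceiling division


if __name__ == "__main__":
    model_bytes = 3 << 30  # a 3 GiB model file, as an example size
    for order in (0, 4):
        print(order, max_request_bytes(order), requests_needed(model_bytes, order))
```

Under these assumptions, order-0 folios limit a request to 1 MiB, while
order-4 (64 KiB) folios allow 16 MiB, so a 3 GiB file needs 3072
requests instead of 192. Whether that difference explains a 2x drop in
throughput is exactly the open question.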
> Problem Statement
> -----------------
> High-order pages become heavily fragmented and scarce shortly after
> device boot. We cannot afford to deplete these limited resources on
> default filesystem operations using large folios. Instead, we need a
> mechanism to strictly prioritize and reserve high-order allocations
> for specific, critical payloads—specifically, large AI model files.
There's a fundamental assumption here, which is that the only use of
high order pages is the page cache. This doesn't take into account
anonymous pages used by programs that aren't backed by files. Nor does
it take into account kernel memory allocations.
But that being said, you seem to be assuming that you can reduce the
pressure on high order pages by only using large folios for these AI
model files.
But the problem with using small folios is that if you want to
actually *use* the memory, unless you want to segment out the memory
so it can't be used for anything other than the AI models (e.g., by
using something like hugetlbfs), it's just going to break up the memory
into smaller folios. So that's not actually going to *help* in actual
real life use cases. It might help for your artificial benchmarks /
experiments, but in the real life case where Android applications are
running and fragmenting all of the device memory, the large folios
won't be available *anyway*.
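One way this could be checked on a real device is to sample
/proc/buddyinfo, which lists the number of free blocks per allocation
order for each zone, before and after a realistic application workload.
The following is a minimal sketch; the helper names are made up for
illustration, and the choice of order 4 as the threshold is an
assumption, not something taken from the patch.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count per order, order 0 first]}."""
    result = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected shape: "Node", "0,", "zone", "<name>", count, count, ...
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        result[(node, fields[3])] = [int(x) for x in fields[4:]]
    return result


def free_pages_at_or_above(counts, min_order):
    """Base pages held in free blocks of order >= min_order."""
    return sum(n << order for order, n in enumerate(counts) if order >= min_order)


if __name__ == "__main__":
    with open("/proc/buddyinfo") as f:
        zones = parse_buddyinfo(f.read())
    for (node, zone), counts in zones.items():
        # Order 4 (64 KiB with 4 KiB pages) is an arbitrary example threshold.
        print(node, zone, free_pages_at_or_above(counts, 4))
```

If the count of pages in high-order blocks collapses once ordinary
applications have been running for a while, a per-file hint cannot
conjure them back.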
>
> Q: Why is deregistering the inode number linked to inode deletion?
> A: We need the high-order allocation hint to persist even if the inode is
> temporarily evicted from the VFS cache. To achieve this, we maintain a tracking
> list of hinted inode numbers. When a file is permanently deleted, its hint
> becomes obsolete, requiring us to deregister it from the list to prevent memory
> leaks or identifier reuse conflicts.
Assuming that the high-order allocation hint is a good thing, why not
just make it persistent? e.g., just a *real* extended attribute
(which is more wasteful of space), or grab a flag in the on-disk f2fs
inode? Then you don't need to have an in-memory list of hinted
inodes; instead, you can just have the Android package manager set
that flag indicating that you want that special treatment. This is
all assuming that we need an explicit hint, though....
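As a sketch of what the userspace side of the extended attribute
variant might look like: the attribute name below is purely
hypothetical, f2fs does not define or honor it today, and the kernel
side that would act on it is not shown. The code is Linux-only, since
it relies on os.setxattr and os.getxattr.

```python
import os

# Hypothetical attribute name, used only for illustration; f2fs does not
# define or honor it today.
HINT_XATTR = "user.large_folio_hint"


def set_large_folio_hint(path):
    """Try to tag the file persistently; return True if the xattr was stored."""
    try:
        os.setxattr(path, HINT_XATTR, b"1")
        return True
    except OSError:
        # e.g. the file system does not support user.* extended attributes
        return False


def has_large_folio_hint(path):
    """Return True if the file carries the hint xattr."""
    try:
        return os.getxattr(path, HINT_XATTR) == b"1"
    except OSError:
        # Attribute missing, or xattrs unsupported on this file system
        return False
```

Because the attribute lives with the inode on disk, it survives inode
eviction and goes away when the file is deleted, so no in-memory list
of inode numbers needs to be maintained.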
> Massive AI model loading is a long-term architectural
> paradigm. Providing a targeted VFS/filesystem hint to optimize read
> bandwidth for specific large datasets is a highly practical,
> repeatable pattern that addresses a systemic bottleneck in embedded
> AI deployments.
It's really too bad you didn't propose this as an LSF/MM topic and
present it at a session in Zagreb two weeks ago. That would have
been a much more upstream-friendly way of collaborating, and it might
have allowed the mm experts to give you some more dynamic, real-time
feedback.
Cheers,
- Ted