Re: [PATCHv2 1/7] zram: introduce compressed data writeback
From: zhangdongdong
Date: Thu Jan 08 2026 - 06:29:57 EST
On 1/8/26 11:39, Sergey Senozhatsky wrote:
Hi,
On (26/01/08 10:57), zhangdongdong wrote:
Do you use any strategies for writeback? Compressed writeback
is supposed to be used for apps for which latency is not critical
or sensitive, because of on-demand decompression costs.
Hi Sergey,
Sorry for the delayed reply — I had some urgent matters come up and only
got back to this now ;)
No worries, you reply in a perfectly reasonable time frame.
Yes, we do use writeback strategies on our side. The current implementation
focuses on batched writeback of compressed data from
zram, managed on a per-app / per-memcg basis. We track and control how
much data from each app is written back to the backing storage, with the
same assumption you mentioned: compressed writeback is primarily
intended for workloads where latency is not critical.
Accurate prefetching on swap-in is still an open problem for us. As you
pointed out, both the I/O itself and on-demand decompression introduce
additional latency on the readback path, and minimizing their impact
remains challenging.
Regarding the workqueue choice: initially we used system_dfl_wq for the
read/decompression path. Later, based on observed scheduling latency
under memory pressure, we switched to a dedicated workqueue created with
WQ_HIGHPRI | WQ_UNBOUND. This change helped reduce scheduling
interference, but it also reinforced our concern that deferring
decompression to a worker still adds an extra scheduling hop on the
swap-in path.
How bad (and often) is your memory pressure situation? I just wonder
if your case is an outlier, so to speak.
Just thinking aloud:
I really don't see a path back to atomic zram read/write. Those
were very painful and problematic, and I do not see re-introducing
them as an option, especially if the reason is an optional
feature (which comp-wb is). If we want to improve latency, we need
to find a way to do it without going back to atomic read/write,
assuming that latency becomes unbearable. But at the same time under
memory pressure everything becomes janky at some point, so I don't
know if comp-wb latency is the biggest problem in that case.
Dunno, *maybe* we can explore a possibility of grabbing both entry-lock
and per-CPU compression stream before we queue async bio, so that in
the bio completion we already *sort of* have everything we need.
However, that comes with a bunch of issues:
- the number of per-CPU compression streams is limited, naturally,
to the number of CPUs. So if we have a bunch of comp-wb reads we
can block all other activities: normal zram reads/writes, which
compete for the same per-CPU compression streams.
- this still puts atomicity requirements on the compressors. I haven't
looked into, for instance, zstd *de*-compression code, but I know for
sure that zstd compression code allocates memory internally when
configured to use pre-trained CD-dictionaries, effectively making zstd
use GFP_ATOMIC allocations internally, if called from atomic context.
Do we have anything like that in decompression - I don't know. But in
general we cannot be sure that all compressors work in atomic context
in the same way as they do in non-atomic context.
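To make the first concern concrete, here is a minimal userspace sketch of stream starvation. It is a toy model, not zram code: NR_CPUS, struct stream_pool, stream_get(), stream_put() and normal_ops_served() are all hypothetical names invented for illustration. The only assumptions taken from the discussion above are that there is one stream per CPU and that a comp-wb read would hold its stream from bio submission until completion.

```c
#include <stddef.h>

/* Hypothetical model: one compression stream per CPU. */
#define NR_CPUS 4

struct stream_pool {
	int free;	/* streams not currently held by anyone */
};

/* Try to take a stream; returns 1 on success, 0 if none is free. */
static int stream_get(struct stream_pool *p)
{
	if (p->free == 0)
		return 0;
	p->free--;
	return 1;
}

static void stream_put(struct stream_pool *p)
{
	p->free++;
}

/*
 * Count how many of `normal_ops` regular reads/writes find a stream
 * while `inflight_wb_reads` comp-wb reads each pin one stream for the
 * whole duration of their I/O.
 */
static int normal_ops_served(int inflight_wb_reads, int normal_ops)
{
	struct stream_pool pool = { .free = NR_CPUS };
	int served = 0;
	int i;

	/* comp-wb reads grab a stream before submitting the bio ... */
	for (i = 0; i < inflight_wb_reads; i++)
		stream_get(&pool);	/* ... and keep it until completion */

	/* normal operations take a stream and release it right away */
	for (i = 0; i < normal_ops; i++) {
		if (stream_get(&pool)) {
			served++;
			stream_put(&pool);
		}
	}
	return served;
}
```

In this model a single in-flight comp-wb read leaves the other operations unaffected, but once the number of in-flight comp-wb reads reaches NR_CPUS, no normal operation can obtain a stream until a bio completes.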
I don't know if solving it on zram side alone is possible. Maybe we
can get some help from the block layer: some sort of two-stage bio
submission. First stage: submit chained bio-s, second stage: iterate
over all submitted and completed bio-s and decompress the data. Again,
just thinking out loud.
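The two-stage idea can be sketched in userspace roughly as follows. This is only an illustration of the control flow, under loud assumptions: no such block-layer interface exists in this form, and struct wb_req, stage_one_submit(), stage_two_decompress() and the toy run-length decoder are all hypothetical. The point is that the "completion" step only marks the I/O as done, and all decompression happens later in a context that is allowed to sleep.

```c
#include <stddef.h>
#include <string.h>

#define PAGE_SZ 16	/* toy page size, for illustration only */

struct wb_req {
	const unsigned char *src;	/* compressed data "read" from the backing device */
	size_t src_len;
	unsigned char page[PAGE_SZ];	/* decompressed output */
	size_t page_len;
	int io_done;			/* set in stage one, consumed in stage two */
};

/*
 * Toy run-length decoder standing in for a real decompressor.
 * Input is a sequence of (count, byte) pairs. Returns the output
 * length, or 0 on malformed input, overflow, or empty input.
 */
static size_t toy_decompress(const unsigned char *src, size_t src_len,
			     unsigned char *dst, size_t dst_cap)
{
	size_t out = 0;
	size_t i;

	if (src_len % 2)
		return 0;
	for (i = 0; i < src_len; i += 2) {
		size_t run = src[i];

		if (run > dst_cap - out)
			return 0;
		memset(dst + out, src[i + 1], run);
		out += run;
	}
	return out;
}

/* Stage one: "submit" the reads; completion only records that I/O finished. */
static void stage_one_submit(struct wb_req *reqs, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		reqs[i].io_done = 1;	/* no decompression in completion context */
}

/*
 * Stage two: walk the completed requests and decompress them. In the
 * real proposal this would run in a sleepable context. Returns the
 * number of requests that were decompressed successfully.
 */
static size_t stage_two_decompress(struct wb_req *reqs, size_t n)
{
	size_t done = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (!reqs[i].io_done)
			continue;
		reqs[i].page_len = toy_decompress(reqs[i].src, reqs[i].src_len,
						  reqs[i].page, PAGE_SZ);
		if (reqs[i].page_len > 0)
			done++;
	}
	return done;
}
```

Keeping the completion step this small is what would remove the atomicity requirement on the compressors in such a scheme. The cost is that the caller has to wait for the whole batch before any page becomes usable.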
Hi Sergey,
My thinking is largely aligned with yours. I agree that relying on zram
alone is unlikely to fully solve this problem, especially without going
back to atomic read/write.
Our current mitigation approach is to introduce a hook at the swap layer
and move decompression there. By doing so, decompression happens in a
fully sleepable context, which avoids the atomic-context constraints
you outlined. This helps us sidestep the core issue rather than trying
to force decompression back into zram completion paths.
For reference, this is the change we are experimenting with:
https://android-review.googlesource.com/c/kernel/common/+/3724447
I also noticed that Richard proposed a similar optimization hook recently:
https://android-review.googlesource.com/c/kernel/common/+/3730147
Regarding your question about memory pressure: our current test case
runs on an 8 GB device, with around 50 apps being launched sequentially.
This creates fairly heavy memory pressure. In earlier tests using an
async kworker-based approach, we observed an average latency of about
1.3 ms, but with tail latencies occasionally reaching 30–100 ms.
If I recall correctly, this issue first became noticeable after a block
layer change was merged; I can try to dig that up and share more details
later.
Best regards,
dongdong