Re: [HELP] FUSE writeback performance bottleneck

From: Jingbo Xu
Date: Wed Sep 11 2024 - 05:34:25 EST

Hi all,

On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
>> IIUC, there are two sources that may cause deadlock:
>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
>> requests, which in turn triggers direct memory reclaim, and FUSE
>> writeback then - deadlock here
>
> Yep, see the folio_wait_writeback() call deep in the guts of direct
> reclaim, which sleeps until the PG_writeback flag is cleared. If that
> happens to be triggered by the writeback in question, then that's a
> deadlock.

After diving deep into the direct reclaim code, here are some insights
that may be helpful.

Back when support for fuse writeback was introduced, i.e. commit
3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim
did indeed wait unconditionally for the PG_writeback flag to be
cleared. At that time direct reclaim was implemented in two stages:
stage 1) pass over the LRU list to start parallel writeback
asynchronously, and stage 2) synchronously wait for completion of the
writeback previously started.

This two-stage design, along with the unconditional wait for the
PG_writeback flag to be cleared, was removed by commit 41ac199 ("mm:
vmscan: do not stall on writeback during memory compaction") in v3.5.

Though the direct reclaim logic has continued to evolve and the wait
was added back, the stall now happens only when reclaim is triggered
from kswapd or from a memory cgroup.

Specifically, the stall only happens under the following conditions
(see shrink_folio_list() for details):
1) kswapd
2) or it's a user process under a non-root memory cgroup (actually
cgroup_v1) with GFP_IO permitted

Thus the potential deadlock does not actually exist (if I'm not wrong) if:
1) cgroup is not enabled
2) or cgroup_v2 is actually used
3) or (memory cgroup is enabled and attached to cgroup_v1) the fuse
server actually resides under the root cgroup
4) or (the fuse server resides under a non-root memory cgroup_v1), but
the fuse server advertises itself as a PR_IO_FLUSHER[1]

Then we could consider adding a new feature bit indicating that at
least one of the above conditions is met and thus the fuse server is
safe from the potential deadlock inside direct reclaim. When this
feature bit is set, the kernel side could bypass the temp page copying
when doing writeback.

As for condition 4 (PR_IO_FLUSHER), there was a concern from
Miklos[2]. I think the new feature bit could be disabled by default,
and enabled only when the fuse server itself guarantees that it runs
under one of the safe conditions above. Even if it is enabled by
mistake or by a malicious fuse server, and thus causes a deadlock,
maybe the sysadmin could still abort the connection through the abort
knob in fusectl?
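
For reference, a fuse server would opt into condition 4 roughly as
sketched below, using prctl(PR_SET_IO_FLUSHER) (Linux 5.6 and later,
requires CAP_SYS_RESOURCE). The two helper names are hypothetical, and
the fallback #defines are only there for builds against older headers.

```c
#include <assert.h>
#include <errno.h>
#include <sys/prctl.h>

/* Fallback for headers that predate Linux 5.6. */
#ifndef PR_SET_IO_FLUSHER
#define PR_SET_IO_FLUSHER 57
#define PR_GET_IO_FLUSHER 58
#endif

/*
 * Mark the calling thread as an IO flusher.
 * Returns 0 on success or a negative errno value, e.g. -EPERM when the
 * caller lacks CAP_SYS_RESOURCE.
 */
int fuse_server_mark_io_flusher(void)
{
	if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1)
		return -errno;
	return 0;
}

/*
 * Query the IO flusher state of the calling thread.
 * Returns 1 if set, 0 if not set, or a negative errno value.
 */
int fuse_server_is_io_flusher(void)
{
	int rc = prctl(PR_GET_IO_FLUSHER, 0, 0, 0, 0);

	if (rc == -1)
		return -errno;
	assert(rc == 0 || rc == 1);
	return rc;
}
```

The state is per thread, so a multi-threaded server would presumably
need to make this call in every thread that processes FUSE_WRITE
requests.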

Just some insights and brainstorming here.

[1] https://lore.kernel.org/all/Zl4%2FOAsMiqB4LO0e@xxxxxxxxxxxxxxxxxxx/
[2] https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/

--
Thanks,
Jingbo
