Re: [HELP] FUSE writeback performance bottleneck

From: Joanne Koong
Date: Thu Sep 12 2024 - 19:23:19 EST

Next message: Deepak Gupta: "[PATCH v4 16/30] riscv/shstk: If needed allocate a new shadow stack on clone"
Previous message: Deepak Gupta: "[PATCH v4 15/30] riscv/mm: Implement map_shadow_stack() syscall"
In reply to: Jingbo Xu: "Re: [HELP] FUSE writeback performance bottleneck"
Next in thread: Jingbo Xu: "Re: [HELP] FUSE writeback performance bottleneck"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> Hi all,
>
> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> > On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >
> >> IIUC, there are two sources that may cause deadlock:
> >> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> >> requests, which in turn triggers direct memory reclaim, and FUSE
> >> writeback then - deadlock here
> >
> > Yep, see the folio_wait_writeback() call deep in the guts of direct
> > reclaim, which sleeps until the PG_writeback flag is cleared. If that
> > happens to be triggered by the writeback in question, then that's a
> > deadlock.
>
> After diving deep into the direct reclaim code, here are some insights
> that may be helpful.
>
> Back when support for fuse writeback was introduced, i.e. commit
> 3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim
> did indeed wait unconditionally for the PG_writeback flag to be
> cleared. At that time direct reclaim was implemented in a two-stage
> style: stage 1) pass over the LRU list to start parallel writeback
> asynchronously, and stage 2) synchronously wait for completion of the
> writeback previously started.
>
> This two-stage design and the unconditional wait for the PG_writeback
> flag to be cleared were removed by commit 41ac199 ("mm: vmscan: do not
> stall on writeback during memory compaction") in v3.5.
>
> Though the direct reclaim logic has continued to evolve and the waiting
> was added back, the stall now happens only when the reclaim is
> triggered from kswapd or memory cgroup.
>
> Specifically, the stall will only happen under the following conditions
> (see shrink_folio_list() for details):
> 1) kswapd
> 2) or it's a user process under a non-root memory cgroup (actually
> cgroup_v1) with GFP_IO permitted
>
> Thus the potential deadlock does not exist actually (if I'm not wrong) if:
> 1) cgroup is not enabled
> 2) or cgroup_v2 is actually used
> 3) or (memory cgroup is enabled and is attached upon cgroup_v1) the fuse
> server actually resides under the root cgroup
> 4) or (the fuse server resides under a non-root memory cgroup_v1), but
> the fuse server advertises itself as a PR_IO_FLUSHER[1]
>
>
> Then we could consider adding a new feature bit indicating that any
> one of the above conditions is met and thus the fuse server is safe from
> the potential deadlock inside direct reclaim. When this feature bit is
> set, the kernel side could bypass the temp page copying when doing
> writeback.
>

Hi Jingbo, thanks for sharing your analysis of this.

Having the temp page copying gated on the conditions you mentioned
above seems a bit brittle to me. My understanding is that the mm code
that decides whether to stall can change at any time in the future,
in which case it could silently break our precondition assumptions.
Additionally, if I'm understanding it correctly, we would also need to
know whether the writeback is being triggered from reclaim by kswapd -
is there even a way in the kernel to check that?
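
To make the preconditions under discussion concrete, here is a rough
userspace sketch (not kernel code) of the four "safe" conditions from the
quoted analysis folded into one predicate. The struct, its fields, and the
function name are hypothetical and exist only for illustration. It models
only the fuse server side and says nothing about kswapd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of where the fuse server runs. Not a kernel type. */
struct fuse_server_env {
    bool memcg_enabled;   /* memory cgroup controller is in use at all */
    bool cgroup_v2;       /* the hierarchy is cgroup v2 rather than v1 */
    bool in_root_cgroup;  /* the server sits in the root memory cgroup */
    bool io_flusher;      /* the server marked itself as an IO flusher
                             (PR_IO_FLUSHER in the quoted text) */
};

/*
 * Returns true if any one of conditions 1-4 from the quoted analysis holds,
 * i.e. the server is assumed not to stall on PG_writeback in direct reclaim.
 * This encodes the assumptions as stated in the thread, nothing more.
 */
static bool server_avoids_reclaim_stall(const struct fuse_server_env *env)
{
    assert(env != NULL);

    if (!env->memcg_enabled)      /* 1) cgroup is not enabled */
        return true;
    if (env->cgroup_v2)           /* 2) cgroup_v2 is used */
        return true;
    if (env->in_root_cgroup)      /* 3) v1, but in the root cgroup */
        return true;
    return env->io_flusher;       /* 4) non-root v1, but an IO flusher */
}
```

Each branch mirrors a behavior of the current reclaim code, so any change
on the mm side would have to be tracked here as well.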

I'm wondering if there's some way we could tell whether a folio is under
reclaim when we're writing it back. I'm not familiar yet with the
reclaim code, but my initial thought was whether it'd be possible to
repurpose the PG_reclaim flag, or perhaps check whether the folio is not
on any lru list, as an indication that it's being reclaimed. We could
then use the temp page in those cases and skip it otherwise.
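
As a rough illustration of that idea, here is a userspace model (not kernel
code) of the decision. The struct and function names are hypothetical. In
the kernel the two booleans would presumably come from something like
folio_test_reclaim() and folio_test_lru(), but whether those flags reliably
mean "this folio is being reclaimed" at writeback time is exactly the open
question, so treat this as a sketch of the proposal and not a working fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two pieces of folio state discussed above. */
struct folio_model {
    bool reclaim_flag;  /* models PG_reclaim being set on the folio */
    bool on_lru;        /* models the folio being on some lru list */
};

/*
 * Returns true if writeback should fall back to copying into a temp page.
 * Assumption: a folio with PG_reclaim set, or one that is off the lru,
 * is being written back on behalf of reclaim.
 */
static bool writeback_wants_temp_page(const struct folio_model *folio)
{
    assert(folio != NULL);

    return folio->reclaim_flag || !folio->on_lru;
}
```

With this shape, ordinary background writeback of a folio that is on the
lru with PG_reclaim clear would skip the temp page, and everything else
would keep the current behavior.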

Could you also point me to where in the reclaim code we end up
invoking the writeback callback? I see pageout() calls ->writepage()
but I'm not seeing where we invoke ->writepages().

Thanks,
Joanne

>
> As for condition 4 (PR_IO_FLUSHER), there was a concern from
> Miklos[2]. I think the new feature bit could be disabled by default,
> and enabled only when the fuse server itself guarantees that it is in a
> safe distribution condition. Even if it's enabled by mistake or by a
> malicious fuse server, and thus causes a deadlock, maybe the sysadmin
> could still abort the connection through the abort sysctl knob?
>
>
> Just some insights and brainstorming here.
>
>
> [1] https://lore.kernel.org/all/Zl4%2FOAsMiqB4LO0e@xxxxxxxxxxxxxxxxxxx/
> [2]
> https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/
>
>
>
> --
> Thanks,
> Jingbo
