Re: [HELP] FUSE writeback performance bottleneck
From: Joanne Koong
Date: Fri Sep 13 2024 - 16:55:32 EST
On Thu, Sep 12, 2024 at 8:35 PM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
>
> On 9/13/24 7:18 AM, Joanne Koong wrote:
> > On Wed, Sep 11, 2024 at 2:32 AM Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>
> >> Hi all,
> >>
> >> On 6/4/24 3:27 PM, Miklos Szeredi wrote:
> >>> On Tue, 4 Jun 2024 at 03:57, Jingbo Xu <jefflexu@xxxxxxxxxxxxxxxxx> wrote:
> >>>
> >>>> IIUC, there are two sources that may cause deadlock:
> >>>> 1) the fuse server needs memory allocation when processing FUSE_WRITE
> >>>> requests, which in turn triggers direct memory reclaim, which then
> >>>> waits on the FUSE writeback - deadlock here
> >>>
> >>> Yep, see the folio_wait_writeback() call deep in the guts of direct
> >>> reclaim, which sleeps until the PG_writeback flag is cleared. If that
> >>> happens to be triggered by the writeback in question, then that's a
> >>> deadlock.
> >>
> >> After diving deep into the direct reclaim code, here are some insights
> >> that may be helpful.
> >>
> >> Back when support for fuse writeback was introduced, i.e. commit
> >> 3be5a52b30aa ("fuse: support writable mmap") in v2.6.26, direct reclaim
> >> indeed unconditionally waited for the PG_writeback flag to be cleared.
> >> At that time direct reclaim was implemented in a two-stage style:
> >> stage 1) pass over the LRU list to start parallel writeback
> >> asynchronously, and stage 2) synchronously wait for completion of the
> >> writeback previously started.
> >>
> >> This two-stage design and the unconditional waiting for the
> >> PG_writeback flag to be cleared were removed by commit 41ac199 ("mm:
> >> vmscan: do not stall on writeback during memory compaction") in v3.5.
> >>
> >> Though the direct reclaim logic has continued to evolve and the
> >> waiting has been added back, the stall now happens only when the
> >> reclaim is triggered from kswapd or a memory cgroup.
> >>
> >> Specifically, the stall will only happen under the following
> >> conditions (see shrink_folio_list() for details):
> >> 1) it's kswapd
> >> 2) or it's a user process under a non-root memory cgroup (actually
> >> cgroup_v1) with GFP_IO permitted
> >>
> >> Thus the potential deadlock actually does not exist (if I'm not wrong) if:
> >> 1) cgroup is not enabled
> >> 2) or cgroup_v2 is actually used
> >> 3) or (memory cgroup is enabled and attached to cgroup_v1) the fuse
> >> server actually resides under the root cgroup
> >> 4) or (the fuse server resides under a non-root memory cgroup_v1) the
> >> fuse server advertises itself as a PR_IO_FLUSHER [1]
> >>
> >>
> >> Then we could consider adding a new feature bit indicating that any
> >> one of the above conditions is met and thus the fuse server is safe
> >> from the potential deadlock inside direct reclaim. When this feature
> >> bit is set, the kernel side could bypass the temp page copying when
> >> doing writeback.
> >>
> >
> > Hi Jingbo, thanks for sharing your analysis of this.
> >
> > Having the temp page copying gated on the conditions you mentioned
> > above seems a bit brittle to me. My understanding is that the mm code
> > that decides when to stall or not stall can change anytime in the
> > future, in which case that could automatically break our precondition
> > assumptions.
>
> So this is why PR_IO_FLUSHER is introduced here; it is specifically
> meant for user space components playing a role in the IO stack, e.g.
> the fuse daemon, tcmu/nbd daemons, etc. PR_IO_FLUSHER offers a
> guarantee similar to GFP_NOIO, but for user space components. At least
> we can rely on the assumption that mm will take PR_IO_FLUSHER into
> account.
>
> The limitation of the PR_IO_FLUSHER approach is that, as pointed out
> by Miklos [1], there may be multiple components or services involved
> in servicing the fuse requests, and the kernel side has no effective
> way to check whether all services in the whole chain have set
> PR_IO_FLUSHER.
>
Right, so doesn't that still bring us back to the original problem: if
we gate this on any one of those conditions being enough to bypass
needing the temp page, and the conditions in the mm code change at some
point in the future, then that would automatically open up the
potential deadlock in fuse as a byproduct? Having that dependency seems
a bit brittle to me.
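For reference, advertising PR_IO_FLUSHER from the server side is just a
one-time prctl() call (the actual constant is PR_SET_IO_FLUSHER, added
in Linux 5.6; it requires CAP_SYS_RESOURCE). A minimal sketch, just for
illustration, of what a fuse server would do at startup:

    #include <stdio.h>
    #include <sys/prctl.h>  /* needs kernel headers >= 5.6 for PR_SET_IO_FLUSHER */

    /* Mark this process as an IO flusher so that memory management
     * treats it like a kernel IO worker (similar in spirit to GFP_NOIO). */
    static int advertise_io_flusher(void)
    {
            if (prctl(PR_SET_IO_FLUSHER, 1, 0, 0, 0) == -1) {
                    perror("prctl(PR_SET_IO_FLUSHER)");
                    return -1;
            }
            return 0;
    }

But as you note, that only covers the fuse server process itself, not
every other service in the chain.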
The other alternatives seem to be:
* adding a timer to writeback requests [1] where, if the pages have not
been copied out to userspace by a certain amount of time, the handler
copies those pages to temporary pages and immediately clears writeback
on the originals. The timer is canceled as soon as the pages are about
to be copied out to userspace.
* (not sure how possible this is) add some way to tag pages being
reclaimed/balanced (I saw your comment below about the
->migrate_folio() call, which I need to look more into)
The timeout option seems like the most promising one to me. I don't
think the code would be that ugly; a rough sketch of what I'm imagining
is below.
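Very roughly, and with made-up names (this assumes a new
struct timer_list field in fuse_writepage_args, and
fuse_writepage_copy_to_tmp() / FUSE_TMP_COPY_TIMEOUT are placeholders
for the copy + end-writeback step and the timeout value), just to show
the timer plumbing I have in mind:

    #include <linux/timer.h>
    #include <linux/jiffies.h>

    #define FUSE_TMP_COPY_TIMEOUT	(HZ)	/* e.g. 1 second, tunable */

    /*
     * Timer callback: the server did not pick the request up in time,
     * so fall back to the old behavior: copy the dirty pages into temp
     * pages and end writeback on the originals so reclaim can't stall
     * on them. A real patch would probably punt this to a workqueue
     * rather than copy pages in timer (softirq) context.
     */
    static void fuse_tmp_copy_timer_fn(struct timer_list *t)
    {
            struct fuse_writepage_args *wpa =
                    from_timer(wpa, t, tmp_copy_timer);

            fuse_writepage_copy_to_tmp(wpa);	/* placeholder helper */
    }

    /* Arm the timer when the writeback request is queued. */
    static void fuse_writepage_arm_tmp_timer(struct fuse_writepage_args *wpa)
    {
            timer_setup(&wpa->tmp_copy_timer, fuse_tmp_copy_timer_fn, 0);
            mod_timer(&wpa->tmp_copy_timer, jiffies + FUSE_TMP_COPY_TIMEOUT);
    }

    /* Cancel the timer right before the request is copied out to the
     * server; if it already fired, the temp copy has already happened. */
    static void fuse_writepage_cancel_tmp_timer(struct fuse_writepage_args *wpa)
    {
            timer_delete_sync(&wpa->tmp_copy_timer);
    }

The fiddly part is probably the race between the timer firing and the
request being dequeued, not the timer itself.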
Curious to hear your thoughts on this. Are there any other
alternatives you think could work here?
[1] https://lore.kernel.org/all/CAJfpegt_mEYOeeTo2bWS3iJfC38t5bf29mzrxK68dhMptrgamg@xxxxxxxxxxxxxx/
>
> > Additionally, if I'm understanding it
> > correctly, we also would need to know if the writeback is being
> > triggered from reclaim by kswapd - is there even a way in the kernel
> > to check that?
>
> Nope. What I meant in the previous email is that kswapd can get
> stalled during reclaim, while a normal process, e.g. the fuse server,
> may not get stalled under certain conditions, e.g. when it explicitly
> advertises PR_IO_FLUSHER.
>
Gotcha. I just took a look at shrink_folio_list() and now I see the
"current_is_kswapd()" check.
> >
> > I'm wondering if there's some way we could tell if a folio is under
> > reclaim when we're writing it back. I'm not familiar yet with the
> > reclaim code, but my initial thoughts were whether it'd be possible to
> > repurpose the PG_reclaim flag or perhaps if the folio is not on any lru
> > list, as an indication that it's being reclaimed. We could then just
> > use the temp page in those cases, and skip the temp page otherwise.
>
> That is a good idea but I'm afraid it doesn't work. Explained below.
>
> >
> > Could you also point me to where in the reclaim code we end up
> > invoking the writeback callback? I see pageout() calls ->writepage()
> > but I'm not seeing where we invoke ->writepages().
>
> Yes, the direct reclaim would end up calling ->writepage() to write
> back the dirty page. ->writepages() is only called from the normal
> writeback routine, e.g. when triggered from balance_dirty_pages().
>
> Also FYI FUSE has removed ->writepage() since commit e1c420a ("fuse:
> Remove fuse_writepage"), and now it relies on ->migrate_folio(), i.e.
> memory compaction, and on the normal writeback routine (triggered from
> balance_dirty_pages()) under low memory.
>
> Thus I'm afraid the approach of doing the temp page copying only for
> writeback issued from the direct reclaim code actually doesn't work.
> That's because when doing direct reclaim, the process not only waits
> for completion of the writeback submitted from direct reclaim (e.g.
> marked with PG_reclaim, by ->writepage), but may also wait for
> writeback submitted from the normal writeback routine (without
> PG_reclaim marked, by ->writepages). See commit c3b94f4 ("memcg:
> further prevent OOM with too many dirty pages").
>
Thanks for the explanation! This is very helpful. The reliance on
->migrate_folio() for reclaim is the piece I was missing.
>
>
> [1]
> https://lore.kernel.org/all/CAJfpegvYpWuTbKOm1hoySHZocY+ki07EzcXBUX8kZx92T8W6uQ@xxxxxxxxxxxxxx/
>
> --
> Thanks,
> Jingbo
Thanks,
Joanne