Re: [PATCH 1/5] mm/vmscan: Throttle reclaim until some writeback completes if congested

From: Dave Chinner
Date: Wed Sep 22 2021 - 02:05:02 EST


On Tue, Sep 21, 2021 at 11:58:31AM +0100, Mel Gorman wrote:
> On Tue, Sep 21, 2021 at 10:13:17AM +1000, NeilBrown wrote:
> > On Mon, 20 Sep 2021, Mel Gorman wrote:
> > > -long wait_iff_congested(int sync, long timeout)
> > > -{
> > > - long ret;
> > > - unsigned long start = jiffies;
> > > - DEFINE_WAIT(wait);
> > > - wait_queue_head_t *wqh = &congestion_wqh[sync];
> > > -
> > > - /*
> > > - * If there is no congestion, yield if necessary instead
> > > - * of sleeping on the congestion queue
> > > - */
> > > - if (atomic_read(&nr_wb_congested[sync]) == 0) {
> > > - cond_resched();
> > > -
> > > - /* In case we scheduled, work out time remaining */
> > > - ret = timeout - (jiffies - start);
> > > - if (ret < 0)
> > > - ret = 0;
> > > -
> > > - goto out;
> > > - }
> > > -
> > > - /* Sleep until uncongested or a write happens */
> > > - prepare_to_wait(wqh, &wait, TASK_UNINTERRUPTIBLE);
> >
> > Uninterruptible wait.
> >
> > ....
> > > +static void
> > > +reclaim_throttle(pg_data_t *pgdat, enum vmscan_throttle_state reason,
> > > + long timeout)
> > > +{
> > > + wait_queue_head_t *wqh = &pgdat->reclaim_wait;
> > > + unsigned long start = jiffies;
> > > + long ret;
> > > + DEFINE_WAIT(wait);
> > > +
> > > + atomic_inc(&pgdat->nr_reclaim_throttled);
> > > + WRITE_ONCE(pgdat->nr_reclaim_start,
> > > + node_page_state(pgdat, NR_THROTTLED_WRITTEN));
> > > +
> > > + prepare_to_wait(wqh, &wait, TASK_INTERRUPTIBLE);
> >
> > Interruptible wait.
> >
> > Why the change? I think these waits really need to be TASK_UNINTERRUPTIBLE.
> >
>
> Because from mm/ context, I saw no reason why the task *should* be
> uninterruptible. It's waiting on other tasks to complete IO and it is not
> protecting device state, filesystem state or anything else. If it gets
> a signal, it's safe to wake up, particularly if that signal is KILL and
> the context is a direct reclaimer.

I disagree. whether the sleep should be interruptable or
not is entirely dependent on whether the caller can handle failure
or not. If this is GFP_NOFAIL, allocation must not fail no matter
what the context is, so signals and the like are irrelevant.

For a context that can handle allocation failure, then it makes
sense to wake on events that will result in the allocation failing
immediately. But if all this does is make the allocation code go
around another retry loop sooner, then an interruptible sleep still
doesn't make any sense at all here...

> The original TASK_UNINTERRUPTIBLE is almost certainly a copy&paste from
> congestion_wait which may be called because a filesystem operation must
> complete before it can return to userspace so a signal waking it up is
> pointless.

Yup, but that AFAICT that same logic still applies. Only now it's
the allocation context that determines whether signal waking is
pointless or not...

Cheer,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx