Re: [PATCH 2/3] mm: throttle on IO only when there are too many dirty and writeback pages

From: Michal Hocko
Date: Thu Mar 17 2016 - 08:01:36 EST


On Thu 17-03-16 20:35:23, Tetsuo Handa wrote:
[...]
> But what I felt strange is what should_reclaim_retry() is doing.
>
> Michal Hocko wrote:
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f77e283fb8c6..b2de8c8761ad 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3044,8 +3045,37 @@ should_reclaim_retry(gfp_t gfp_mask, unsigned order,
> > */
> > if (__zone_watermark_ok(zone, order, min_wmark_pages(zone),
> > ac->high_zoneidx, alloc_flags, available)) {
> > - /* Wait for some write requests to complete then retry */
> > - wait_iff_congested(zone, BLK_RW_ASYNC, HZ/50);
> > + unsigned long writeback;
> > + unsigned long dirty;
> > +
> > + writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
> > + dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);
> > +
> > + /*
> > + * If we didn't make any progress and have a lot of
> > + * dirty + writeback pages then we should wait for
> > + * an IO to complete to slow down the reclaim and
> > + * prevent from pre mature OOM
> > + */
> > + if (!did_some_progress && 2*(writeback + dirty) > reclaimable) {
> > + congestion_wait(BLK_RW_ASYNC, HZ/10);
> > + return true;
> > + }
>
> writeback and dirty are used only when did_some_progress == 0. Thus, we don't
> need to calculate writeback and dirty using zone_page_state_snapshot() unless
> did_some_progress == 0.

OK, I will move this into the !did_some_progress branch.
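I.e. something along these lines (just an untested sketch of the reshuffled
hunk, so the snapshots are only taken once we know no progress was made):

	if (!did_some_progress) {
		unsigned long writeback = zone_page_state_snapshot(zone, NR_WRITEBACK);
		unsigned long dirty = zone_page_state_snapshot(zone, NR_FILE_DIRTY);

		/*
		 * No progress and a lot of dirty + writeback pages:
		 * wait for some IO to complete to slow down the reclaim
		 * and prevent a premature OOM.
		 */
		if (2*(writeback + dirty) > reclaimable) {
			congestion_wait(BLK_RW_ASYNC, HZ/10);
			return true;
		}
	}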

> But, does it make sense to take writeback and dirty into account when
> disk_events_workfn (trace shown above) is doing GFP_NOIO allocation and
> wb_workfn (trace shown above) is doing (presumably) GFP_NOFS allocation?
> Shouldn't we use different threshold for GFP_NOIO / GFP_NOFS / GFP_KERNEL?

I have considered skipping the throttling part for GFP_NOFS/GFP_NOIO
previously but I couldn't convince myself it would make any difference.
We know there was no progress in the reclaim, and even if the current
context is potentially doing an FS/IO allocation it obviously cannot get
its memory, so it cannot proceed. So now we are in a state where we
either busy loop or sleep for a while. I therefore ended up not
complicating the code even more. If you have a use case where busy
waiting makes a difference then I would vote for a separate patch with a
clear description.

> > +
> > + /*
> > + * Memory allocation/reclaim might be called from a WQ
> > + * context and the current implementation of the WQ
> > + * concurrency control doesn't recognize that
> > + * a particular WQ is congested if the worker thread is
> > + * looping without ever sleeping. Therefore we have to
> > + * do a short sleep here rather than calling
> > + * cond_resched().
> > + */
> > + if (current->flags & PF_WQ_WORKER)
> > + schedule_timeout(1);
>
> This schedule_timeout(1) does not sleep. You lost the fix as of next-20160317.
> Please update.

Yeah, I have that updated in my local patch already.
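For reference, the updated version uses the uninterruptible variant so the
task state is actually set to sleeping before the timeout, i.e.:

	if (current->flags & PF_WQ_WORKER)
		schedule_timeout_uninterruptible(1);

Plain schedule_timeout(1) returns immediately when the task is still in
TASK_RUNNING state, so it never gives the WQ concurrency control a chance
to see the worker as sleeping.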

Thanks!
--
Michal Hocko
SUSE Labs