Re: [PATCH v4 0/8] Remove dependency on congestion_wait in mm/
From: NeilBrown
Date: Tue Oct 26 2021 - 20:43:37 EST
On Sat, 23 Oct 2021, Mel Gorman wrote:
> On Fri, Oct 22, 2021 at 10:26:30PM +1100, NeilBrown wrote:
> > On Fri, 22 Oct 2021, Mel Gorman wrote:
> > > On Fri, Oct 22, 2021 at 12:15:10PM +1100, NeilBrown wrote:
> > >
> > > > In general, I still don't like the use of wake_up_all(), though it won't
> > > > cause incorrect behaviour.
> > > >
> > >
> > > Removing wake_up_all would be tricky.
> >
> > I think there is a misunderstanding. Removing wake_up_all() is as
> > simple as
> > s/wake_up_all/wake_up/
> >
> > If you used prepare_to_wait_exclusive(), then wake_up() would only wake
> > one waiter, while wake_up_all() would wake all of them.
> > As you use prepare_to_wait(), wake_up() will wake all waiters - as will
> > wake_up_all().
> >
>
> Ok, yes, there was a misunderstanding. I thought you were suggesting a
> move to exclusive wakeups. I felt that the wake_up_all was explicit in
> terms of intent and that I really meant for all tasks to wake instead of
> one at a time.
Fair enough. Thanks for changing it :-)
But this prompts me to wonder if exclusive wakeups would be a good idea
- which is a useful springboard to try to understand the code better.
For VMSCAN_THROTTLE_ISOLATED they probably are.
One pattern for reliable exclusive wakeups is for any thread that
received a wake-up to then consider sending a wake-up itself.
Two places receive VMSCAN_THROTTLE_ISOLATED wakeups and both then call
too_many_isolated() which - on success - sends another wakeup - before
the caller has had a chance to isolate anything. If, instead, the
wakeup was sent sometime later, after pages were isolated but before the
caller (isolate_migratepages_block() or shrink_inactive_list())
returned, then we would get an orderly progression of threads running
through that code.
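To make that concrete, here is roughly the shape I have in mind. It is
a sketch only, not against the actual mm/ code: "isolate_wq" is an
invented name and too_many_isolated() is shown with a simplified
signature:

static DECLARE_WAIT_QUEUE_HEAD(isolate_wq);

static void isolate_with_throttle(pg_data_t *pgdat)
{
	DEFINE_WAIT(wait);

	while (too_many_isolated(pgdat)) {
		/* Exclusive, so a plain wake_up() wakes just one of us. */
		prepare_to_wait_exclusive(&isolate_wq, &wait,
					  TASK_UNINTERRUPTIBLE);
		if (too_many_isolated(pgdat))
			schedule();
		finish_wait(&isolate_wq, &wait);
	}

	/* ... isolate pages here ... */

	/*
	 * Only now, after our isolation is done, do we consider
	 * passing the wakeup on to the next waiter.
	 */
	if (!too_many_isolated(pgdat))
		wake_up(&isolate_wq);
}

Each thread finishes its isolation before it releases the next waiter,
which is the orderly progression I mean.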
VMSCAN_THROTTLE_WRITEBACK is a little less straightforward.
There are two different places that wait for the wakeup, and a wake_up
is sent to all waiters after a time proportional to the number of
waiters. It might make sense to wake one thread per time unit?
That might work well for do_writepages() - every SWAP_CLUSTER_MAX
writes triggers one wakeup.
I'm less sure that it would work for shrink_node(). Maybe the
shrink_node() waiters could be non-exclusive so they get woken as soon
as SWAP_CLUSTER_MAX writes complete, while the do_writepages() waiters
are exclusive and get woken one at a time.
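In code, that mixed scheme might look something like this - again just
a sketch with invented names (writeback_wq, nr_written); the real
accounting would live wherever your series counts completed writes:

static DECLARE_WAIT_QUEUE_HEAD(writeback_wq);
static atomic_t nr_written = ATOMIC_INIT(0);

/* Called as each throttled write completes. */
static void note_write_done(void)
{
	if (atomic_inc_return(&nr_written) % SWAP_CLUSTER_MAX == 0)
		wake_up(&writeback_wq);
}

/* do_writepages() waiter: exclusive, released one at a time. */
static void throttle_writepages(long timeout)
{
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(&writeback_wq, &wait,
				  TASK_UNINTERRUPTIBLE);
	schedule_timeout(timeout);
	finish_wait(&writeback_wq, &wait);
}

/* shrink_node() waiter: non-exclusive, all released together. */
static void throttle_shrink_node(long timeout)
{
	DEFINE_WAIT(wait);

	prepare_to_wait(&writeback_wq, &wait, TASK_UNINTERRUPTIBLE);
	schedule_timeout(timeout);
	finish_wait(&writeback_wq, &wait);
}

The nice property is that each wake_up() releases every non-exclusive
(shrink_node()) waiter but only one exclusive (do_writepages()) waiter,
because non-exclusive entries sit at the head of the queue and the
wakeup walk stops after waking one exclusive task.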
For VMSCAN_THROTTLE_NOPROGRESS .... I don't understand.
If one zone isn't making "enough" progress, we throttle before moving on
to the next zone. So we delay processing of the next zone, and only
indirectly delay re-processing of the current congested zone.
Maybe it makes sense, but I don't see it yet. I note that the commit
message says "it's messy". I can't argue with that!
I'll follow up with patches to clarify what I am thinking about the
first two. I'm not proposing the patches, just presenting them as part
of improving my understanding.
Thanks,
NeilBrown