Re: [RFC v3] writeback: add elastic bdi in cgwb bdp

From: Andrew Morton
Date: Tue Nov 12 2019 - 20:02:42 EST


On Tue, 12 Nov 2019 11:42:27 +0800 Hillf Danton <hdanton@xxxxxxxx> wrote:

>
> The elastic bdi (ebdi), which mirrors the bdi of the spinning
> disks, SSDs and USB keys on the market, is introduced for
> balancing dirty pages (bdp).
>
> The risk arises that the system runs out of free memory when too
> many dirty pages are produced too soon, so bdp is needed in the
> field.
>
> Ebdi facilitates bdp in elastic time intervals, e.g. from a jiffy
> to one HZ, depending on the time it would take to increase dirty
> pages by the amount defined by the variable ratelimit_pages.
>
> During cgroup writeback (cgwb) bdp, ebdi helps observe the changes
> both in cgwb's dirty pages (dirty speed) and in written-out pages
> (laundry speed) in elastic time intervals, until a balance is
> established between the two parties, i.e. the two speeds become
> statistically equal.
>
> The above mechanism of elastic equilibrium effectively prevents
> dirty-page hogs, as dirty pages get no chance to pile up, thus
> cutting the risk that system free memory falls to an unsafe level.
>
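Presumably the "elastic time interval" boils down to a throttled wait
loop along these lines. This is only a sketch of what cgwb_bdp() might
look like, assuming the interval grows from a jiffy toward one HZ as
described above; the actual patch may do it differently:

	static void cgwb_bdp(struct bdi_writeback *wb)
	{
		unsigned long interval = 1;	/* one jiffy */

		while (cgwb_bdp_should_throttle(wb)) {
			/* wb_workfn() wakes us after writing pages out */
			wait_event_timeout(wb->bdp_waitq,
					   !cgwb_bdp_should_throttle(wb),
					   interval);
			if (interval < HZ)
				interval <<= 1;	/* back off, capped at HZ */
		}
	}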
> Thanks to Rong Chen for testing.

That sounds like a Tested-by:

The changelog has no testing results. Please prepare results which
show, amongst other things, the change in performance when the kernel
isn't tight on memory, as well as the alteration in behaviour when
memory is short.

Generally, please work on making this code much more understandable?

>
> ...
>
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
> @@ -811,6 +811,8 @@ static long wb_split_bdi_pages(struct bd
>  	if (nr_pages == LONG_MAX)
>  		return LONG_MAX;
>
> +	return nr_pages;
> +
>  	/*
>  	 * This may be called on clean wb's and proportional distribution
>  	 * may not make sense, just use the original @nr_pages in those
> @@ -1604,6 +1606,7 @@ static long writeback_chunk_size(struct
>  		pages = min(pages, work->nr_pages);
>  		pages = round_down(pages + MIN_WRITEBACK_PAGES,
>  				   MIN_WRITEBACK_PAGES);
> +		pages = work->nr_pages;

It's unclear what this is doing, but it renders the three preceding
statements inoperative.
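
With this hunk applied, the whole else branch might as well read,
effect-wise (an illustration, not a suggested change):

	else
		pages = work->nr_pages;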

>  	}
>
>  	return pages;
> @@ -2092,6 +2095,9 @@ void wb_workfn(struct work_struct *work)
>  		wb_wakeup_delayed(wb);
>
>  	current->flags &= ~PF_SWAPWRITE;
> +
> +	if (waitqueue_active(&wb->bdp_waitq))
> +		wake_up_all(&wb->bdp_waitq);

Please add a comment explaining why this is being done here.
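
If the reasoning is that a completed flush may have let the laundry
speed catch up with the dirty speed, a comment along these lines would
do (the rationale here is inferred from the changelog, not taken from
the patch):

	/*
	 * Writeback work has just run, so the wb's laundry speed may
	 * have caught up with its dirty speed; wake any dirtiers
	 * throttled in cgwb_bdp() so they can recheck the balance.
	 */
	if (waitqueue_active(&wb->bdp_waitq))
		wake_up_all(&wb->bdp_waitq);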

>  }
>
> /*
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1830,6 +1830,67 @@ pause:
>  		wb_start_background_writeback(wb);
>  }
>
> +/**
> + * cgwb_bdp_should_throttle() - tell whether a wb should be throttled
> + * @wb: the bdi_writeback to throttle
> + *
> + * To avoid the risk of exhausting system free memory, we check and
> + * try hard to prevent too many dirty pages from being produced too
> + * soon.
> + *
> + * For cgroup writeback, it is essencially to keep an equilibrium

"it is essential"?

> + * between its dirty speed and laundry speed, i.e. in the ideal
> + * state dirty pages are written out as fast as they are produced.
> + */
> +static bool cgwb_bdp_should_throttle(struct bdi_writeback *wb)
> +{
> +	struct dirty_throttle_control gdtc = { GDTC_INIT_NO_WB };
> +
> +	if (fatal_signal_pending(current))
> +		return false;
> +
> +	gdtc.avail = global_dirtyable_memory();
> +
> +	domain_dirty_limits(&gdtc);
> +
> +	gdtc.dirty = global_node_page_state(NR_FILE_DIRTY) +
> +		     global_node_page_state(NR_UNSTABLE_NFS) +
> +		     global_node_page_state(NR_WRITEBACK);
> +
> +	if (gdtc.dirty < gdtc.bg_thresh)
> +		return false;
> +
> +	if (!writeback_in_progress(wb))
> +		wb_start_background_writeback(wb);

This is a bit ugly. Something called "bool cgwb_bdp_should_throttle()"
should just check whether we should throttle. But here it is, also
initiating writeback. That's an inappropriate thing for this function
to do?

Also, we don't know *why* this is being done here, because there's no
code comment explaining the reasoning to us.
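
One way to keep the predicate pure would be to hoist the kick into the
wait loop's caller, something like this sketch (assuming cgwb_bdp() is
the patch's wait loop; only the split matters here):

	/* kick the flusher once, outside the predicate */
	if (!writeback_in_progress(wb))
		wb_start_background_writeback(wb);

	while (cgwb_bdp_should_throttle(wb))
		wait_event_timeout(wb->bdp_waitq,
				   !cgwb_bdp_should_throttle(wb), HZ);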


> +	if (gdtc.dirty < gdtc.thresh)
> +		return false;
> +
> +	/*
> +	 * Throttle wb if there is the risk that wb's dirty speed is
> +	 * running away from its laundry speed, with statistical
> +	 * error taken into account.
> +	 */
> +	return wb_stat(wb, WB_DIRTIED) >
> +	       wb_stat(wb, WB_WRITTEN) + wb_stat_error();
> +}
> +
>
> ...
>
> @@ -1888,29 +1945,38 @@ void balance_dirty_pages_ratelimited(str
>  	 * 1000+ tasks, all of them start dirtying pages at exactly the same
>  	 * time, hence all honoured too large initial task->nr_dirtied_pause.
>  	 */
> -	p = this_cpu_ptr(&bdp_ratelimits);
> -	if (unlikely(current->nr_dirtied >= ratelimit))
> -		*p = 0;
> -	else if (unlikely(*p >= ratelimit_pages)) {
> -		*p = 0;
> -		ratelimit = 0;
> -	}
> +	dirty = this_cpu_ptr(&bdp_ratelimits);
> +
>  	/*
>  	 * Pick up the dirtied pages by the exited tasks. This avoids lots of
>  	 * short-lived tasks (eg. gcc invocations in a kernel build) escaping
>  	 * the dirty throttling and livelock other long-run dirtiers.
>  	 */
> -	p = this_cpu_ptr(&dirty_throttle_leaks);
> -	if (*p > 0 && current->nr_dirtied < ratelimit) {
> -		unsigned long nr_pages_dirtied;
> -		nr_pages_dirtied = min(*p, ratelimit - current->nr_dirtied);
> -		*p -= nr_pages_dirtied;
> -		current->nr_dirtied += nr_pages_dirtied;
> +	leak = this_cpu_ptr(&dirty_throttle_leaks);
> +
> +	if (*dirty + *leak < ratelimit_pages) {
> +		/*
> +		 * Nothing to do, as it would take some more time to
> +		 * use up ratelimit_pages.
> +		 */
> +		try_bdp = false;
> +	} else {
> +		try_bdp = true;
> +
> +		/*
> +		 * bdp in flight helps detect dirty page hogs soon
> +		 */

How? Please expand on this comment a lot.

> +		flights = this_cpu_ptr(&bdp_in_flight);
> +
> +		if ((*flights)++ & 1) {

What is that "& 1" doing?
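
Presumably it makes the branch fire on every second call on a given
CPU, since the post-increment means the old value is tested, i.e.
spelled out:

	unsigned long seq = (*flights)++;	/* old value is tested */

	if (seq & 1) {				/* calls 1, 3, 5, ... */
		...
	}

But why the leaked pages should be folded in only on alternate calls
needs explaining.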

> +			*dirty = *dirty + *leak - ratelimit_pages;
> +			*leak = 0;
> +		}
>  	}
>  	preempt_enable();
>
> -	if (unlikely(current->nr_dirtied >= ratelimit))
> -		balance_dirty_pages(wb, current->nr_dirtied);
> +	if (try_bdp)
> +		cgwb_bdp(wb);
>
>  	wb_put(wb);