Re: Re: [PATCH v8 2/3] xen/blkback: Squeeze page pools if a memory pressure is detected

From: SeongJae Park
Date: Fri Dec 13 2019 - 10:03:04 EST


On Fri, 13 Dec 2019 15:34:35 +0100 "Roger Pau MonnÃ" <roger.pau@xxxxxxxxxx> wrote:

> > Each `blkif` has a free pages pool for the grant mapping. The size of
> > the pool starts from zero and is increased on demand while processing
> > the I/O requests. If current I/O requests handling is finished or 100
> > milliseconds has passed since last I/O requests handling, it checks and
> > shrinks the pool to not exceed the size limit, `max_buffer_pages`.
> >
> > Therefore, host administrators can cause memory pressure in blkback by
> > attaching a large number of block devices and inducing I/O. Such
> > problematic situations can be avoided by limiting the maximum number of
> > devices that can be attached, but finding the optimal limit is not so
> > easy. Improper set of the limit can results in memory pressure or a
> > resource underutilization. This commit avoids such problematic
> > situations by squeezing the pools (returns every free page in the pool
> > to the system) for a while (users can set this duration via a module
> > parameter) if memory pressure is detected.
> >
> > Discussions
> > ===========
> >
> > The `blkback`'s original shrinking mechanism returns only pages in the
> > pool which are not currently be used by `blkback` to the system. In
> > other words, the pages that are not mapped with granted pages. Because
> > this commit is changing only the shrink limit but still uses the same
> > freeing mechanism it does not touch pages which are currently mapping
> > grants.
> >
> > Once memory pressure is detected, this commit keeps the squeezing limit
> > for a user-specified time duration. The duration should be neither too
> > long nor too short. If it is too long, the squeezing incurring overhead
> > can reduce the I/O performance. If it is too short, `blkback` will not
> > free enough pages to reduce the memory pressure. This commit sets the
> > value as `10 milliseconds` by default because it is a short time in
> > terms of I/O while it is a long time in terms of memory operations.
> > Also, as the original shrinking mechanism works for at least every 100
> > milliseconds, this could be a somewhat reasonable choice. I also tested
> > other durations (refer to the below section for more details) and
> > confirmed that 10 milliseconds is the one that works best with the test.
> > That said, the proper duration depends on actual configurations and
> > workloads. That's why this commit allows users to set the duration as a
> > module parameter.
> >
> > Memory Pressure Test
> > ====================
> >
> > To show how this commit fixes the memory pressure situation well, I
> > configured a test environment on a xen-running virtualization system.
> > On the `blkfront` running guest instances, I attach a large number of
> > network-backed volume devices and induce I/O to those. Meanwhile, I
> > measure the number of pages that swapped in (pswpin) and out (pswpout)
> > on the `blkback` running guest. The test ran twice, once for the
> > `blkback` before this commit and once for that after this commit. As
> > shown below, this commit has dramatically reduced the memory pressure:
> >
> > pswpin pswpout
> > before 76,672 185,799
> > after 212 3,325
> >
> > Optimal Aggressive Shrinking Duration
> > -------------------------------------
> >
> > To find a best squeezing duration, I repeated the test with three
> > different durations (1ms, 10ms, and 100ms). The results are as below:
> >
> > duration pswpin pswpout
> > 1 852 6,424
> > 10 212 3,325
> > 100 203 3,340
> >
> > As expected, the memory pressure has decreased as the duration is
> > increased, but the reduction stopped from the `10ms`. Based on this
> > results, I chose the default duration as 10ms.
> >
> > Performance Overhead Test
> > =========================
> >
> > This commit could incur I/O performance degradation under severe memory
> > pressure because the squeezing will require more page allocations per
> > I/O. To show the overhead, I artificially made a worst-case squeezing
> > situation and measured the I/O performance of a `blkfront` running
> > guest.
> >
> > For the artificial squeezing, I set the `blkback.max_buffer_pages` using
> > the `/sys/module/xen_blkback/parameters/max_buffer_pages` file. In this
> > test, I set the value to `1024` and `0`. The `1024` is the default
> > value. Setting the value as `0` is same to a situation doing the
> > squeezing always (worst-case).
> >
> > For the I/O performance measurement, I run a simple `dd` command 5 times
> > as below and collect the 'MB/s' results.
> >
> > $ for i in {1..5}; do dd if=/dev/zero of=file \
> > bs=4k count=$((256*512)); sync; done
> >
> > If the underlying block device is slow enough, the squeezing overhead
> > could be hidden. For the reason, I do this test for both a slow block
> > device and a fast block device. I use a popular cloud block storage
> > service, ebs[1] as a slow device and the ramdisk block device[2] for the
> > fast device.
> >
> > The results are as below. 'max_pgs' represents the value of the
> > `blkback.max_buffer_pages` parameter.
> >
> > On the slow block device
> > ------------------------
> >
> > max_pgs Min Max Median Avg Stddev
> > 0 38.7 45.8 38.7 40.12 3.1752165
> > 1024 38.7 45.8 38.7 40.12 3.1752165
> > No difference proven at 95.0% confidence
> >
> > On the fast block device
> > ------------------------
> >
> > max_pgs Min Max Median Avg Stddev
> > 0 417 423 420 419.4 2.5099801
> > 1024 414 425 416 417.8 4.4384682
> > No difference proven at 95.0% confidence
> >
> > In short, even worst case squeezing on ramdisk based fast block device
> > makes no visible performance degradation. Please note that this is just
> > a very simple and minimal test. On systems using super-fast block
> > devices and a special I/O workload, the results might be different. If
> > you have any doubt, test on your machine with your workload to find the
> > optimal squeezing duration for you.
> >
> > [1] https://aws.amazon.com/ebs/
> > [2] https://www.kernel.org/doc/html/latest/admin-guide/blockdev/ramdisk.html
> >
> > Reviewed-by: Juergen Gross <jgross@xxxxxxxx>
>
> You should likely have dropped Juergen RB, since you made some
> non-trivial changes.

Yes, I will!

>
> > Signed-off-by: SeongJae Park <sjpark@xxxxxxxxx>
> > ---
> > .../ABI/testing/sysfs-driver-xen-blkback | 9 ++++++++
> > drivers/block/xen-blkback/blkback.c | 22 +++++++++++++++++--
> > drivers/block/xen-blkback/common.h | 2 ++
> > drivers/block/xen-blkback/xenbus.c | 11 +++++++++-
> > 4 files changed, 41 insertions(+), 3 deletions(-)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-driver-xen-blkback b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > index 4e7babb3ba1f..a74a6d513c9f 100644
> > --- a/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > +++ b/Documentation/ABI/testing/sysfs-driver-xen-blkback
> > @@ -25,3 +25,12 @@ Description:
> > allocated without being in use. The time is in
> > seconds, 0 means indefinitely long.
> > The default is 60 seconds.
> > +
> > +What: /sys/module/xen_blkback/parameters/buffer_squeeze_duration_ms
> > +Date: December 2019
> > +KernelVersion: 5.5
> > +Contact: Roger Pau Monnï <roger.pau@xxxxxxxxxx>
>
> I think you should be the contact for this feature, you are the one
> that implemented it :).
>
> > +Description:
> > + How long the block backend buffers release every free pages in
> > + those under memory pressure. The time is in milliseconds.
>
> "When memory pressure is reported to blkback this option controls the
> duration in milliseconds that blkback will not cache any page not
> backed by a grant mapping. The default is 10ms."

Great, will change!

>
> > + The default is 10 milliseconds.
> > diff --git a/drivers/block/xen-blkback/blkback.c b/drivers/block/xen-blkback/blkback.c
> > index fd1e19f1a49f..26606c4896fd 100644
> > --- a/drivers/block/xen-blkback/blkback.c
> > +++ b/drivers/block/xen-blkback/blkback.c
> > @@ -142,6 +142,21 @@ static inline bool persistent_gnt_timeout(struct persistent_gnt *persistent_gnt)
> > HZ * xen_blkif_pgrant_timeout);
> > }
> >
> > +/* Once a memory pressure is detected, squeeze free page pools for a while. */
> > +static unsigned int buffer_squeeze_duration_ms = 10;
> > +module_param_named(buffer_squeeze_duration_ms,
> > + buffer_squeeze_duration_ms, int, 0644);
> > +MODULE_PARM_DESC(buffer_squeeze_duration_ms,
> > +"Duration in ms to squeeze pages buffer when a memory pressure is detected");
>
> I would place this in xenbus.c so that you don't need the
> xen_blkbk_update_buffer_squeeze_end helper, and can just set
> blkif->buffer_squeeze_end from xen_blkbk_reclaim_memory.

Good point, I will!

>
> > +
> > +static unsigned long buffer_squeeze_end;
>
> This variable should be removed...
>
> > +
> > +void xen_blkbk_update_buffer_squeeze_end(struct xen_blkif *blkif)
> > +{
> > + blkif->buffer_squeeze_end = jiffies +
> > + msecs_to_jiffies(buffer_squeeze_duration_ms);
> > +}
> > +
> > static inline int get_free_page(struct xen_blkif_ring *ring, struct page **page)
> > {
> > unsigned long flags;
> > @@ -656,8 +671,11 @@ int xen_blkif_schedule(void *arg)
> > ring->next_lru = jiffies + msecs_to_jiffies(LRU_INTERVAL);
> > }
> >
> > - /* Shrink if we have more than xen_blkif_max_buffer_pages */
> > - shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);
> > + /* Shrink the free pages pool if it is too large. */
> > + if (time_before(jiffies, buffer_squeeze_end))
>
> ... and this comparison needs to use blkif->buffer_squeeze_end
> instead.

Ooops, I made so dumb mistakes... Will fix it.

>
> > + shrink_free_pagepool(ring, 0);
> > + else
> > + shrink_free_pagepool(ring, xen_blkif_max_buffer_pages);
> >
> > if (log_stats && time_after(jiffies, ring->st_print))
> > print_stats(ring);
> > diff --git a/drivers/block/xen-blkback/common.h b/drivers/block/xen-blkback/common.h
> > index 1d3002d773f7..ba653126177d 100644
> > --- a/drivers/block/xen-blkback/common.h
> > +++ b/drivers/block/xen-blkback/common.h
> > @@ -319,6 +319,7 @@ struct xen_blkif {
> > /* All rings for this device. */
> > struct xen_blkif_ring *rings;
> > unsigned int nr_rings;
> > + unsigned long buffer_squeeze_end;
> > };
> >
> > struct seg_buf {
> > @@ -383,6 +384,7 @@ irqreturn_t xen_blkif_be_int(int irq, void *dev_id);
> > int xen_blkif_schedule(void *arg);
> > int xen_blkif_purge_persistent(void *arg);
> > void xen_blkbk_free_caches(struct xen_blkif_ring *ring);
> > +void xen_blkbk_update_buffer_squeeze_end(struct xen_blkif *blkif);
> >
> > int xen_blkbk_flush_diskcache(struct xenbus_transaction xbt,
> > struct backend_info *be, int state);
> > diff --git a/drivers/block/xen-blkback/xenbus.c b/drivers/block/xen-blkback/xenbus.c
> > index b90dbcd99c03..09fe6cb5c4ea 100644
> > --- a/drivers/block/xen-blkback/xenbus.c
> > +++ b/drivers/block/xen-blkback/xenbus.c
> > @@ -824,6 +824,14 @@ static void frontend_changed(struct xenbus_device *dev,
> > }
> >
> >
>
> I would place the module_param_named instance here, so it's close as
> possible to it's only user.

Good suggestion!

>
> > +void xen_blkbk_reclaim_memory(struct xenbus_device *dev)
>
> This can be static and drop the xen_blkbk prefix AFAICT.
>
> > +{
> > + struct backend_info *be = dev_get_drvdata(&dev->dev);
> > +
> > + xen_blkbk_update_buffer_squeeze_end(be->blkif);
>
> Set blkif->buffer_squeeze_end here.
>
> > +}
> > +
> > +
>
> Extra newline.

I thought its a rule to use two newlines between functions here, but seems it
was just a trivial nit. Will fix and send next version soon!


Thanks,
SeongJae Park

>
> Thanks, Roger.
>