Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

From: Minchan Kim
Date: Thu Aug 03 2017 - 23:54:49 EST


On Thu, Aug 03, 2017 at 03:13:35PM -0600, Ross Zwisler wrote:
> On Thu, Aug 03, 2017 at 09:13:15AM +0900, Minchan Kim wrote:
> > Hi Ross,
> >
> > On Wed, Aug 02, 2017 at 04:13:59PM -0600, Ross Zwisler wrote:
> > > On Fri, Jul 28, 2017 at 10:31:43AM -0700, Matthew Wilcox wrote:
> > > > On Fri, Jul 28, 2017 at 10:56:01AM -0600, Ross Zwisler wrote:
> > > > > Dan Williams and Christoph Hellwig have recently expressed doubt about
> > > > > whether the rw_page() interface made sense for synchronous memory drivers
> > > > > [1][2]. It's unclear whether this interface has any performance benefit
> > > > > for these drivers, but as we continue to fix bugs it is clear that it does
> > > > > have a maintenance burden. This series removes the rw_page()
> > > > > implementations in brd, pmem and btt to relieve this burden.
> > > >
> > > > Why don't you measure whether it has performance benefits? I don't
> > > > understand why zram would see performance benefits and not other drivers.
> > > > If it's going to be removed, then the whole interface should be removed,
> > > > not just have the implementations removed from some drivers.
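
For anyone following along: the interface in question is the ->rw_page hook
in struct block_device_operations, reached through bdev_read_page() and
bdev_write_page(). Condensed from memory against 4.13 (a sketch, not the
exact mainline code), the swap read side does roughly:

    /* mm/page_io.c, swap_readpage(): try the driver's synchronous
     * ->rw_page first; 0 means the page was transferred without a
     * bio, any error falls back to the normal bio path. */
    ret = bdev_read_page(sis->bdev, swap_page_sector(page), page);
    if (!ret) {
        count_vm_event(PSWPIN);
        return 0;
    }
    /* fall back: allocate a bio with get_swap_bio() and submit it */
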
> > >
> > > Okay, I've run a bunch of performance tests with the PMEM and BTT entry
> > > points for rw_page() in a swap workload, and in all cases I see an
> > > improvement over the code with rw_page() removed. Here are the results
> > > from my random lab box:
> > >
> > > Average latency of swap_writepage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +------+------------+---------+-------------+
> > > | PMEM | 5.0 us     | 4.7 us  | 6%          |
> > > +------+------------+---------+-------------+
> > > | BTT  | 6.8 us     | 6.1 us  | 10%         |
> > > +------+------------+---------+-------------+
> > >
> > > Average latency of swap_readpage()
> > > +------+------------+---------+-------------+
> > > |      | no rw_page | rw_page | Improvement |
> > > +------+------------+---------+-------------+
> > > | PMEM | 3.3 us     | 2.9 us  | 12%         |
> > > +------+------------+---------+-------------+
> > > | BTT  | 3.7 us     | 3.4 us  | 8%          |
> > > +------+------------+---------+-------------+
> > >
> > > The workload was pmbench, a memory benchmark, run on a system where I had
> > > severely restricted the amount of available memory with the 'mem' kernel
> > > command line parameter. The benchmark was set up to test more memory than
> > > I allowed the OS to have, so it spilled over into swap.
> > >
> > > The PMEM or BTT device was set up as my swap device, and during the test I got
> > > a few hundred thousand samples each of swap_writepage() and
> > > swap_readpage(). The PMEM/BTT device was just memory reserved with the
> > > memmap kernel command line parameter.
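
For anyone wanting to reproduce the setup: a boot configuration along these
lines (the numbers are made up; the real values depend on the machine's
memory map) would look something like

    mem=8G memmap=16G!16G

i.e. cap the OS at 8G of RAM and reserve 16G starting at the 16G physical
offset as a legacy pmem region, which then shows up as /dev/pmem0.
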
> > >
> > > Thanks, Matthew, for asking for performance data. It looks like removing this
> > > code would have been a mistake.
> >
> > At Christoph Hellwig's suggestion, I made a quick patch that does swap IO
> > without dynamic bio allocation. It's not yet a formal patch worth sending
> > to mainline, but I believe it's enough to test the improvement.
> >
> > Could you test the patchset on pmem and btt without rw_page?
> >
> > For the patch to take effect, a block driver needs to declare itself a
> > synchronous IO device via BDI_CAP_SYNC. If that's hard, you can instead
> > force every swap IO down the synchronous path by removing the condition
> > check
> >
> >     if (!(sis->flags & SWP_SYNC_IO))
> >
> > in swap_[read|write]page.
> >
> > Patchset is based on 4.13-rc3.
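
To make it concrete, the shape of the check I mean is below. SWP_SYNC_IO and
the two helper names come from my RFC, not mainline, so treat this as a
sketch against 4.13-rc3 rather than the literal patch:

    /* in swap_readpage() (and analogously in swap_writepage()): */
    struct swap_info_struct *sis = page_swap_info(page);

    if (!(sis->flags & SWP_SYNC_IO))
        return swap_readpage_bio(page);   /* existing: bio_alloc() + async */

    return swap_readpage_sync(page);      /* new: on-stack bio, synchronous */
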
>
> Thanks for the patch, here are the updated results from my test box:
>
> Average latency of swap_writepage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +------+------------+---------+---------+
> | PMEM | 5.0 us     | 4.98 us | 4.7 us  |
> +------+------------+---------+---------+
> | BTT  | 6.8 us     | 6.3 us  | 6.1 us  |
> +------+------------+---------+---------+
>
> Average latency of swap_readpage()
> +------+------------+---------+---------+
> |      | no rw_page | minchan | rw_page |
> +------+------------+---------+---------+
> | PMEM | 3.3 us     | 3.27 us | 2.9 us  |
> +------+------------+---------+---------+
> | BTT  | 3.7 us     | 3.44 us | 3.4 us  |
> +------+------------+---------+---------+
>
> I've added another digit of precision in some cases to help differentiate
> the various results.
>
> In all cases your patch performed better than the path with a dynamically
> allocated bio, but in all cases the rw_page() path was still the fastest,
> even if only marginally.

Thanks for testing. Are your numbers within the noise level?

I can't understand why PMEM doesn't show much gain while BTT is a significant
win (8%). My guess is that in the no-rw_page BTT test there were more chances
to wait on dynamic bio allocation, and both my patch and the rw_page path
reduced that wait significantly. With no-rw_page on pmem, however, there
weren't many cases that had to wait on bio allocation because the device is
so fast, so the difference comes purely from the number of instructions
executed. At a quick glance, bio init/submit is not trivial, so I can see
where the 12% enhancement comes from, but I'm not sure it's a big enough
difference in real practice to justify the maintenance burden.
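
For reference, the core of the sync path in my patch is just an on-stack bio,
roughly like this (again a sketch against 4.13-rc3; the function name is
illustrative and BDI_CAP_SYNC is from the RFC):

    static int swap_page_sync_io(struct swap_info_struct *sis,
                                 struct page *page, bool is_write)
    {
        struct bio bio;
        struct bio_vec bvec;

        /* no bio_alloc(): the hot path never touches the bio mempool */
        bio_init(&bio, &bvec, 1);
        bio.bi_bdev = sis->bdev;
        bio.bi_iter.bi_sector = swap_page_sector(page);
        bio.bi_opf = is_write ? REQ_OP_WRITE : REQ_OP_READ;
        bio_add_page(&bio, page, PAGE_SIZE, 0);

        /* the device advertised BDI_CAP_SYNC, so waiting here is cheap:
         * no context switch into an asynchronous completion path */
        return submit_bio_wait(&bio);
    }

That is all the patch really buys over the plain bio path, which is why I'd
expect its win to track how often the bio mempool makes callers wait.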