Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt
From: Ross Zwisler
Date: Fri Aug 04 2017 - 14:21:18 EST
On Fri, Aug 04, 2017 at 11:01:08AM -0700, Dan Williams wrote:
> [ adding Dave who is working on a blk-mq + dma offload version of the
> pmem driver ]
> On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> > On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
> >> Thanks for the testing. Your testing number is within noise level?
> >> I cannot understand why PMEM doesn't have enough gain while BTT is significant
> >> win(8%). I guess no rw_page with BTT testing had more chances to wait bio dynamic
> >> allocation and mine and rw_page testing reduced it significantly. However,
> >> in no rw_page with pmem, there wasn't many cases to wait bio allocations due
> >> to the device is so fast so the number comes from purely the number of
> >> instructions has done. At a quick glance of bio init/submit, it's not trivial
> >> so indeed, i understand where the 12% enhancement comes from but I'm not sure
> >> it's really big difference in real practice at the cost of maintenance burden.
> > I tested pmbench 10 times in my local machine(4 core) with zram-swap.
> > In my machine, even, on-stack bio is faster than rw_page. Unbelievable.
> > I guess it's really hard to get stable result in severe memory pressure.
> > It would be a result within noise level(see below stddev).
> > So, I think it's hard to conclude rw_page is far faster than onstack-bio.
> > rw_page
> > avg 5.54us
> > stddev 8.89%
> > max 6.02us
> > min 4.20us
> > onstack bio
> > avg 5.27us
> > stddev 13.03%
> > max 5.96us
> > min 3.55us
> The maintenance burden of having alternative submission paths is
> significant especially as we consider the pmem driver using more
> services of the core block layer. Ideally, I'd want to complete the
> rw_page removal work before we look at the blk-mq + dma offload
> The change to introduce BDI_CAP_SYNC is interesting because we might
> have use for switching between dma offload and cpu copy based on
> whether the I/O is synchronous or otherwise hinted to be a low latency
> request. Right now the dma offload patches are using "bio_segments() >
> 1" as the gate for selecting offload vs cpu copy which seem
Okay, so based on the feedback above and from Jens, it sounds like we want
to go forward with removing the rw_page() interface, and instead optimize the
regular I/O path via on-stack bios and dma offload, correct?
If so, I'll prepare patches that fully remove the rw_page() code, and let
Minchan and Dave work on their optimizations.