Re: [PATCH 0/3] remove rw_page() from brd, pmem and btt

From: Dan Williams
Date: Fri Aug 04 2017 - 14:01:18 EST


[ adding Dave, who is working on a blk-mq + dma offload version of the
pmem driver ]

On Fri, Aug 4, 2017 at 1:17 AM, Minchan Kim <minchan@xxxxxxxxxx> wrote:
> On Fri, Aug 04, 2017 at 12:54:41PM +0900, Minchan Kim wrote:
[..]
>> Thanks for the testing. Are your test numbers within the noise level?
>>
>> I can't understand why PMEM doesn't show much gain while BTT is a significant
>> win (8%). I guess the no-rw_page BTT test had more chances to wait on dynamic
>> bio allocation, and both my onstack-bio and rw_page tests reduced that
>> significantly. However, with no rw_page on pmem there weren't many waits for
>> bio allocation, because the device is so fast, so the number comes purely from
>> the count of instructions executed. At a quick glance at bio init/submit it's
>> not trivial, so I do understand where the 12% improvement comes from, but I'm
>> not sure it's a big enough difference in practice to justify the maintenance
>> burden.
>
> I ran pmbench 10 times on my local machine (4 cores) with zram swap.
> On my machine the onstack bio is even faster than rw_page. Unbelievable.
>
> I guess it's really hard to get stable results under severe memory pressure.
> The difference is within the noise level (see the stddev below), so I think
> it's hard to conclude that rw_page is far faster than the onstack bio.
>
> rw_page
> avg 5.54us
> stddev 8.89%
> max 6.02us
> min 4.20us
>
> onstack bio
> avg 5.27us
> stddev 13.03%
> max 5.96us
> min 3.55us
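
For reference, the "onstack bio" variant being compared here boils down to a
one-page synchronous submission along these lines. This is a rough sketch
against the current block APIs; swap_rw_page_sync() is an illustrative name,
not the function from the actual patch:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Sketch only: a single-page synchronous read/write without bio_alloc(). */
static int swap_rw_page_sync(struct block_device *bdev, sector_t sector,
                             struct page *page, bool is_write)
{
        struct bio bio;
        struct bio_vec bvec;

        /* No bio_alloc()/mempool round trip: the bio lives on the stack. */
        bio_init(&bio, &bvec, 1);
        bio.bi_bdev = bdev;
        bio.bi_iter.bi_sector = sector;
        bio.bi_opf = (is_write ? REQ_OP_WRITE : REQ_OP_READ) | REQ_SYNC;
        bio_add_page(&bio, page, PAGE_SIZE, 0);

        /* Wait for completion before returning. */
        return submit_bio_wait(&bio);
}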

The maintenance burden of having alternative submission paths is
significant, especially as we consider the pmem driver using more
services of the core block layer. Ideally, I'd want to complete the
rw_page removal work before we look at the blk-mq + dma offload
reworks.
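
To make that burden concrete: every caller of the rw_page interface also has
to carry the bio path as a fallback, so the same single-page I/O ends up being
expressed twice. Roughly, and only as a simplified paraphrase of the pattern
in the swap read path rather than the actual mm/page_io.c code
(read_one_page() is an illustrative name):

/* Sketch of the caller-side duplication that rw_page removal gets rid of. */
static int read_one_page(struct block_device *bdev, sector_t sector,
                         struct page *page)
{
        /* Path 1: the driver's ->rw_page hook, when it implements one. */
        if (!bdev_read_page(bdev, sector, page))
                return 0;

        /*
         * Path 2: the ordinary bio path for exactly the same request,
         * e.g. the onstack-bio sketch above.
         */
        return swap_rw_page_sync(bdev, sector, page, false);
}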

The change to introduce BDI_CAP_SYNC is interesting because we might
have a use for switching between dma offload and cpu copy based on
whether the I/O is synchronous or otherwise hinted to be a low-latency
request. Right now the dma offload patches are using "bio_segments() >
1" as the gate for selecting offload vs cpu copy, which seems
inadequate.
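
For illustration, the kind of gate I mean would look roughly like this. It is
a sketch only: pmem_dma_submit() and pmem_cpu_copy() are hypothetical names,
and the REQ_SYNC test stands in for whatever hint BDI_CAP_SYNC ends up
plumbing through to the driver:

/* Hypothetical helpers standing in for the offload and memcpy paths. */
static blk_qc_t pmem_dma_submit(struct pmem_device *pmem, struct bio *bio);
static blk_qc_t pmem_cpu_copy(struct pmem_device *pmem, struct bio *bio);

static blk_qc_t pmem_make_request(struct request_queue *q, struct bio *bio)
{
        struct pmem_device *pmem = q->queuedata;

        /*
         * Today's gate is effectively "bio_segments(bio) > 1".  Also
         * honoring a low-latency hint keeps synchronous I/O (page faults,
         * swap) on the cpu-copy path even when it spans segments.
         */
        if (bio_segments(bio) > 1 && !(bio->bi_opf & REQ_SYNC))
                return pmem_dma_submit(pmem, bio);

        return pmem_cpu_copy(pmem, bio);
}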