Re: [PATCH 3/3] block/diskstats: more accurate approximation of io_ticks for slow disks

From: Alan Jenkins
Date: Tue May 07 2019 - 05:14:54 EST


On 01/04/2019 16:59, Konstantin Khlebnikov wrote:
Currently io_ticks is approximated by adding one at each
start and end of requests if jiffies has changed.
This works perfectly for requests shorter than a jiffy.

Fix for slow requests is simple: at the end of request add
count of jiffies passed since last update.

Fixes: 5b18b5a73760 ("block: delete part_round_stats and switch to less precise counting")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>

Thanks for working on this! I noticed the problem behaviour using the Fedora 29 kernel [1]. I wasn't sure how it could be fixed. Now I found this patch series, but I still have some questions :-).

[1] https://unix.stackexchange.com/questions/517132/dd-is-running-at-full-speed-but-i-only-see-20-disk-utilization-why

With these patches, `atopsar -d 2` shows about 100% "busy" when running a simple `dd` command, instead of 20 or 35%. So that looked promising.

I saw some samples showing 112, 113, and 114% utilization. Unfortunately I'm not sure exactly how to reproduce that. I think it happened during filesystem buffered writes (i.e. `dd` without `oflag=direct`), with the IO scheduler set to "none", on my SATA HDD.

Getting some "101% busy" samples seemed fairly easy to trigger, but I am not sure whether that is just a rounding error in `atopsar` :-(.


Q1)

> Fix for slow requests is simple: at the end of request add
> count of jiffies passed since last update.

Even considering the simple case of a single CPU, the approximation may be "less accurate" when requests complete out of order. Is that right?

t 1 10 20 30
io1 start end
io2 start end
io_ticks 1 2 11 21

^^^^^^
\
9 ticks not accounted as "busy"


At least, I found something fun happens if I run `dd if=/dev/urandom of=~/t bs=1M oflag=direct` at the same time as `dd if=/dev/sda of=/dev/null bs=1M oflag=direct` . With scheduler=none, it reduces "busy" from 100%, down to 97%. With scheduler=mq-deadline, it reduces "busy" from 100% to 60% :-). Even though the system "iowait" is about 100% (equivalent to one CPU).

(Environment: My /dev/sda max read speed is about 150MB/s. My /dev/urandom read speed is about 140 MB/s. I have 4 logical CPUs).

It feels like it should be possible to improve io_ticks, by basically changing the condition in your fix, from (end==true), to (inflight>0). Would that make sense?


Q2) But what most confuses me, is that I think `io_ticks` is defined as a per-cpu field. And the per-cpu fields are summed into one value, which is reported to userspace.

I have tried playing with a couple `dd iflag=direct` readers, using `taskset` to pin them to different CPUs, but I only got 98-102% "busy". It did not rise to 200% :-).

i) Is it possible to see 200% "busy", due to per-cpu ioticks?

ii) If so, then can per-cpu ioticks also cause us to see 100% "busy", when IO's were only inflight for 50% of the time (but on two different CPUs)?

If it is possible to trigger both of these cases, this metric seems very difficult to trust and use in any reliable way. It seems preferable to disable it altogether and force people to use more trustworthy metrics. E.g. system-wide "iowait", and iostat field 11 "weighted # of milliseconds spent doing I/Os" which `iostat` uses to show "average queue length".

Or the "special pleading" approach? Should ioticks accounted at the level of the hardware queue, and be disabled if the device has is using more than one hardware queue?


Q3) In case I am mistaken in some way, and Q2 is not an issue at all:

I still think reporting over 100% utilization is something new. At least the comments I see were removed in the "Fixes" commit seem to agree. That one error of 14% ("114% busy") that I saw, seems fairly big :-).

I wonder if we can better explain how much of a rough approximation this metric is now, e.g. in Documentation/iostats.txt ?

So far I don't know what the real description and limitations would be...

I think it would be understandable e.g. if we were able to say "busy%" should show around 100% when the device is used around 100% of the time, and definitely 0% when it is idle, but is probably not as accurate as "iowait". ("iowait" reported by the CPU scheduler. Not to say these are the same or equivalent. And I understand "iowait" is another ball of approximations and confusion.)


Regards
Alan


---
block/bio.c | 8 ++++----
block/blk-core.c | 4 ++--
include/linux/genhd.h | 2 +-
3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index c0a60f3e9b7b..245056797999 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1729,14 +1729,14 @@ void bio_check_pages_dirty(struct bio *bio)
schedule_work(&bio_dirty_work);
}
-void update_io_ticks(struct hd_struct *part, unsigned long now)
+void update_io_ticks(struct hd_struct *part, unsigned long now, bool end)
{
unsigned long stamp;
again:
stamp = READ_ONCE(part->stamp);
if (unlikely(stamp != now)) {
if (likely(cmpxchg(&part->stamp, stamp, now) == stamp)) {
- __part_stat_add(part, io_ticks, 1);
+ __part_stat_add(part, io_ticks, end ? now - stamp : 1);
}
}
if (part->partno) {
@@ -1752,7 +1752,7 @@ void generic_start_io_acct(struct request_queue *q, int op,
part_stat_lock();
- update_io_ticks(part, jiffies);
+ update_io_ticks(part, jiffies, false);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, sectors[sgrp], sectors);
part_inc_in_flight(q, part, op_is_write(op));
@@ -1770,7 +1770,7 @@ void generic_end_io_acct(struct request_queue *q, int req_op,
part_stat_lock();
- update_io_ticks(part, now);
+ update_io_ticks(part, now, true);
part_stat_add(part, nsecs[sgrp], jiffies_to_nsecs(duration));
part_dec_in_flight(q, part, op_is_write(req_op));
diff --git a/block/blk-core.c b/block/blk-core.c
index d89168b167e9..6e8f0b9e7731 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1334,7 +1334,7 @@ void blk_account_io_done(struct request *req, u64 now)
part_stat_lock();
part = req->part;
- update_io_ticks(part, jiffies);
+ update_io_ticks(part, jiffies, true);
part_stat_inc(part, ios[sgrp]);
part_stat_add(part, nsecs[sgrp], now - req->start_time_ns);
part_dec_in_flight(req->q, part, rq_data_dir(req));
@@ -1375,7 +1375,7 @@ void blk_account_io_start(struct request *rq, bool new_io)
rq->part = part;
}
- update_io_ticks(part, jiffies);
+ update_io_ticks(part, jiffies, false);
part_stat_unlock();
}
diff --git a/include/linux/genhd.h b/include/linux/genhd.h
index 2f5a9ed7e86e..8ece8e02c609 100644
--- a/include/linux/genhd.h
+++ b/include/linux/genhd.h
@@ -410,7 +410,7 @@ static inline void free_part_info(struct hd_struct *part)
kfree(part->info);
}
-void update_io_ticks(struct hd_struct *part, unsigned long now);
+void update_io_ticks(struct hd_struct *part, unsigned long now, bool end);
/* block/genhd.c */
extern void device_add_disk(struct device *parent, struct gendisk *disk,