Re: [PATCH] block, bfq: keep peak_rate estimation within range 1..2^32-1

From: Konstantin Khlebnikov
Date: Mon Mar 26 2018 - 10:18:28 EST


On 26.03.2018 17:06, Paolo Valente wrote:


Il giorno 26 mar 2018, alle ore 12:28, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto:



On 26.03.2018 11:01, Paolo Valente wrote:
Il giorno 21 mar 2018, alle ore 00:49, Paolo Valente <paolo.valente@xxxxxxxxxx> ha scritto:



Il giorno 20 mar 2018, alle ore 15:41, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto:

On 20.03.2018 06:00, Paolo Valente wrote:
Il giorno 19 mar 2018, alle ore 14:28, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto:

On 19.03.2018 09:03, Paolo Valente wrote:
Il giorno 05 mar 2018, alle ore 04:48, Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx> ha scritto:

Rate should never overflow or become zero because it is used as divider.
This patch accumulates it with saturation.

Signed-off-by: Konstantin Khlebnikov <khlebnikov@xxxxxxxxxxxxxx>
---
block/bfq-iosched.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/block/bfq-iosched.c b/block/bfq-iosched.c
index aeca22d91101..a236c8d541b5 100644
--- a/block/bfq-iosched.c
+++ b/block/bfq-iosched.c
@@ -2546,7 +2546,8 @@ static void bfq_reset_rate_computation(struct bfq_data *bfqd,

static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
{
- u32 rate, weight, divisor;
+ u32 weight, divisor;
+ u64 rate;

/*
* For the convergence property to hold (see comments on
@@ -2634,9 +2635,10 @@ static void bfq_update_rate_reset(struct bfq_data *bfqd, struct request *rq)
*/
bfqd->peak_rate *= divisor-1;
bfqd->peak_rate /= divisor;
- rate /= divisor; /* smoothing constant alpha = 1/divisor */
+ do_div(rate, divisor); /* smoothing constant alpha = 1/divisor */

- bfqd->peak_rate += rate;
+ /* rate should never overlow or become zero */
It is bfqd->peak_rate that is used as a divider, and bfqd->peak_rate doesn't risk to be zero even if the variable 'rate' is zero here.
So I guess the reason why you consider the possibility that bfqd->peak_rate becomes zero is because of an overflow when summing 'rate'. But, according to my calculations, this should be impossible with devices with sensible speeds.
These are the reasons why I decided I could make it with a 32-bit variable, without any additional clamping. Did I make any mistake in my evaluation?

According to Murphy's law this is inevitable..

Yep. Actually Murphy has been even clement this time, by making the
failure occur to a kernel expert :)
I've seen couple division by zero crashes in bfq_wr_duration.
Unfortunately logs weren't recorded.

Anyway, even if I made some mistake about the maximum possible value of the device rate, and the latter may be too high for bfqd->peak_rate to contain it, then I guess the right solution would not be to clamp the actual rate to U32_MAX, but to move bfqd->peak_rate to 64 bits. Or am I missing something else?
+ bfqd->peak_rate = clamp_t(u64, rate + bfqd->peak_rate, 1, U32_MAX);

32-bit should be enough and better for division.
My patch makes sure it never overflows/underflows.
That's cheaper than full 64-bit/64-bit division.
Anyway 64-bit speed could overflow too. =)

I see your point. Still, if the mistake is not in sizing, then you
bumped into some odd bug. In this respect, I don't like much the idea
of sweeping the dust under the carpet. So, let me ask you for a
little bit more help. With your patch applied, and thus with no risk
of crashes, what about adding, right before your clamp_t, something
like:
if (!bfqd->peak_rate)
pr_crit(<dump of all the variables involved in updating bfqd->peak_rate>);
Once the failure shows up (Murphy permitting), we might have hints to
the bug causing it.
Apart from that, I have no problem with patches that make bfq more
robust, even in a sort of black-box way.

This rate estimator is already works as a black-box with smoothing and
low-pass filter inside.

Actually, it is just what you say: a low-pass filter, which works by
deflating two non-null quantities (the previous and the current rate
rate), by and adding these quantities to each other (to get the new
estimated rate). In addition to being larger than 0, both quantities
are much lower than U32_MAX, so this computation should never yield
zero because of an overflow.

Even some occasional zeros read for the sampled rate (because of the
suspicion you express below), cannot lead to a null estimated rate.
In fact, this is a low-pass filter, meant to cut off occasional
spurious values.

I'm insisting on this part, just because, maybe, you say that it "acts
as a black-box" because there is something more, which I'm overlooking
here, and maybe the bug is exactly in the part I'm overlooking. But,
if this is not the case, I'll just stop now for this point.

It has limitations thus this is ok to declare
range of expected\sane results.


Correct, but if the odd value you cut off is a random value that
follows from a bug (and I see no other possibility, in view of the
above arguments), then the result may still be completely wrong. For
example:
- the device may have a rate of, say, 200 MB/s
- the computation of the new estimated rate yields 0
- you limit this result to ~0 MB/s but >0 MB/s; this avoids division
errors, but still leaves you with a completely wrong rate, and a
consequently inconsistent behavior of the scheduler

It would be nice to show estimated rate in sysfs to let user
diagnose whenever it is wrong for a long period of time.

Absolutely. This is actually already shown indirectly, through the
budget_max parameter, which is the number of sectors served at peak
rate during slice_sync.

Printing message all the time even ratelimited is a wrong idea.
If this should never happens - WARN_ONCE is more than enough.


I'm sorry, I was totally unclear here. My proposal for you is:
since this failure does occur in your system, please add, temporarily, the
printing of all involved values *all the times*, so that you can track down
the bug, and fix the actual problem leading to this division by zero.

What do you think?

I suspect crashes might be caused by bumpy timer which was affected by smi/nmi from mce.

Yep. The problem is that, for the estimated rate to go to zero, a
null value must be sampled for several consecutive times.

I'll try to find evidence.

Getting to the bottom of this would be really great.

Looking forward to your findings,
I guess no news yet. Anyway, your bug report has pushed me to think
over the representation of the rate. And I may have found a problem,
which might be the cause of the zeros you have seen.
With a little math, it comes out that, for a sampled rate below about
5 KB/s, 0 is stored in the variable 'rate'. The reason is that such
low values cannot be represented with the maximum precision currently
available in BFQ for rates. So, if these very low values are sampled
for a few consecutive times, then the estimated peak rate may actually
become zero.
In the system in which you have seen null estimated rates, may the
virtual or physical storage device be occasionally so slow?

I've found couple crashes from qemu-kvm virtual machines.
Their disks are slow for sure. Zero rate for a minute wouldn't be a surprise.


Ok, looks like we are converging. Hoping that it is ok for you, I
took the liberty to prepare a patch that only lower-bounds the
estimated rate to 1, and that, to prevent future problems of this
kind, adds comments to report the range of peak rates representable.
I have added you as signer of the patch, being this new patch just a
subset of yours.

If this patch is the right solution to the failure you detected, then
the latter should just go away.

Thank you,
Paolo

Ok. Thanks. I'll pick your patch into our kernel.


<4>[125253.240885] divide error: 0000 [#1] SMP PTI
<4>[125253.241031] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc 8021q garp mrp stp llc drm_kms_helper syscopyarea sysfillrect sysimgblt snd_hda_codec_generic fb_sys_fops snd_hda_intel ttm drm snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer snd soundcore i2c_piix4 ppdev kvm_intel kvm irqbypass input_leds parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
<4>[125253.242257] async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ata_generic pata_acpi floppy [last unloaded: netconsole]
<4>[125253.242537] CPU: 0 PID: 36766 Comm: push-client Not tainted 4.14.15-5 #1
<4>[125253.242731] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
<4>[125253.242928] task: ffff880210ed3500 task.stack: ffffc90002d78000
<4>[125253.243092] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
<4>[125253.243240] RSP: 0018:ffffc90002d7b9e0 EFLAGS: 00010046
<4>[125253.243357] RAX: 00000000000d59f8 RBX: ffff88023014af00 RCX: 000000000000001e
<4>[125253.243567] RDX: 0000000000000000 RSI: ffff8802302685a0 RDI: 0000000000000000
<4>[125253.243785] RBP: ffff880230268400 R08: 0000000000000001 R09: 0000000000000001
<4>[125253.244004] R10: ffff8802305341b0 R11: ffff880230268400 R12: ffff88023066cb60
<4>[125253.244222] R13: 0000000000000000 R14: ffff8802305341b0 R15: ffffc90002d7ba88
<4>[125253.244489] FS: 00007f74d59e6700(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
<4>[125253.244784] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[125253.245073] CR2: 00007f78e7261258 CR3: 000000021275e000 CR4: 00000000000006f0
<4>[125253.245406] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[125253.245696] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[125253.245961] Call Trace:
<4>[125253.246097] bfq_insert_requests+0xda7/0xf60
<4>[125253.246313] blk_mq_sched_insert_request+0xf9/0x150
<4>[125253.246520] blk_mq_make_request+0x1da/0x630
<4>[125253.246692] generic_make_request+0xfb/0x2d0
<4>[125253.246847] ? submit_bio+0x73/0x150
<4>[125253.246984] submit_bio+0x73/0x150
<4>[125253.247229] xfs_submit_ioend+0x80/0x1a0 [xfs]
<4>[125253.247466] xfs_do_writepage+0x4d7/0x710 [xfs]
<4>[125253.247636] ? invalid_page_referenced_vma+0x90/0x90
<4>[125253.247906] write_cache_pages+0x23a/0x4e0
<4>[125253.248166] ? xfs_add_to_ioend+0x240/0x240 [xfs]
<4>[125253.248345] ? dequeue_entity+0x2f4/0xa90
<4>[125253.248596] xfs_vm_writepages+0x94/0xd0 [xfs]
<4>[125253.248793] do_writepages+0x17/0x70
<4>[125253.249020] __filemap_fdatawrite_range+0xa7/0xe0
<4>[125253.249269] file_write_and_wait_range+0x4c/0xa0
<4>[125253.249514] xfs_file_fsync+0x55/0x220 [xfs]
<4>[125253.249719] do_fsync+0x38/0x60
<4>[125253.249877] SyS_fsync+0xc/0x10
<4>[125253.250098] entry_SYSCALL_64_fastpath+0x20/0x83
<4>[125253.250285] RIP: 0033:0x7f74e92a6a7d
<4>[125253.250484] RSP: 002b:00007f74d59dcc90 EFLAGS: 00000293
<4>[125253.250485] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48
<1>[125253.251304] RIP: bfq_wr_duration.isra.15.part.16+0xc/0x40 RSP: ffffc90002d7b9e0

And this one definitely have timer issues.

<4>[69620.113311] hpet1: lost 1123 rtc interrupts
<4>[69725.670214] hpet1: lost 1257 rtc interrupts
<4>[69730.301092] hpet1: lost 295 rtc interrupts
<4>[69785.817074] hpet1: lost 804 rtc interrupts
<4>[70276.099325] hpet1: lost 1138 rtc interrupts
<4>[70338.962431] hpet1: lost 1274 rtc interrupts
<4>[70687.531960] hpet1: lost 314 rtc interrupts
<4>[70708.789681] hpet1: lost 1359 rtc interrupts
<4>[70892.699792] hpet1: lost 775 rtc interrupts
<4>[70985.631630] hpet1: lost 448 rtc interrupts
<4>[71045.952024] hpet1: lost 1109 rtc interrupts
<4>[71236.781185] hpet1: lost 1216 rtc interrupts
<4>[71253.521384] hpet1: lost 1071 rtc interrupts
<4>[71477.087387] hpet1: lost 563 rtc interrupts
<4>[71568.587212] hpet1: lost 357 rtc interrupts
<4>[71837.330897] hpet1: lost 705 rtc interrupts
<4>[72029.244809] hpet1: lost 1286 rtc interrupts
<4>[72251.173477] hpet1: lost 458 rtc interrupts
<4>[72398.366782] hpet1: lost 1172 rtc interrupts
<4>[73218.363935] hpet1: lost 249 rtc interrupts
<4>[73280.472760] hpet1: lost 1225 rtc interrupts
<4>[73458.506121] hpet1: lost 398 rtc interrupts
<4>[73504.331306] hpet1: lost 183 rtc interrupts
<4>[73596.270009] hpet1: lost 386 rtc interrupts
<4>[73858.989240] hpet1: lost 320 rtc interrupts
<4>[73903.326725] hpet1: lost 88 rtc interrupts
<4>[74170.760022] hpet1: lost 622 rtc interrupts
<4>[74517.835154] hpet1: lost 216 rtc interrupts
<4>[74609.630248] hpet1: lost 375 rtc interrupts
<4>[74884.937850] hpet1: lost 1126 rtc interrupts
<4>[74886.291818] hpet1: lost 86 rtc interrupts
<4>[74902.415909] hpet1: lost 1031 rtc interrupts
<4>[75134.991339] hpet1: lost 1140 rtc interrupts
<4>[75139.948051] hpet1: lost 316 rtc interrupts
<4>[75183.709393] hpet1: lost 50 rtc interrupts
<4>[75486.178692] hpet1: lost 115 rtc interrupts
<4>[75634.770692] hpet1: lost 1261 rtc interrupts
<4>[76322.311822] hpet1: lost 18 rtc interrupts
<4>[76941.322201] hpet1: lost 1128 rtc interrupts
<4>[77341.716930] hpet1: lost 884 rtc interrupts
<4>[77347.532830] hpet1: lost 371 rtc interrupts
<4>[77348.620826] hpet1: lost 68 rtc interrupts
<4>[77362.430004] divide error: 0000 [#1] SMP PTI
<4>[77362.430139] Modules linked in: netconsole configfs tcp_diag inet_diag unix_diag nf_log_ipv6 nf_log_common xt_LOG xt_u32 ip6t_REJECT nf_reject_ipv6 xt_conntrack xt_multiport ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_filter ip6_tables ipt_REJECT nf_reject_ipv4 xt_tcpudp xt_mark xt_owner iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_filter ip_tables x_tables nfsd auth_rpcgss nfs_acl nfs lockd grace sunrpc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm snd_timer drm_kms_helper snd syscopyarea sysfillrect sysimgblt fb_sys_fops ttm soundcore drm 8021q garp mrp stp llc ppdev i2c_piix4 kvm_intel input_leds kvm irqbypass parport_pc fuse lp parport xfs raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor
<4>[77362.431347] async_tx xor raid6_pq libcrc32c raid1 raid0 floppy multipath linear ata_generic pata_acpi [last unloaded: netconsole]
<4>[77362.431545] CPU: 0 PID: 39134 Comm: skynet.heartbea Not tainted 4.14.15-5 #1
<4>[77362.431720] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
<4>[77362.432013] task: ffff8800964adcc0 task.stack: ffffc900030bc000
<4>[77362.432185] RIP: 0010:bfq_wr_duration.isra.15.part.16+0xc/0x40
<4>[77362.432348] RSP: 0018:ffffc900030bfa90 EFLAGS: 00010046
<4>[77362.432552] RAX: 00000000000d59f8 RBX: ffff8801348d6900 RCX: 000000000000001e
<4>[77362.432725] RDX: 0000000000000000 RSI: ffff8801348c1da0 RDI: 0000000000000000
<4>[77362.432992] RBP: ffff8801348c1c00 R08: 0000000000000001 R09: 0000000000000001
<4>[77362.433214] R10: ffff8800b9eacf30 R11: ffff8801348c1c00 R12: ffff880134decb60
<4>[77362.433466] R13: 0000000000000000 R14: ffff8800b9eacf30 R15: ffffc900030bfb38
<4>[77362.433696] FS: 00007fe28c721700(0000) GS:ffff88013fc00000(0000) knlGS:0000000000000000
<4>[77362.433948] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4>[77362.434125] CR2: 0000000001370ec8 CR3: 0000000135b2c000 CR4: 00000000000006f0
<4>[77362.434328] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>[77362.434571] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4>[77362.434828] Call Trace:
<4>[77362.434997] bfq_insert_requests+0xda7/0xf60
<4>[77362.435150] blk_mq_sched_insert_request+0xf9/0x150
<4>[77362.435366] blk_mq_make_request+0x1da/0x630
<4>[77362.435516] generic_make_request+0xfb/0x2d0
<4>[77362.435701] ? submit_bio+0x73/0x150
<4>[77362.435827] submit_bio+0x73/0x150
<4>[77362.436100] xfs_submit_ioend+0x80/0x1a0 [xfs]
<4>[77362.436339] xfs_vm_writepages+0xa8/0xd0 [xfs]
<4>[77362.436549] do_writepages+0x17/0x70
<4>[77362.436692] __filemap_fdatawrite_range+0xa7/0xe0
<4>[77362.436884] filemap_write_and_wait_range+0x30/0x80
<4>[77362.437106] xfs_setattr_size+0x121/0x380 [xfs]
<4>[77362.437313] xfs_vn_setattr+0x6b/0x90 [xfs]
<4>[77362.437472] notify_change+0x2da/0x400
<4>[77362.437663] do_truncate+0x5b/0x90
<4>[77362.437842] do_sys_ftruncate.constprop.18+0x10b/0x170
<4>[77362.438033] ? SyS_write+0x66/0x90
<4>[77362.438251] entry_SYSCALL_64_fastpath+0x20/0x83
<4>[77362.438429] RIP: 0033:0x7fe293e8bd87
<4>[77362.438586] RSP: 002b:00007fe28c71fb28 EFLAGS: 00000246
<4>[77362.438587] Code: ff ff 8b 07 4d 29 c8 4c 0f af c0 e9 27 ff ff ff 48 89 c8 c3 90 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 06 31 d2 89 ff <48> f7 f7 48 89 c1 b8 b2 0c 00 00 48 81 f9 b2 0c 00 00 77 0f 48


Thanks,
Paolo
Paolo


Thanks a lot,
Paolo

update_thr_responsiveness_params(bfqd);

reset_computation: