Re: [PATCH 2/2] blk-throttle: fix off-by-one jiffies wait_time

From: Yu Kuai
Date: Sun Feb 23 2025 - 21:39:36 EST


Hi,

On 2025/02/22 20:16, Ming Lei wrote:
Hi Yukuai,

On Sat, Feb 22, 2025 at 05:28:23PM +0800, Yu Kuai wrote:
From: Yu Kuai <yukuai3@xxxxxxxxxx>

wait_time is based on jiffies, and the calculation is:

wait_time = extra_bytes * HZ / bps_limit;

If the wait time is not zero, the remainder is ignored, meaning the wait
time falls short by at most 1 jiffy; on the other hand, if the wait time
is zero, it is rounded up to 1 jiffy, meaning it is excessive by at most
1 jiffy.

This causes the blktests throtl/001 failure with CONFIG_HZ_100=y; fix
the problem by recording the remainder as debt, and repaying the debt
later.
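
To make the two rounding directions concrete, here is a minimal
userspace sketch of the calculation (not the kernel code; HZ and the
limits are hard-coded for illustration):

#include <stdio.h>

#define HZ 100ULL	/* CONFIG_HZ_100 */

/* Mirrors the old calculation: truncating division, then clamp to 1. */
static unsigned long long wait_jiffies(unsigned long long extra_bytes,
				       unsigned long long bps_limit)
{
	unsigned long long wait = extra_bytes * HZ / bps_limit;

	return wait ? wait : 1;	/* a zero wait is rounded up to 1 jiffy */
}

int main(void)
{
	unsigned long long bps_limit = 1024 * 1024;	/* 1MB/s */

	/* exact wait is ~0.4 jiffies, clamped up to 1: excessive */
	printf("%llu\n", wait_jiffies(4096, bps_limit));
	/* exact wait is ~1.95 jiffies, truncated to 1: falls short */
	printf("%llu\n", wait_jiffies(20 * 1024, bps_limit));
	return 0;
}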

After further analysis, I figured out that this one extra jiffy isn't the
only reason for the throtl/001 failure.

In blktests throtl/001, bps_limit is 1MB/sec, the BS is 4k,
CONFIG_HZ=100, and the default throttle slice is 2 jiffies (20ms):

- in 20ms, 5 bios can be submitted: 1024KB/50 = ~20KB (5 * 4KB)

- the 6th bio is throttled, and the wait calculated by
tg_within_bps_limit() is 1 jiffy

- given that all 6 bios are handled at jiffy A, the pending timer for
scheduling dispatch is programmed to fire at A + 1 (wait) + 2 (slice)

- when the pending timer expires, the 6th bio is submitted, and the
current slice is trimmed/reset since the throttled 6th bio has been dispatched.

Now, over the whole throttle slice period, 6 bios (24KB) are submitted
in 3 jiffies, so 256 bios will take (256/6) * 30ms = 1.3 sec.

But blktests throtl/001 should still pass, since the allowed deviation
is 0.5 sec and 1.3 < 1.5.

Actually there is another source of delay: it looks like the timer fires
one extra jiffy late when it is triggered, which can be observed reliably by:

bpftrace -e 'kfunc:throtl_pending_timer_fn { @timer_expire_delay = lhist(jiffies - args->t->expires, 0, 16, 1);}'

Then 256 bios will take (256/6) * 40ms = 1.7 sec, which matches the
observation in throtl/001.
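
As a quick cross-check of those numbers (a throwaway userspace sketch;
it only assumes the 256 x 4KB bios and 6-bio cycles described above):

#include <stdio.h>

int main(void)
{
	int bios = 256;			/* 1MB of 4KB bios */
	int bios_per_cycle = 6;		/* 5 within the slice + 1 throttled */

	/* 3 jiffies (30ms) per cycle without the timer delay */
	printf("%.2f sec\n", bios * 0.030 / bios_per_cycle);	/* 1.28 */
	/* 4 jiffies (40ms) per cycle with the extra-jiffy timer delay */
	printf("%.2f sec\n", bios * 0.040 / bios_per_cycle);	/* 1.71 */
	return 0;
}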

Thanks for the detailed explanation.

Yeah, killing one jiffy may make blktests throtl/001 pass, but, ...


Reported-and-tested-by: Ming Lei <ming.lei@xxxxxxxxxx>
Closes: https://lore.kernel.org/all/20250220111735.1187999-1-ming.lei@xxxxxxxxxx/
Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
---
block/blk-throttle.c | 24 +++++++++++++++++-------
block/blk-throttle.h | 12 ++++++++----
2 files changed, 25 insertions(+), 11 deletions(-)

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 0096e486b1e3..3828c6535605 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -703,9 +703,10 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
}
static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
- u64 bps_limit)
+ u64 bps_limit, bool *has_debt)
{
bool rw = bio_data_dir(bio);
+ long long carryover_bytes;
long long bytes_allowed;
u64 extra_bytes;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -730,10 +731,16 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
- jiffy_wait = div64_u64(extra_bytes * HZ, bps_limit);
-
- if (!jiffy_wait)
- jiffy_wait = 1;
+ jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
+ if (carryover_bytes) {
+ /*
+ * If extra_bytes is not divisible, the remainder is recorded as
+ * debt. Caller must ensure the current slice has at least 1
+ * more jiffy to repay the debt.
+ */
+ *has_debt = true;
+ tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
+ }

Thinking about it further, it may not be good to use ->carryover_bytes[]:

- if tg_within_bps_limit() returns 0 and the bio is dispatched
immediately, throtl_trim_slice() is called and ->carryover_bytes[]
is cleared.

I think this should not happen, because:

if the slice is extended to account for carryover_bytes,
throtl_slice_used() is checked first in throtl_trim_slice().

- if tg_within_bps_limit() returns >0, this bio will be throttled, and
tg_within_bps_limit() may be called more than one time, so
tg->carryover_bytes[] could be over-counted.

Yes, carryover_bytes can be over-counted with this patch. For now, the
only way I can think of to fix this is to handle carryover_bytes from
the callers of tg_may_dispatch():
- tg_within_limit(): handle only if the bio is dispatched directly;
- tg_update_disptime(): never handle;
- throtl_dispatch_tg(): always handle;

The idea is to handle it only when the bio is dispatched, like the
following patch (not tested yet):

[yukuai@localhost linux-next]$ git diff
diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 3828c6535605..20f6bab07960 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -703,10 +703,9 @@ static unsigned long tg_within_iops_limit(struct throtl_grp *tg, struct bio *bio
}

static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
- u64 bps_limit, bool *has_debt)
+ u64 bps_limit, long long *carryover_bytes)
{
bool rw = bio_data_dir(bio);
- long long carryover_bytes;
long long bytes_allowed;
u64 extra_bytes;
unsigned long jiffy_elapsed, jiffy_wait, jiffy_elapsed_rnd;
@@ -731,16 +730,7 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,

/* Calc approx time to dispatch */
extra_bytes = tg->bytes_disp[rw] + bio_size - bytes_allowed;
- jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, &carryover_bytes);
- if (carryover_bytes) {
- /*
- * If extra_bytes is not divisible, the remainder is recorded as
- * debt. Caller must ensure the current slice has at least 1
- * more jiffy to repay the debt.
- */
- *has_debt = true;
- tg->carryover_bytes[rw] -= div64_u64(carryover_bytes, HZ);
- }
+ jiffy_wait = div64_u64_rem(extra_bytes * HZ, bps_limit, carryover_bytes);

/*
* This wait time is without taking into consideration the rounding
@@ -755,13 +745,12 @@ static unsigned long tg_within_bps_limit(struct throtl_grp *tg, struct bio *bio,
* of jiffies to wait before this bio is with-in IO rate and can be dispatched
*/
static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
- unsigned long *wait)
+ unsigned long *wait, long long *carryover_bytes)
{
bool rw = bio_data_dir(bio);
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
u64 bps_limit = tg_bps_limit(tg, rw);
u32 iops_limit = tg_iops_limit(tg, rw);
- bool has_debt = false;

/*
* Currently whole state machine of group depends on first bio
@@ -792,20 +781,18 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
else
throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);

- bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
+ bps_wait = tg_within_bps_limit(tg, bio, bps_limit, carryover_bytes);
iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
if (bps_wait + iops_wait == 0) {
if (wait)
*wait = 0;
- if (has_debt)
- throtl_extend_slice(tg, rw, jiffies + 1);
return true;
}

max_wait = max(bps_wait, iops_wait);
if (wait)
*wait = max_wait;
- throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
+ throtl_extend_slice(tg, rw, jiffies + max_wait);

return false;
}
@@ -858,19 +845,33 @@ static void throtl_add_bio_tg(struct bio *bio, struct throtl_qnode *qn,
throtl_enqueue_tg(tg);
}

+static void handle_tg_carryover_bytes(struct throtl_grp *tg,
+ long long carryover_bytes, int rw)
+{
+ if (carryover_bytes == 0)
+ return;
+ /*
+ * IO is dispatched without waiting for @carryover_bytes, make sure
+ * the current slice has 1 more jiffy to repay the debt.
+ */
+ tg->carryover_bytes[rw] -= carryover_bytes;
+ throtl_extend_slice(tg, rw, tg->slice_end[rw] + 1);
+}
+
static void tg_update_disptime(struct throtl_grp *tg)
{
struct throtl_service_queue *sq = &tg->service_queue;
unsigned long read_wait = -1, write_wait = -1, min_wait = -1, disptime;
+ long long carryover_bytes = 0;
struct bio *bio;

bio = throtl_peek_queued(&sq->queued[READ]);
if (bio)
- tg_may_dispatch(tg, bio, &read_wait);
+ tg_may_dispatch(tg, bio, &read_wait, &carryover_bytes);

bio = throtl_peek_queued(&sq->queued[WRITE]);
if (bio)
- tg_may_dispatch(tg, bio, &write_wait);
+ tg_may_dispatch(tg, bio, &write_wait, &carryover_bytes);

min_wait = min(read_wait, write_wait);
disptime = jiffies + min_wait;
@@ -943,12 +944,15 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
unsigned int nr_reads = 0, nr_writes = 0;
unsigned int max_nr_reads = THROTL_GRP_QUANTUM * 3 / 4;
unsigned int max_nr_writes = THROTL_GRP_QUANTUM - max_nr_reads;
+ long long carryover_bytes = 0;
struct bio *bio;

/* Try to dispatch 75% READS and 25% WRITES */

while ((bio = throtl_peek_queued(&sq->queued[READ])) &&
- tg_may_dispatch(tg, bio, NULL)) {
+ tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+ handle_tg_carryover_bytes(tg, carryover_bytes, READ);
+ carryover_bytes = 0;

tg_dispatch_one_bio(tg, READ);
nr_reads++;
@@ -958,7 +962,9 @@ static int throtl_dispatch_tg(struct throtl_grp *tg)
}

while ((bio = throtl_peek_queued(&sq->queued[WRITE])) &&
- tg_may_dispatch(tg, bio, NULL)) {
+ tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+ handle_tg_carryover_bytes(tg, carryover_bytes, WRITE);
+ carryover_bytes = 0;

tg_dispatch_one_bio(tg, WRITE);
nr_writes++;
@@ -1613,11 +1619,18 @@ void blk_throtl_cancel_bios(struct gendisk *disk)

static bool tg_within_limit(struct throtl_grp *tg, struct bio *bio, bool rw)
{
+ long long carryover_bytes = 0;
+
/* throtl is FIFO - if bios are already queued, should queue */
if (tg->service_queue.nr_queued[rw])
return false;

- return tg_may_dispatch(tg, bio, NULL);
+ if (tg_may_dispatch(tg, bio, NULL, &carryover_bytes)) {
+ handle_tg_carryover_bytes(tg, carryover_bytes, rw);
+ return true;
+ }
+
+ return false;
}

static void tg_dispatch_in_debt(struct throtl_grp *tg, struct bio *bio, bool rw)


Actually this patch turns tg_within_bps_limit() into a stateful function...

/*
* This wait time is without taking into consideration the rounding
@@ -754,6 +761,7 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
unsigned long bps_wait = 0, iops_wait = 0, max_wait = 0;
u64 bps_limit = tg_bps_limit(tg, rw);
u32 iops_limit = tg_iops_limit(tg, rw);
+ bool has_debt = false;
/*
* Currently whole state machine of group depends on first bio
@@ -784,18 +792,20 @@ static bool tg_may_dispatch(struct throtl_grp *tg, struct bio *bio,
else
throtl_extend_slice(tg, rw, jiffies + tg->td->throtl_slice);
- bps_wait = tg_within_bps_limit(tg, bio, bps_limit);
+ bps_wait = tg_within_bps_limit(tg, bio, bps_limit, &has_debt);
iops_wait = tg_within_iops_limit(tg, bio, iops_limit);
if (bps_wait + iops_wait == 0) {
if (wait)
*wait = 0;
+ if (has_debt)
+ throtl_extend_slice(tg, rw, jiffies + 1);
return true;
}
max_wait = max(bps_wait, iops_wait);
if (wait)
*wait = max_wait;
- throtl_extend_slice(tg, rw, jiffies + max_wait);
+ throtl_extend_slice(tg, rw, jiffies + max_wait + has_debt);
return false;
}
diff --git a/block/blk-throttle.h b/block/blk-throttle.h
index 1a36d1278eea..56dcb5ce412f 100644
--- a/block/blk-throttle.h
+++ b/block/blk-throttle.h
@@ -110,10 +110,14 @@ struct throtl_grp {
unsigned int last_io_disp[2];
/*
- * The following two fields are updated when new configuration is
- * submitted while some bios are still throttled, they record how many
- * bytes/ios are waited already in previous configuration, and they will
- * be used to calculate wait time under new configuration.
+ * The following two fields are updated when:
+ * 1) new configuration is submitted while some bios are still
+ * throttled, they record how many bytes/ios are waited already in
+ * previous configuration;
+ * 2) IOs which may cause priority inversions are dispatched while tg is
+ * over limit, these IOs will be dispatched directly;
+ * 3) While calculating wait_time for IO, extra_bytes * HZ is not
+ * divisible by bps_limit, the remainder will be recorded;
*/

blk-throttle uses a token bucket algorithm, and the implementation
shouldn't be sensitive to the above two factors, because the bps rate is
controlled over the whole bucket (slice). Meanwhile, it is still tricky
to maintain ->carryover_bytes during the slice cycle, and that can't
cover the timer delay either.
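
For reference, the token bucket idea referred to here, in a generic
sketch (not the blk-throttle implementation): the rate is enforced over
the whole bucket window, so per-calculation rounding washes out as long
as the window is not reset too eagerly.

struct token_bucket {
	unsigned long long tokens;	/* bytes currently available */
	unsigned long long rate;	/* bytes refilled per tick */
	unsigned long long capacity;	/* bucket size, i.e. burst limit */
};

static void tb_refill(struct token_bucket *tb, unsigned long ticks)
{
	tb->tokens += (unsigned long long)ticks * tb->rate;
	if (tb->tokens > tb->capacity)
		tb->tokens = tb->capacity;
}

static int tb_consume(struct token_bucket *tb, unsigned long long bytes)
{
	if (tb->tokens < bytes)
		return 0;	/* caller must wait for a refill */
	tb->tokens -= bytes;
	return 1;
}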

Another way is to avoid trimming the slice too soon when carrying too
much debt; something like the following avoids this issue:

diff --git a/block/blk-throttle.c b/block/blk-throttle.c
index 8d149aff9fd0..a778ebbb6887 100644
--- a/block/blk-throttle.c
+++ b/block/blk-throttle.c
@@ -615,6 +615,14 @@ static inline void throtl_trim_slice(struct throtl_grp *tg, bool rw)
if (bytes_trim <= 0 && io_trim <= 0)
return;
+ if (tg_bps_limit(tg, rw) != U64_MAX) {
+ long long disp = tg->bytes_disp[rw];
+
+ if (bytes_trim > disp && (bytes_trim - disp) > disp / 2 && time_elapsed <
+ 8 * tg->td->throtl_slice)
+ return;
+ }

I'm not sure why we need this. Why would there be too much debt? Debt
from the remainder should be at most 1 jiffy, and new debt can only
happen after the old debt is paid.
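
That bound can be sanity-checked in isolation (a userspace sketch,
assuming bps_limit = 1MB/s and HZ = 100): the remainder of
extra_bytes * HZ / bps_limit is always below bps_limit, so the bytes
recorded as debt per calculation never exceed one jiffy's worth.

#include <assert.h>

#define HZ 100ULL

int main(void)
{
	unsigned long long bps_limit = 1024 * 1024;	/* 1MB/s */
	unsigned long long bytes_per_jiffy = bps_limit / HZ;
	unsigned long long extra_bytes;

	for (extra_bytes = 0; extra_bytes < 4 * bps_limit; extra_bytes++) {
		/* remainder returned by div64_u64_rem() in the patch */
		unsigned long long rem = (extra_bytes * HZ) % bps_limit;

		/* debt recorded per calculation is rem / HZ bytes */
		assert(rem / HZ <= bytes_per_jiffy);
	}
	return 0;
}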

If you mean that we don't clear the debt in time and the historical
debt accumulates too much, then we might want to trim the slice sooner
and add a new branch to clear the historical debt.

BTW, throtl_trim_slice() looks problematic:

- if (bytes_trim <= 0 && io_trim <= 0)
+ if (bytes_trim <= 0 || io_trim <= 0 ||
+ tg->bytes_disp[rw] < bytes_trim || tg->io_disp[rw] < io_trim)
return;

Otherwise, the debt can be ignored when bytes_disp[rw] or io_disp[rw]
is smaller than the amount to trim.

Thanks,
Kuai

+
tg->carryover_bytes[rw] = 0;
if ((long long)tg->bytes_disp[rw] >= bytes_trim)
tg->bytes_disp[rw] -= bytes_trim;


Thanks,
Ming

