Re: [RFC PATCH] badblocks: Improvement badblocks_set() for handling multiple ranges

From: Dan Williams
Date: Thu Dec 17 2020 - 22:26:29 EST


[ add Neil, original gooodguy who wrote badblocks ]


On Thu, Dec 3, 2020 at 9:16 AM Coly Li <colyli@xxxxxxx> wrote:
>
> Recently I received a bug report that current badblocks code does not
> properly handle multiple ranges. For example,
> badblocks_set(bb, 32, 1, true);
> badblocks_set(bb, 34, 1, true);
> badblocks_set(bb, 36, 1, true);
> badblocks_set(bb, 32, 12, true);
> Then indeed badblocks_show() reports,
> 32 3
> 36 1
> But the expected bad blocks table should be,
> 32 12
> Obviously only the first 2 ranges are merged and badblocks_set() returns
> and ignores the rest setting range.
>
> This behavior is improper, if the caller of badblocks_set() wants to set
> a range of blocks into bad blocks table, all of the blocks in the range
> should be handled even the previous part encountering failure.
>
> The desired way to set bad blocks range by badblocks_set() is,
> - Set as many as blocks in the setting range into bad blocks table.
> - Merge the bad blocks ranges and occupy as less as slots in the bad
> blocks table.
> - Fast.
>
> Indeed the above proposal is complicated, especially with the following
> restrictions,
> - The setting bad blocks range can be ackknowledged or not acknowledged.

s/ackknowledged/acknowledged/

I'd run checkpatch --codespell for future versions...

> - The bad blocks table size is limited.
> - Memory allocation should be avoided.
>
> This patch is an initial effort to improve badblocks_set() for setting
> bad blocks range when it covers multiple already set bad ranges in the
> bad blocks table, and to do it as fast as possible.
>
> The basic idea of the patch is to categorize all possible bad blocks
> range setting combinationsinto to much less simplified and more less
> special conditions. Inside badblocks_set() there is an implicit loop
> composed by jumping between labels 're_insert' and 'update_sectors'. No
> matter how large the setting bad blocks range is, in every loop just a
> minimized range from the head is handled by a pre-defined behavior from
> one of the categorized conditions. The logic is simple and code flow is
> manageable.
>
> This patch is unfinished yet, it only improves badblocks_set() and not
> touch badblocks_clear() and badblocks_show() yet. I post it earlier
> because this patch will be large (more then 1000 lines of change), I
> want more people to give me comments earlier before I go too far away.
>

I wonder if this isn't indication that the base data structure should
be replaced... but I have not had a chance to devote deeper thought to
this.


> The code logic is tested as user space programmer, this patch passes
> compiling but not tested in kernel mode yet. Right now it is only for
> RFC purpose. I will post tested patch in further versions.
>
> Thank you in advance for any review or comments on this patch.
>
> Signed-off-by: Coly Li <colyli@xxxxxxx>
> ---
> block/badblocks.c | 1041 ++++++++++++++++++++++++++++++-------
> include/linux/badblocks.h | 33 ++
> 2 files changed, 881 insertions(+), 193 deletions(-)
>
> diff --git a/block/badblocks.c b/block/badblocks.c
> index d39056630d9c..04ccae95777d 100644
> --- a/block/badblocks.c
> +++ b/block/badblocks.c
> @@ -5,6 +5,8 @@
> * - Heavily based on MD badblocks code from Neil Brown
> *
> * Copyright (c) 2015, Intel Corporation.
> + *
> + * Improvement for handling multiple ranges by Coly Li <colyli@xxxxxxx>
> */
>
> #include <linux/badblocks.h>
> @@ -16,114 +18,612 @@
> #include <linux/types.h>
> #include <linux/slab.h>
>
> -/**
> - * badblocks_check() - check a given range for bad sectors
> - * @bb: the badblocks structure that holds all badblock information
> - * @s: sector (start) at which to check for badblocks
> - * @sectors: number of sectors to check for badblocks
> - * @first_bad: pointer to store location of the first badblock
> - * @bad_sectors: pointer to store number of badblocks after @first_bad
> +/*
> + * The purpose of badblocks set/clear is to manage bad blocks ranges which are
> + * identified by LBA addresses.
> *
> - * We can record which blocks on each device are 'bad' and so just
> - * fail those blocks, or that stripe, rather than the whole device.
> - * Entries in the bad-block table are 64bits wide. This comprises:
> - * Length of bad-range, in sectors: 0-511 for lengths 1-512
> - * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> - * A 'shift' can be set so that larger blocks are tracked and
> - * consequently larger devices can be covered.
> - * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + * When the caller of badblocks_set() wants to set a range of bad blocks, the
> + * setting range can be acked or unacked. And the setting range may merge,
> + * overwrite, skip the overlaypped already set range, depends on who they are
> + * overlapped or adjacent, and the acknowledgment type of the ranges. It can be
> + * more complicated when the setting range covers multiple already set bad block
> + * ranges, with restritctions of maximum length of each bad range and the bad
> + * table space limitation.
> *
> - * Locking of the bad-block table uses a seqlock so badblocks_check
> - * might need to retry if it is very unlucky.
> - * We will sometimes want to check for bad blocks in a bi_end_io function,
> - * so we use the write_seqlock_irq variant.
> + * It is difficut and unnecessary to take care of all the possible situations,
> + * for setting a large range of bad blocks, we can handle it by dividing the
> + * large range into smaller ones when encounter overlap, max range length or
> + * bad table full conditions. Every time only a smaller piece of the bad range
> + * is handled with a limited number of conditions how it is interacted with
> + * possible overlapped or adjacent already set bad block ranges. Then the hard
> + * complicated problem can be much simpler to habndle in proper way.
> *
> - * When looking for a bad block we specify a range and want to
> - * know if any block in the range is bad. So we binary-search
> - * to the last range that starts at-or-before the given endpoint,
> - * (or "before the sector after the target range")
> - * then see if it ends after the given start.
> + * When setting a range of bad blocks to the bad table, the simplified situations
> + * to be considered are, (The already set bad blocks ranges are naming with
> + * prefix E, and the setting bad blocks range is naming with prefix S)
> + *
> + * 1) A setting range is not overlapped or adjacent to any other already set bad
> + * block range.
> + * +--------+
> + * | S |
> + * +--------+
> + * +-------------+ +-------------+
> + * | E1 | | E2 |
> + * +-------------+ +-------------+
> + * For this situation if the bad blocks table is not full, just allocate a
> + * free slot from the bad blocks table to mark the setting range S. The
> + * result is,
> + * +-------------+ +--------+ +-------------+
> + * | E1 | | S | | E2 |
> + * +-------------+ +--------+ +-------------+
> + * 2) A setting range starts exactly at a start LBA of an already set bad blocks
> + * range.
> + * 2.1) The setting range size < already set range size
> + * +--------+
> + * | S |
> + * +--------+
> + * +-------------+
> + * | E |
> + * +-------------+
> + * 2.1.1) If S and E are both acked or unacked range, the setting range S can
> + * be merged into existing bad range E. The result is,
> + * +-------------+
> + * | S |
> + * +-------------+
> + * 2.1.2) If S is uncked setting and E is acked, the setting will be dinied, and
> + * the result is,
> + * +-------------+
> + * | E |
> + * +-------------+
> + * 2.1.3) If S is acked setting and E is unacked, range S can overwirte on E.
> + * An extra slot from the bad blocks table will be allocated for S, and head
> + * of E will move to end of the inserted range E. The result is,
> + * +--------+----+
> + * | S | E |
> + * +--------+----+
> + * 2.2) The setting range size == already set range size
> + * 2.2.1) If S and E are both acked or unacked range, the setting range S can
> + * be merged into existing bad range E. The result is,
> + * +-------------+
> + * | S |
> + * +-------------+
> + * 2.2.2) If S is uncked setting and E is acked, the setting will be dinied, and
> + * the result is,
> + * +-------------+
> + * | E |
> + * +-------------+
> + * 2.2.3) If S is acked setting and E is unacked, range S can overwirte all of
> + bad blocks range E. The result is,
> + * +-------------+
> + * | S |
> + * +-------------+
> + * 2.3) The setting range size > already set range size
> + * +-------------------+
> + * | S |
> + * +-------------------+
> + * +-------------+
> + * | E |
> + * +-------------+
> + * For such situation, the setting range S can be treated as two parts, the
> + * first part (S1) is as same size as the already set range E, the second
> + * part (S2) is thre rest of setting range.
> + * +-------------+-----+ +-------------+ +-----+
> + * | S1 | S2 | | S1 | | S2 |
> + * +-------------+-----+ ===> +-------------+ +-----+
> + * +-------------+ +-------------+
> + * | E | | E |
> + * +-------------+ +-------------+
> + * Now we only focus on how to handle the setting range S1 and already set
> + * range E, which are already explained in 1.2), for the rest S2 it will be
> + * handled later in next loop.
> + * 3) A setting range starts before the start LBA of an already set bad blocks
> + * range.
> + * +-------------+
> + * | S |
> + * +-------------+
> + * +-------------+
> + * | E |
> + * +-------------+
> + * For this situation, the setting range S can be divided into two parts, the
> + * first (S1) ends at the start LBA of already set range E, the second part
> + * (S2) starts exactly at a start LBA of the already set range E.
> + * +----+---------+ +----+ +---------+
> + * | S1 | S2 | | S1 | | S2 |
> + * +----+---------+ ===> +----+ +---------+
> + * +-------------+ +-------------+
> + * | E | | E |
> + * +-------------+ +-------------+
> + * Now only the first part S1 should be handled in this loop, which is in
> + * similar condition as 1). The rest part S2 has exact same start LBA address
> + * of the already set range E, they will be handled in next loop in one of
> + * situations in 2).
> + * 4) A setting range starts after the start LBA of an already set bad blocks
> + * range.
> + * 4.1) If the setting range S exactly matches the tail part of already set bad
> + * blocks range E, like the folowing chart shows,
> + * +---------+
> + * | S |
> + * +---------+
> + * +-------------+
> + * | E |
> + * +-------------+
> + * 4.1.1) If range S and E have same ackknowledg value (both acked or unacked),
> + * they will be merged into one, the result is,
> + * +-------------+
> + * | S |
> + * +-------------+
> + * 4.1.2) If range E is acked and the setting range S is unacked, the setting
> + * request of S will be rejected, the result is,
> + * +-------------+
> + * | E |
> + * +-------------+
> + * 4.1.3) If range E is unacked, and the setting range S is acked, then S may
> + * overwrite the overlapped range of E, the result is,
> + * +---+---------+
> + * | E | S |
> + * +---+---------+
> + * 4.2) If the setting range S stays in middle of an already set range E, like
> + * the following chart shows,
> + * +----+
> + * | S |
> + * +----+
> + * +--------------+
> + * | E |
> + * +--------------+
> + * 4.2.1) If range S and E have same ackknowledg value (both acked or unacked),
> + * they will be merged into one, the result is,
> + * +--------------+
> + * | S |
> + * +--------------+
> + * 4.2.2) If range E is acked and the setting range S is unacked, the setting
> + * request of S will be rejected, the result is also,
> + * +--------------+
> + * | E |
> + * +--------------+
> + * 4.2.3) If range E is unacked, and the setting range S is acked, then S will
> + * inserted into middle of E and split previous range E into twp parts (E1
> + * and E2), the result is,
> + * +----+----+----+
> + * | E1 | S | E2 |
> + * +----+----+----+
> + * 4.3) If the setting bad blocks range S is overlapped with an already set bad
> + * blocks range E. The range S starts after the start LBA of range E, and
> + * ends after the end LBA of range E, as the following chart shows,
> + * +-------------------+
> + * | S |
> + * +-------------------+
> + * +-------------+
> + * | E |
> + * +-------------+
> + * For this situation the range S can be divided into two parts, the first
> + * part (S1) ends at end range E, and the second part (S2) has rest range of
> + * origin S.
> + * +---------+---------+ +---------+ +---------+
> + * | S1 | S2 | | S1 | | S2 |
> + * +---------+---------+ ===> +---------+ +---------+
> + * +-------------+ +-------------+
> + * | E | | E |
> + * +-------------+ +-------------+
> + * Now in this loop the setting range S1 and already set range E can be
> + * handled as the situations 4), the rest range S2 will be handled in next
> + * loop and ignored in this loop.
> + * 5) A setting bad blocks range S is adjacent to one or more already set bad
> + * blocks range(s), and they are all acked or unacked range.
> + * 5.1) Front merge: If the already set bad blocks range E is before setting
> + * range S and they are adjacent,
> + * +------+
> + * | S |
> + * +------+
> + * +-------+
> + * | E |
> + * +-------+
> + * 5.1.1) When total size of range S and E <= BB_MAX_LEN, and their acknowledge
> + * values are same, the setting range S can front merges into range E. The
> + * result is,
> + * +--------------+
> + * | S |
> + * +--------------+
> + * 5.1.2) Otherwise these two ranges cannot merge, just insert the setting
> + * range S right after already set range E into the bad blocks table. The
> + * result is,
> + * +--------+------+
> + * | E | S |
> + * +--------+------+
> + * 6) Special cases which above conditions cannot handle
> + * 6.1) Multiple already set ranges may merge into less ones in a full bad table
> + * +-------------------------------------------------------+
> + * | S |
> + * +-------------------------------------------------------+
> + * |<----- BB_MAX_LEN ----->|
> + * +-----+ +-----+ +-----+
> + * | E1 | | E2 | | E3 |
> + * +-----+ +-----+ +-----+
> + * In the above example, when the bad blocks table is full, inserting the
> + * first part of setting range S will fail because no more available slot
> + * can be allocated from bad blocks table. In this situation a proper
> + * setting method should be go though all the setting bad blocks range and
> + * look for chance to merge already set ranges into less ones. When there
> + * is available slot from bad blocks table, re-try again to handle more
> + * setting bad blocks ranges as many as possible.
> + * +------------------------+
> + * | S3 |
> + * +------------------------+
> + * |<----- BB_MAX_LEN ----->|
> + * +-----+-----+-----+---+-----+--+
> + * | S1 | S2 |
> + * +-----+-----+-----+---+-----+--+
> + * The above chart shows although the first part (S3) cannot be inserted due
> + * to no-space in bad blocks table, but the following E1, E2 and E3 ranges
> + * can be merged with rest part of S into less range S1 and S2. Now there is
> + * 1 free slot in bad blocks table.
> + * +------------------------+-----+-----+-----+---+-----+--+
> + * | S3 | S1 | S2 |
> + * +------------------------+-----+-----+-----+---+-----+--+
> + * Since the bad blocks table is not full anymore, re-try again for the
> + * origin setting range S. Now the setting range S3 can be inserted into the
> + * bad blocks table with previous freed slot from multiple ranges merge.
> + * 6.2) Front merge after overwrite
> + * In the following example, in bad blocks table, E1 is an acked bad blocks
> + * range and E2 is an unacked bad blocks range, therefore they are not able
> + * to merge into a larger range. The setting bad blocks range S is acked,
> + * therefore part of E2 can be overwritten by S.
> + * +--------+
> + * | S | acknowledged
> + * +--------+ S: 1
> + * +-------+-------------+ E1: 1
> + * | E1 | E2 | E2: 0
> + * +-------+-------------+
> + * With previosu simplified routines, after overwiting part of E2 with S,
> + * the bad blocks table should be (E3 is remaining part of E2 which is not
> + * overwritten by S),
> + * acknowledged
> + * +-------+--------+----+ S: 1
> + * | E1 | S | E3 | E1: 1
> + * +-------+--------+----+ E3: 0
> + * The above result is correct but not perfect. Range E1 and S in the bad
> + * blocks table are all acked, merging them into a larger one range may
> + * occupy less bad blocks table space and make badblocks_check() faster.
> + * Therefore in such situation, after overwiting range S, the previous range
> + * E1 should be checked for possible front combination. Then the ideal
> + * result can be,
> + * +----------------+----+ acknowledged
> + * | E1 | E3 | E1: 1
> + * +----------------+----+ E3: 0
> + * 6.3) Behind merge: If the already set bad blocks range E is behind the setting
> + * range S and they are adjacent. Normally we don't need to care about this
> + * because front merge handles this while going though range S from head to
> + * tail, except for the tail part of range S. When the setting range S are
> + * fully handled, all the above simplified routine doesn't check whether the
> + * tail LBA of range S is adjacent to the next already set range and not able
> + * to them if they are mergeable.
> + * +------+
> + * | S |
> + * +------+
> + * +-------+
> + * | E |
> + * +-------+
> + * For the above special stiuation, when the setting range S are all handled
> + * and the loop ends, an extra check is necessary for whether next already
> + * set range E is right after S and mergeable.
> + * 6.2.1) When total size of range E and S <= BB_MAX_LEN, and their acknowledge
> + * values are same, the setting range S can behind merges into range E. The
> + * result is,
> + * +--------------+
> + * | S |
> + * +--------------+
> + * 6.2.2) Otherwise these two ranges cannot merge, just insert the setting range
> + * S infront of the already set range E in the bad blocks table. The result
> + * is,
> + * +------+-------+
> + * | S | E |
> + * +------+-------+
> + *
> + * All the above 5 simplified situations and 3 special cases may cover 99%+ of
> + * the bad block range setting conditions. Maybe there is some rare corner case
> + * is not considered and optimized, it won't hurt if badblocks_set() fails due
> + * to no space, or some ranges are not merged to save bad blocks table space.
> + *
> + * Inside badblocks_set() each loop starts by jumping to re_insert label, every
> + * time for the new loop prev_badblocks() is called to find an already set range
> + * which starts before or at current setting range. Since the setting bad blocks
> + * range is handled from head to tail, most of the cases it is unnecessary to do
> + * the binary search inside prev_badblocks(), it is possible to provide a hint
> + * to prev_badblocks() for a fast path, then the expensive binary search can be
> + * avoided. In my test with the hint to prev_badblocks(), except for the first
> + * loop, all rested calls to prev_badblocks() can go into the fast path and
> + * return correct bad blocks table index immediately.
> *
> - * Return:
> - * 0: there are no known bad blocks in the range
> - * 1: there are known bad block which are all acknowledged
> - * -1: there are bad blocks which have not yet been acknowledged in metadata.
> - * plus the start/length of the first bad section we overlap.
> */
> -int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> - sector_t *first_bad, int *bad_sectors)
> +
> +static int prev_by_hint(struct badblocks *bb, sector_t s, int hint)
> {
> - int hi;
> - int lo;
> u64 *p = bb->page;
> - int rv;
> - sector_t target = s + sectors;
> - unsigned seq;
> + int ret = -1;
> + int hint_end = hint + 2;
>
> - if (bb->shift > 0) {
> - /* round the start down, and the end up */
> - s >>= bb->shift;
> - target += (1<<bb->shift) - 1;
> - target >>= bb->shift;
> - sectors = target - s;
> + while ((hint < hint_end) && ((hint + 1) <= bb->count) &&
> + (BB_OFFSET(p[hint]) <= s)) {
> + if ((hint + 1) == bb->count || BB_OFFSET(p[hint + 1]) > s) {
> + ret = hint;
> + break;
> + }
> + hint++;
> + }
> +
> + return ret;
> +}
> +
> +/* find the range starts at-or-before bad->start */
> +static int prev_badblocks(struct badblocks *bb, struct bad_context *bad,
> + int hint)
> +{
> + u64 *p;
> + int lo, hi;
> + sector_t s = bad->start;
> + int ret = -1;
> +
> + if (!bb->count)
> + goto out;
> +
> + if (hint >= 0) {
> + ret = prev_by_hint(bb, s, hint);
> + if (ret >= 0)
> + goto out;
> }
> - /* 'target' is now the first block after the bad range */
>
> -retry:
> - seq = read_seqbegin(&bb->lock);
> lo = 0;
> - rv = 0;
> hi = bb->count;
> + p = bb->page;
>
> - /* Binary search between lo and hi for 'target'
> - * i.e. for the last range that starts before 'target'
> - */
> - /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
> - * are known not to be the last range before target.
> - * VARIANT: hi-lo is the number of possible
> - * ranges, and decreases until it reaches 1
> - */
> while (hi - lo > 1) {
> - int mid = (lo + hi) / 2;
> + int mid = (lo + hi)/2;
> sector_t a = BB_OFFSET(p[mid]);
>
> - if (a < target)
> - /* This could still be the one, earlier ranges
> - * could not.
> - */
> + if (a <= s)
> lo = mid;
> else
> - /* This and later ranges are definitely out. */
> hi = mid;
> }
> - /* 'lo' might be the last that started before target, but 'hi' isn't */
> - if (hi > lo) {
> - /* need to check all range that end after 's' to see if
> - * any are unacknowledged.
> +
> + if (BB_OFFSET(p[lo]) <= s)
> + ret = lo;
> +out:
> + return ret;
> +}
> +
> +static int can_merge_behind(struct badblocks *bb, struct bad_context *bad,
> + int behind)
> +{
> + u64 *p = bb->page;
> + sector_t s = bad->start;
> + sector_t sectors = bad->len;
> + int ack = bad->ack;
> +
> + if ((s <= BB_OFFSET(p[behind])) &&
> + ((s + sectors) >= BB_OFFSET(p[behind])) &&
> + ((BB_END(p[behind]) - s) <= BB_MAX_LEN) &&
> + BB_ACK(p[behind]) == ack)
> + return true;
> + return false;
> +}
> +
> +static int behind_merge(struct badblocks *bb, struct bad_context *bad,
> + int behind)
> +{
> + u64 *p = bb->page;
> + sector_t s = bad->start;
> + sector_t sectors = bad->len;
> + int ack = bad->ack;
> + int merged = 0;
> +
> + WARN_ON(s > BB_OFFSET(p[behind]));
> + WARN_ON((s + sectors) < BB_OFFSET(p[behind]));
> +
> + if (s < BB_OFFSET(p[behind])) {
> + WARN_ON((BB_LEN(p[behind]) + merged) >= BB_MAX_LEN);
> +
> + merged = min_t(sector_t, sectors, BB_OFFSET(p[behind]) - s);
> + p[behind] = BB_MAKE(s, BB_LEN(p[behind]) + merged, ack);
> + } else {
> + merged = min_t(sector_t, sectors, BB_LEN(p[behind]));
> + }
> +
> + WARN_ON(merged == 0);
> +
> + return merged;
> +}
> +
> +static int can_merge_front(struct badblocks *bb, int prev,
> + struct bad_context *bad)
> +{
> + u64 *p = bb->page;
> + sector_t s = bad->start;
> + int ack = bad->ack;
> +
> + if (BB_ACK(p[prev]) == ack &&
> + (s < BB_END(p[prev]) ||
> + (s == BB_END(p[prev]) && (BB_LEN(p[prev]) < BB_MAX_LEN))))
> + return true;
> + return false;
> +}
> +
> +static int front_merge(struct badblocks *bb, int prev, struct bad_context *bad)
> +{
> + int sectors = bad->len;
> + int s = bad->start;
> + int ack = bad->ack;
> + u64 *p = bb->page;
> + int merged = 0;
> +
> + WARN_ON(s > BB_END(p[prev]));
> +
> + if (s < BB_END(p[prev])) {
> + merged = min_t(sector_t, sectors, BB_END(p[prev]) - s);
> + } else {
> + merged = min_t(sector_t, sectors, BB_MAX_LEN - BB_LEN(p[prev]));
> + if ((prev + 1) < bb->count &&
> + merged > (BB_OFFSET(p[prev + 1]) - BB_END(p[prev]))) {
> + merged = BB_OFFSET(p[prev + 1]) - BB_END(p[prev]);
> + }
> +
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]) + merged, ack);
> + }
> +
> + return merged;
> +}
> +
> +static int can_combine_front(struct badblocks *bb, int prev,
> + struct bad_context *bad)
> +{
> + u64 *p = bb->page;
> +
> + if ((BB_OFFSET(p[prev]) == bad->start) && (prev > 0) &&
> + (BB_LEN(p[prev - 1]) + BB_LEN(p[prev]) <= BB_MAX_LEN) &&
> + (BB_ACK(p[prev - 1]) == BB_ACK(p[prev])))
> + return true;
> + return false;
> +}
> +
> +static void front_combine(struct badblocks *bb, int prev)
> +{
> + u64 *p = bb->page;
> +
> + p[prev - 1] = BB_MAKE(BB_OFFSET(p[prev - 1]),
> + BB_LEN(p[prev - 1]) + BB_LEN(p[prev]),
> + BB_ACK(p[prev]));
> + if ((prev + 1) < bb->count)
> + memmove(p + prev, p + prev + 1, (bb->count - prev - 1) * 8);
> +}
> +
> +static int overlap_front(struct badblocks *bb, int front,
> + struct bad_context *bad)
> +{
> + u64 *p = bb->page;
> +
> + if (bad->start >= BB_OFFSET(p[front]) &&
> + bad->start < BB_END(p[front]))
> + return true;
> + return false;
> +}
> +
> +static int can_front_overwrite(struct badblocks *bb, int prev,
> + struct bad_context *bad, int *extra)
> +{
> + u64 *p = bb->page;
> + int len;
> +
> + WARN_ON(!overlap_front(bb, prev, bad));
> +
> + if (BB_ACK(p[prev]) >= bad->ack)
> + return false;
> +
> + if (BB_END(p[prev]) <= (bad->start + bad->len)) {
> + len = BB_END(p[prev]) - bad->start;
> + if (BB_OFFSET(p[prev]) == bad->start)
> + *extra = 0;
> + else
> + *extra = 1;
> +
> + bad->len = len;
> + } else {
> + if (BB_OFFSET(p[prev]) == bad->start)
> + *extra = 1;
> + else
> + /*
> + * prev range will be split into two, beside the overwritten
> + * one, an extra slot needed from bad table.
> */
> - while (lo >= 0 &&
> - BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> - if (BB_OFFSET(p[lo]) < target) {
> - /* starts before the end, and finishes after
> - * the start, so they must overlap
> - */
> - if (rv != -1 && BB_ACK(p[lo]))
> - rv = 1;
> - else
> - rv = -1;
> - *first_bad = BB_OFFSET(p[lo]);
> - *bad_sectors = BB_LEN(p[lo]);
> - }
> - lo--;
> + *extra = 2;
> + }
> +
> + if ((bb->count + (*extra)) >= MAX_BADBLOCKS)
> + return false;
> +
> + return true;
> +}
> +
> +static int front_overwrite(struct badblocks *bb, int prev,
> + struct bad_context *bad, int extra)
> +{
> + u64 *p = bb->page;
> + int n = extra;
> + sector_t orig_end = BB_END(p[prev]);
> + int orig_ack = BB_ACK(p[prev]);
> +
> + switch (extra) {
> + case 0:
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]), BB_LEN(p[prev]),
> + bad->ack);
> + break;
> + case 1:
> + if (BB_OFFSET(p[prev]) == bad->start) {
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> + bad->len, bad->ack);
> + memmove(p + prev + 2, p + prev + 1,
> + (bb->count - prev - 1) * 8);
> + p[prev + 1] = BB_MAKE(bad->start + bad->len,
> + orig_end - BB_END(p[prev]),
> + orig_ack);
> + } else {
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> + BB_END(p[prev]) - bad->start,
> + BB_ACK(p[prev]));
> + memmove(p + prev + 1 + n, p + prev + 1,
> + (bb->count - prev - 1) * 8);
> + p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
> }
> + break;
> + case 2:
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> + BB_END(p[prev]) - bad->start,
> + BB_ACK(p[prev]));
> + memmove(p + prev + 1 + n, p + prev + 1,
> + (bb->count - prev - 1) * 8);
> + p[prev + 1] = BB_MAKE(bad->start, bad->len, bad->ack);
> + p[prev + 2] = BB_MAKE(BB_END(p[prev + 1]),
> + orig_end - BB_END(p[prev + 1]),
> + BB_ACK(p[prev]));
> + break;
> + default:
> + break;
> }
>
> - if (read_seqretry(&bb->lock, seq))
> - goto retry;
> + return bad->len;
> +}
>
> - return rv;
> +static int overlap_behind(struct badblocks *bb, struct bad_context *bad,
> + int behind)
> +{
> + u64 *p = bb->page;
> +
> + if (bad->start < BB_OFFSET(p[behind]) &&
> + (bad->start + bad->len) > BB_OFFSET(p[behind]))
> + return true;
> +
> + if (bad->start >= BB_OFFSET(p[behind]) &&
> + bad->start < BB_END(p[behind]))
> + return true;
> +
> + return false;
> +}
> +
> +static int insert_at(struct badblocks *bb, int at, struct bad_context *bad)
> +{
> + u64 *p = bb->page;
> + int sectors = bad->len;
> + int s = bad->start;
> + int ack = bad->ack;
> + int len;
> +
> + WARN_ON(badblocks_full(bb));
> +
> + len = min_t(sector_t, sectors, BB_MAX_LEN);
> + if (at < bb->count)
> + memmove(p + at + 1, p + at, (bb->count - at) * 8);
> + p[at] = BB_MAKE(s, len, ack);
> +
> + return len;
> }
> -EXPORT_SYMBOL_GPL(badblocks_check);
>
> static void badblocks_update_acked(struct badblocks *bb)
> {
> @@ -164,7 +664,10 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> int acknowledged)
> {
> u64 *p;
> - int lo, hi;
> + struct bad_context bad;
> + int prev = -1, hint = -1;
> + int len = 0, added = 0;
> + int retried = 0, space_desired = 0;
> int rv = 0;
> unsigned long flags;
>
> @@ -172,144 +675,187 @@ int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> /* badblocks are disabled */
> return 1;
>
> + if (sectors <= 0)
> + /* Invalid sectors number */
> + return 1;
> +
> if (bb->shift) {
> /* round the start down, and the end up */
> sector_t next = s + sectors;
>
> - s >>= bb->shift;
> - next += (1<<bb->shift) - 1;
> - next >>= bb->shift;
> + rounddown(s, bb->shift);
> + roundup(next, bb->shift);
> sectors = next - s;
> }
>
> write_seqlock_irqsave(&bb->lock, flags);
>
> + bad.orig_start = s;
> + bad.orig_len = sectors;
> + bad.ack = acknowledged;
> p = bb->page;
> - lo = 0;
> - hi = bb->count;
> - /* Find the last range that starts at-or-before 's' */
> - while (hi - lo > 1) {
> - int mid = (lo + hi) / 2;
> - sector_t a = BB_OFFSET(p[mid]);
>
> - if (a <= s)
> - lo = mid;
> - else
> - hi = mid;
> +re_insert:
> + bad.start = s;
> + bad.len = sectors;
> + len = 0;
> +
> + if (badblocks_empty(bb)) {
> + len = insert_at(bb, 0, &bad);
> + bb->count++;
> + added++;
> + goto update_sectors;
> }
> - if (hi > lo && BB_OFFSET(p[lo]) > s)
> - hi = lo;
>
> - if (hi > lo) {
> - /* we found a range that might merge with the start
> - * of our new range
> - */
> - sector_t a = BB_OFFSET(p[lo]);
> - sector_t e = a + BB_LEN(p[lo]);
> - int ack = BB_ACK(p[lo]);
> -
> - if (e >= s) {
> - /* Yes, we can merge with a previous range */
> - if (s == a && s + sectors >= e)
> - /* new range covers old */
> - ack = acknowledged;
> - else
> - ack = ack && acknowledged;
> -
> - if (e < s + sectors)
> - e = s + sectors;
> - if (e - a <= BB_MAX_LEN) {
> - p[lo] = BB_MAKE(a, e-a, ack);
> - s = e;
> + prev = prev_badblocks(bb, &bad, hint);
> +
> + /* start before all badblocks */
> + if (prev < 0) {
> + if (!badblocks_full(bb)) {
> + /* insert on the first */
> + if (bad.len > (BB_OFFSET(p[0]) - bad.start))
> + bad.len = BB_OFFSET(p[0]) - bad.start;
> + len = insert_at(bb, 0, &bad);
> + bb->count++;
> + added++;
> + hint = 0;
> + goto update_sectors;
> + }
> +
> + /* No sapce, try to merge */
> + if (overlap_behind(bb, &bad, 0)) {
> + if (can_merge_behind(bb, &bad, 0)) {
> + len = behind_merge(bb, &bad, 0);
> + added++;
> } else {
> - /* does not all fit in one range,
> - * make p[lo] maximal
> - */
> - if (BB_LEN(p[lo]) != BB_MAX_LEN)
> - p[lo] = BB_MAKE(a, BB_MAX_LEN, ack);
> - s = a + BB_MAX_LEN;
> + len = min_t(sector_t, BB_OFFSET(p[0]) - s, sectors);
> + space_desired = 1;
> }
> - sectors = e - s;
> + hint = 0;
> + goto update_sectors;
> }
> +
> + /* no table space and give up */
> + goto out;
> }
> - if (sectors && hi < bb->count) {
> - /* 'hi' points to the first range that starts after 's'.
> - * Maybe we can merge with the start of that range
> - */
> - sector_t a = BB_OFFSET(p[hi]);
> - sector_t e = a + BB_LEN(p[hi]);
> - int ack = BB_ACK(p[hi]);
> -
> - if (a <= s + sectors) {
> - /* merging is possible */
> - if (e <= s + sectors) {
> - /* full overlap */
> - e = s + sectors;
> - ack = acknowledged;
> - } else
> - ack = ack && acknowledged;
> -
> - a = s;
> - if (e - a <= BB_MAX_LEN) {
> - p[hi] = BB_MAKE(a, e-a, ack);
> - s = e;
> - } else {
> - p[hi] = BB_MAKE(a, BB_MAX_LEN, ack);
> - s = a + BB_MAX_LEN;
> +
> + /* in case p[prev-1] can be merged with p[prev] */
> + if (can_combine_front(bb, prev, &bad)) {
> + front_combine(bb, prev);
> + bb->count--;
> + added++;
> + hint = prev - 1;
> + goto update_sectors;
> + }
> +
> + if (overlap_front(bb, prev, &bad)) {
> + if (can_merge_front(bb, prev, &bad)) {
> + len = front_merge(bb, prev, &bad);
> + added++;
> + hint = prev - 1;
> + } else {
> + int extra = 0;
> +
> + if (!can_front_overwrite(bb, prev, &bad, &extra)) {
> + len = min_t(sector_t, BB_END(p[prev]) - s, sectors);
> + hint = prev;
> + goto update_sectors;
> + }
> +
> + len = front_overwrite(bb, prev, &bad, extra);
> + added++;
> + bb->count += extra;
> + hint = prev;
> +
> + if (prev > 0 && can_combine_front(bb, prev, &bad)) {
> + front_combine(bb, prev);
> + bb->count--;
> + hint = prev - 1;
> }
> - sectors = e - s;
> - lo = hi;
> - hi++;
> }
> + goto update_sectors;
> + }
> +
> + if (can_merge_front(bb, prev, &bad)) {
> + len = front_merge(bb, prev, &bad);
> + added++;
> + hint = prev;
> + goto update_sectors;
> }
> - if (sectors == 0 && hi < bb->count) {
> - /* we might be able to combine lo and hi */
> - /* Note: 's' is at the end of 'lo' */
> - sector_t a = BB_OFFSET(p[hi]);
> - int lolen = BB_LEN(p[lo]);
> - int hilen = BB_LEN(p[hi]);
> - int newlen = lolen + hilen - (s - a);
> -
> - if (s >= a && newlen < BB_MAX_LEN) {
> - /* yes, we can combine them */
> - int ack = BB_ACK(p[lo]) && BB_ACK(p[hi]);
> -
> - p[lo] = BB_MAKE(BB_OFFSET(p[lo]), newlen, ack);
> - memmove(p + hi, p + hi + 1,
> - (bb->count - hi - 1) * 8);
> - bb->count--;
> +
> + /* if no space in table, still try to merge in the covered range */
> + if (badblocks_full(bb)) {
> + /* skip the cannot-merge range */
> + if (((prev + 1) < bb->count) &&
> + overlap_behind(bb, &bad, prev + 1) &&
> + ((s + sectors) >= BB_END(p[prev + 1]))) {
> + len = BB_END(p[prev + 1]) - s;
> + hint = prev + 1;
> + goto update_sectors;
> }
> +
> + /* no retry any more */
> + len = sectors;
> + space_desired = 1;
> + hint = -1;
> + goto update_sectors;
> }
> - while (sectors) {
> - /* didn't merge (it all).
> - * Need to add a range just before 'hi'
> - */
> - if (bb->count >= MAX_BADBLOCKS) {
> - /* No room for more */
> - rv = 1;
> - break;
> - } else {
> - int this_sectors = sectors;
>
> - memmove(p + hi + 1, p + hi,
> - (bb->count - hi) * 8);
> - bb->count++;
> + /* cannot merge and there is space in bad table */
> + if (overlap_behind(bb, &bad, prev + 1))
> + bad.len = min_t(sector_t, bad.len, BB_OFFSET(p[prev + 1]) - bad.start);
>
> - if (this_sectors > BB_MAX_LEN)
> - this_sectors = BB_MAX_LEN;
> - p[hi] = BB_MAKE(s, this_sectors, acknowledged);
> - sectors -= this_sectors;
> - s += this_sectors;
> - }
> + len = insert_at(bb, prev + 1, &bad);
> + bb->count++;
> + added++;
> + hint = prev + 1;
> +
> +update_sectors:
> + s += len;
> + sectors -= len;
> +
> + if (sectors > 0)
> + goto re_insert;
> +
> + WARN_ON(sectors < 0);
> +
> + /* Check whether the following already set range can be merged */
> + if ((prev + 1) < bb->count &&
> + BB_END(p[prev]) == BB_OFFSET(p[prev + 1]) &&
> + (BB_LEN(p[prev]) + BB_LEN(p[prev + 1])) <= BB_MAX_LEN &&
> + BB_ACK(p[prev]) == BB_ACK(p[prev + 1])) {
> + p[prev] = BB_MAKE(BB_OFFSET(p[prev]),
> + BB_LEN(p[prev]) + BB_LEN(p[prev + 1]),
> + BB_ACK(p[prev]));
> +
> + if ((prev + 2) < bb->count)
> + memmove(p + prev + 1, p + prev + 2,
> + (bb->count - (prev + 2)) * 8);
> + bb->count--;
> + }
> +
> + if (space_desired && !badblocks_full(bb)) {
> + s = bad.orig_start;
> + sectors = bad.orig_len;
> + if (retried++ < 3)
> + goto re_insert;
> + }
> +
> +out:
> + if (added) {
> + set_changed(bb);
> +
> + if (!acknowledged)
> + bb->unacked_exist = 1;
> + else
> + badblocks_update_acked(bb);
> }
>
> - bb->changed = 1;
> - if (!acknowledged)
> - bb->unacked_exist = 1;
> - else
> - badblocks_update_acked(bb);
> write_sequnlock_irqrestore(&bb->lock, flags);
>
> + if (!added)
> + rv = 1;
> +
> return rv;
> }
> EXPORT_SYMBOL_GPL(badblocks_set);
> @@ -423,6 +969,115 @@ int badblocks_clear(struct badblocks *bb, sector_t s, int sectors)
> }
> EXPORT_SYMBOL_GPL(badblocks_clear);
>
> +/**
> + * badblocks_check() - check a given range for bad sectors
> + * @bb: the badblocks structure that holds all badblock information
> + * @s: sector (start) at which to check for badblocks
> + * @sectors: number of sectors to check for badblocks
> + * @first_bad: pointer to store location of the first badblock
> + * @bad_sectors: pointer to store number of badblocks after @first_bad
> + *
> + * We can record which blocks on each device are 'bad' and so just
> + * fail those blocks, or that stripe, rather than the whole device.
> + * Entries in the bad-block table are 64bits wide. This comprises:
> + * Length of bad-range, in sectors: 0-511 for lengths 1-512
> + * Start of bad-range, sector offset, 54 bits (allows 8 exbibytes)
> + * A 'shift' can be set so that larger blocks are tracked and
> + * consequently larger devices can be covered.
> + * 'Acknowledged' flag - 1 bit. - the most significant bit.
> + *
> + * Locking of the bad-block table uses a seqlock so badblocks_check
> + * might need to retry if it is very unlucky.
> + * We will sometimes want to check for bad blocks in a bi_end_io function,
> + * so we use the write_seqlock_irq variant.
> + *
> + * When looking for a bad block we specify a range and want to
> + * know if any block in the range is bad. So we binary-search
> + * to the last range that starts at-or-before the given endpoint,
> + * (or "before the sector after the target range")
> + * then see if it ends after the given start.
> + *
> + * Return:
> + * 0: there are no known bad blocks in the range
> + * 1: there are known bad block which are all acknowledged
> + * -1: there are bad blocks which have not yet been acknowledged in metadata.
> + * plus the start/length of the first bad section we overlap.
> + */
> +int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> + sector_t *first_bad, int *bad_sectors)
> +{
> + int hi;
> + int lo;
> + u64 *p = bb->page;
> + int rv;
> + sector_t target = s + sectors;
> + unsigned seq;
> +
> + if (bb->shift > 0) {
> + /* round the start down, and the end up */
> + s >>= bb->shift;
> + target += (1<<bb->shift) - 1;
> + target >>= bb->shift;
> + sectors = target - s;
> + }
> + /* 'target' is now the first block after the bad range */
> +
> +retry:
> + seq = read_seqbegin(&bb->lock);
> + lo = 0;
> + rv = 0;
> + hi = bb->count;
> +
> + /* Binary search between lo and hi for 'target'
> + * i.e. for the last range that starts before 'target'
> + */
> + /* INVARIANT: ranges before 'lo' and at-or-after 'hi'
> + * are known not to be the last range before target.
> + * VARIANT: hi-lo is the number of possible
> + * ranges, and decreases until it reaches 1
> + */
> + while (hi - lo > 1) {
> + int mid = (lo + hi) / 2;
> + sector_t a = BB_OFFSET(p[mid]);
> +
> + if (a < target)
> + /* This could still be the one, earlier ranges
> + * could not.
> + */
> + lo = mid;
> + else
> + /* This and later ranges are definitely out. */
> + hi = mid;
> + }
> + /* 'lo' might be the last that started before target, but 'hi' isn't */
> + if (hi > lo) {
> + /* need to check all range that end after 's' to see if
> + * any are unacknowledged.
> + */
> + while (lo >= 0 &&
> + BB_OFFSET(p[lo]) + BB_LEN(p[lo]) > s) {
> + if (BB_OFFSET(p[lo]) < target) {
> + /* starts before the end, and finishes after
> + * the start, so they must overlap
> + */
> + if (rv != -1 && BB_ACK(p[lo]))
> + rv = 1;
> + else
> + rv = -1;
> + *first_bad = BB_OFFSET(p[lo]);
> + *bad_sectors = BB_LEN(p[lo]);
> + }
> + lo--;
> + }
> + }
> +
> + if (read_seqretry(&bb->lock, seq))
> + goto retry;
> +
> + return rv;
> +}
> +EXPORT_SYMBOL_GPL(badblocks_check);
> +
> /**
> * ack_all_badblocks() - Acknowledge all bad blocks in a list.
> * @bb: the badblocks structure that holds all badblock information
> diff --git a/include/linux/badblocks.h b/include/linux/badblocks.h
> index 2426276b9bd3..b4bd997a53a4 100644
> --- a/include/linux/badblocks.h
> +++ b/include/linux/badblocks.h
> @@ -15,6 +15,7 @@
> #define BB_OFFSET(x) (((x) & BB_OFFSET_MASK) >> 9)
> #define BB_LEN(x) (((x) & BB_LEN_MASK) + 1)
> #define BB_ACK(x) (!!((x) & BB_ACK_MASK))
> +#define BB_END(x) (BB_OFFSET(x) + BB_LEN(x))
> #define BB_MAKE(a, l, ack) (((a)<<9) | ((l)-1) | ((u64)(!!(ack)) << 63))
>
> /* Bad block numbers are stored sorted in a single page.
> @@ -41,6 +42,14 @@ struct badblocks {
> sector_t size; /* in sectors */
> };
>
> +struct bad_context {
> + sector_t start;
> + sector_t len;
> + int ack;
> + sector_t orig_start;
> + sector_t orig_len;
> +};
> +
> int badblocks_check(struct badblocks *bb, sector_t s, int sectors,
> sector_t *first_bad, int *bad_sectors);
> int badblocks_set(struct badblocks *bb, sector_t s, int sectors,
> @@ -54,6 +63,7 @@ int badblocks_init(struct badblocks *bb, int enable);
> void badblocks_exit(struct badblocks *bb);
> struct device;
> int devm_init_badblocks(struct device *dev, struct badblocks *bb);
> +
> static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
> {
> if (bb->dev != dev) {
> @@ -63,4 +73,27 @@ static inline void devm_exit_badblocks(struct device *dev, struct badblocks *bb)
> }
> badblocks_exit(bb);
> }
> +
> +static inline int badblocks_full(struct badblocks *bb)
> +{
> + return (bb->count >= MAX_BADBLOCKS);
> +}
> +
> +static inline int badblocks_empty(struct badblocks *bb)
> +{
> + return (bb->count == 0);
> +}
> +
> +static inline void set_changed(struct badblocks *bb)
> +{
> + if (bb->changed != 1)
> + bb->changed = 1;
> +}
> +
> +static inline void clear_changed(struct badblocks *bb)
> +{
> + if (bb->changed != 0)
> + bb->changed = 0;
> +}
> +
> #endif
> --
> 2.26.2
>