Re: [PATCH md-6.9 03/10] md/raid1: fix choose next idle in read_balance()

From: Luse, Paul E
Date: Tue Feb 27 2024 - 09:27:13 EST




> On Feb 26, 2024, at 9:49 PM, Xiao Ni <xni@xxxxxxxxxx> wrote:
>
> On Tue, Feb 27, 2024 at 10:38 AM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>>
>> Hi,
>>
>> On 2024/02/27 10:23, Xiao Ni wrote:
>>> On Thu, Feb 22, 2024 at 4:04 PM Yu Kuai <yukuai1@xxxxxxxxxxxxxxx> wrote:
>>>>
>>>> From: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>>
>>>> Commit 12cee5a8a29e ("md/raid1: prevent merging too large request") added
>>>> the 'choose next idle' case in read_balance():
>>>>
>>>> read_balance:
>>>>  for_each_rdev
>>>>   if (next_seq_sect == this_sector || dist == 0)
>>>>    -> sequential reads
>>>>    best_disk = disk;
>>>>    if (...)
>>>>     choose_next_idle = 1
>>>>     continue;
>>>>
>>>>  for_each_rdev
>>>>   -> iterate next rdev
>>>>   if (pending == 0)
>>>>    best_disk = disk;
>>>>    -> choose the next idle disk
>>>>    break;
>>>>
>>>>  if (choose_next_idle)
>>>>   -> keep using this rdev if there is no other idle disk
>>>>   continue
>>>>
>>>> However, commit 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> removed the code:
>>>>
>>>> -                /* If device is idle, use it */
>>>> -                if (pending == 0) {
>>>> -                        best_disk = disk;
>>>> -                        break;
>>>> -                }
>>>>
>>>> Hence 'choose next idle' will never work now. Fix this problem with the
>>>> following:
>>>>
>>>> 1) don't set best_disk in this case; read_balance() will choose the best
>>>>    disk after iterating over all the disks;
>>>> 2) add 'pending' so that another idle disk will be chosen;
>>>> 3) set 'dist' to 0 so that, if there is no other idle disk and all disks
>>>>    are rotational, this disk will still be chosen;
>>>>
>>>> Fixes: 2e52d449bcec ("md/raid1: add failfast handling for reads.")
>>>> Co-developed-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx>
>>>> Signed-off-by: Paul Luse <paul.e.luse@xxxxxxxxxxxxxxx>
>>>> Signed-off-by: Yu Kuai <yukuai3@xxxxxxxxxx>
>>>> ---
>>>> drivers/md/raid1.c | 21 ++++++++++++---------
>>>> 1 file changed, 12 insertions(+), 9 deletions(-)
>>>>
>>>> diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
>>>> index c60ea58ae8c5..d0bc67e6d068 100644
>>>> --- a/drivers/md/raid1.c
>>>> +++ b/drivers/md/raid1.c
>>>> @@ -604,7 +604,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>          unsigned int min_pending;
>>>>          struct md_rdev *rdev;
>>>>          int choose_first;
>>>> -        int choose_next_idle;
>>>>
>>>>          /*
>>>>           * Check if we can balance. We can balance on the whole
>>>> @@ -619,7 +618,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>          best_pending_disk = -1;
>>>>          min_pending = UINT_MAX;
>>>>          best_good_sectors = 0;
>>>> -        choose_next_idle = 0;
>>>>          clear_bit(R1BIO_FailFast, &r1_bio->state);
>>>>
>>>>          if ((conf->mddev->recovery_cp < this_sector + sectors) ||
>>>> @@ -712,7 +710,6 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                          int opt_iosize = bdev_io_opt(rdev->bdev) >> 9;
>>>>                          struct raid1_info *mirror = &conf->mirrors[disk];
>>>>
>>>> -                        best_disk = disk;
>>>>                          /*
>>>>                           * If buffered sequential IO size exceeds optimal
>>>>                           * iosize, check if there is idle disk. If yes, choose
>>>> @@ -731,15 +728,21 @@ static int read_balance(struct r1conf *conf, struct r1bio *r1_bio, int *max_sect
>>>>                              mirror->next_seq_sect > opt_iosize &&
>>>>                              mirror->next_seq_sect - opt_iosize >=
>>>>                              mirror->seq_start) {
>>>> -                                choose_next_idle = 1;
>>>> -                                continue;
>>>> +                                /*
>>>> +                                 * Add 'pending' to avoid choosing this disk if
>>>> +                                 * there is other idle disk.
>>>> +                                 * Set 'dist' to 0, so that if there is no other
>>>> +                                 * idle disk and all disks are rotational, this
>>>> +                                 * disk will still be chosen.
>>>> +                                 */
>>>> +                                pending++;
>>>> +                                dist = 0;
>>>> +                        } else {
>>>> +                                best_disk = disk;
>>>> +                                break;
>>>>                          }
>>>> -                        break;
>>>>                  }
>>>
>>> Hi Kuai
>>>
>>> I noticed something. In patch 12cee5a8a29e, it sets best_disk if it's
>>> a sequential read. If there are no other idle disks, it will read from
>>> the sequential disk. With this patch, it reads from the
>>> best_pending_disk even if min_pending is not 0. It looks like wrong
>>> behaviour?
>>
>> Yes, nice catch, I hadn't noticed this yet... So there is a hidden
>> rule: sequential IO gets higher priority than the minimal 'pending'
>> selection, and it only loses to 'choose_next_idle' when an idle disk
>> exists.
>
> Yes.
>
>
>>
>> Looks like, if we want to keep this behaviour, we can add a
>> 'sequential_disk':
>>
>> if (is_sequential()) {
>>         if (!should_choose_next())
>>                 return disk;
>>         ctl.sequential_disk = disk;
>> }
>>
>> ...
>>
>> if (ctl.min_pending != 0 && ctl.sequential_disk != -1)
>>         return ctl.sequential_disk;
>
> Agree with this, thanks :)
>
> Best Regards
> Xiao

Yup, agree as well. This will definitely help with the follow-up to this series for sequential read improvements :)
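
Just to make sure we all mean the same thing, below is a rough toy model (plain userspace C, compiles standalone; choose_disk(), struct disk_state and its fields are made up for illustration, and the dist/rotational logic is left out entirely, so this is not the raid1 code). It encodes the rule we're agreeing on: a disk doing sequential IO is only given up when it has already exceeded opt_iosize *and* some other disk is truly idle; if nothing is idle it still wins over the least-pending disk.

#include <limits.h>
#include <stdio.h>

/*
 * Toy model only, not drivers/md/raid1.c. 'over_opt' stands in for the
 * "buffered sequential IO exceeds opt_iosize" test in read_balance().
 */
struct disk_state {
        int pending;      /* in-flight IOs on this disk */
        int sequential;   /* this read continues a sequential stream */
        int over_opt;     /* sequential stream already exceeds opt_iosize */
};

static int choose_disk(const struct disk_state *d, int ndisks)
{
        int min_pending = INT_MAX;
        int min_pending_disk = -1;
        int sequential_disk = -1;

        for (int disk = 0; disk < ndisks; disk++) {
                if (d[disk].sequential) {
                        /* keep the sequential disk unless it is overloaded */
                        if (!d[disk].over_opt)
                                return disk;
                        sequential_disk = disk;
                }

                if (d[disk].pending < min_pending) {
                        min_pending = d[disk].pending;
                        min_pending_disk = disk;
                }
        }

        /*
         * A disk doing sequential IO only loses to a disk that is truly
         * idle; if nothing is idle, keep reading from it.
         */
        if (sequential_disk != -1 && min_pending != 0)
                return sequential_disk;

        return min_pending_disk;
}

int main(void)
{
        /* disk 0 is sequential but overloaded; no disk is idle */
        struct disk_state no_idle[] = {
                { .pending = 4, .sequential = 1, .over_opt = 1 },
                { .pending = 2 },
                { .pending = 1 },
        };
        /* same, but disk 2 is idle */
        struct disk_state one_idle[] = {
                { .pending = 4, .sequential = 1, .over_opt = 1 },
                { .pending = 2 },
                { .pending = 0 },
        };

        printf("no idle disk  -> disk %d\n", choose_disk(no_idle, 3));  /* 0 */
        printf("one idle disk -> disk %d\n", choose_disk(one_idle, 3)); /* 2 */
        return 0;
}

If that matches what you both have in mind, the 'ctl.min_pending != 0 && ctl.sequential_disk != -1' check in your sketch is the piece that preserves the old behaviour Xiao pointed out.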

>>
>> Thanks,
>> Kuai
>>
>>>
>>> Best Regards
>>> Xiao
>>>>
>>>> -                if (choose_next_idle)
>>>> -                        continue;
>>>> -
>>>>                  if (min_pending > pending) {
>>>>                          min_pending = pending;
>>>>                          best_pending_disk = disk;
>>>> --
>>>> 2.39.2
>>>>
>>>>
>>>
>>> .
>>>
>>
>
>