Re: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all vcpus
From: Vitaly Kuznetsov
Date: Mon Apr 27 2015 - 09:30:38 EST
KY Srinivasan <kys@xxxxxxxxxxxxx> writes:
>> -----Original Message-----
>> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
>> Sent: Friday, April 24, 2015 2:05 AM
>> To: Dexuan Cui
>> Cc: KY Srinivasan; Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-
>> kernel@xxxxxxxxxxxxxxx
>> Subject: Re: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among
>> all vcpus
>>
>> Dexuan Cui <decui@xxxxxxxxxxxxx> writes:
>>
>> >> -----Original Message-----
>> >> From: Vitaly Kuznetsov [mailto:vkuznets@xxxxxxxxxx]
>> >> Sent: Tuesday, April 21, 2015 22:28
>> >> To: KY Srinivasan
>> >> Cc: Haiyang Zhang; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-
>> >> kernel@xxxxxxxxxxxxxxx; Dexuan Cui
>> >> Subject: [PATCH 5/6] Drivers: hv: vmbus: distribute subchannels among all
>> >> vcpus
>> >>
>> >> Primary channels are distributed evenly across all vcpus we have. When the
>> >> host asks us to create subchannels it usually makes us num_cpus-1 offers
>> >
>> > Hi Vitaly,
>> > AFAIK, in the VSP of storvsc, the number of subchannels is
>> > (the_number_of_vcpus - 1) / 4.
>> >
>> > This means for an 8-vCPU guest, there is only 1 subchannel.
>> >
>> > Your new algorithm tends to make the low-numbered vCPUs busier:
>> > e.g., in the 8-vCPU case, assuming we have 4 SCSI controllers:
>> > vCPU0: scsi0's PrimaryChannel (P)
>> > vCPU1: scsi0's SubChannel (S) + scsi1's P
>> > vCPU2: scsi1's S + scsi2's P
>> > vCPU3: scsi2's S + scsi3's P
>> > vCPU4: scsi3's S
>> > vCPU5, 6 and 7 are idle.
>> >
>> > In this special case, the existing algorithm is better. :-)
>> >
>> > However, I do like the idea in your patch of making sure a device's
>> > primary/sub channels are assigned to different vCPUs.
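
(Side note: a quick userspace sketch - hypothetical, not the actual vmbus
code - that reproduces the layout Dexuan describes above, assuming each
device's subchannels simply continue round-robin from the vCPU its primary
landed on:)

/* Standalone illustration only: 8 vCPUs, 4 SCSI controllers,
 * (8 - 1) / 4 = 1 subchannel per controller. */
#include <stdio.h>

#define NR_VCPUS   8
#define NR_DEVICES 4
#define NR_SUBCH   1

int main(void)
{
	int dev, sub;

	for (dev = 0; dev < NR_DEVICES; dev++) {
		/* primaries land on vCPU0..3, as in the example above */
		int primary_cpu = dev % NR_VCPUS;

		printf("scsi%d P -> vCPU%d\n", dev, primary_cpu);
		/* each subchannel continues from the primary's vCPU */
		for (sub = 1; sub <= NR_SUBCH; sub++)
			printf("scsi%d S -> vCPU%d\n",
			       dev, (primary_cpu + sub) % NR_VCPUS);
	}
	return 0;
}

Running it prints exactly the P/S placement listed above, which is why the
low-numbered vCPUs end up doubly loaded while vCPU5-7 stay idle.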
>>
>> Under special circumstances the current code can end up placing all
>> subchannels on the same vCPU as the primary channel, I guess :-) This
>> is not common, but it is possible.
>>
>> >
>> > I'm just wondering if we should use an even better (and more complex)
>> > algorithm :-)
>>
>> The question here is - does sticking to the current vCPU help? If it
>> does, I can suggest the following (I think I even mentioned it in my
>> PATCH 00): first we try to find a (sub)channel with target_cpu ==
>> current_vcpu, and only when that fails do we fall back to round robin.
>> I'd like to hear K.Y.'s opinion here as he's the original author :-)
>
> Sorry for the delayed response. Initially I had implemented a scheme that
> would pick an outgoing CPU closest to the CPU on which the request came in
> (to maintain cache locality, especially on NUMA systems). I changed this
> algorithm to spread the load more uniformly as we were trying to improve
> Linux IOPS on Azure XIO (premium storage). We are currently testing this
> code on our Converged Offering - CPS - and I am finding that the perf as
> measured by IOPS has regressed. I have not narrowed down the reason for
> this regression, and it may very well be the change in the algorithm for
> selecting the outgoing channel. In general, I don't think the logic here
> needs to be exact, but locality (being on the same CPU or within the same
> NUMA node) is important. Any change to this algorithm will have to be
> validated in different MSFT environments (Azure XIO, CPS etc.).
Thanks, can you please compare the two algorithms here:
1) Simple round robin (the one my patch series implements, with the issues
fixed; I'll send v2).
2) Try to find a (sub)channel with a matching vCPU and fall back to round
robin when that fails (I can actually include it in v2; see the rough
sketch below).
We can then decide based on the test results.
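
To make 2) concrete, here is a rough userspace sketch (hypothetical struct
and helper names, not the real vmbus code) of "matching vCPU first, round
robin as fallback" for picking the outgoing channel:

/* Illustration only: pick a channel for a request arriving on current_cpu. */
struct channel {
	int target_cpu;		/* vCPU this channel interrupts */
};

struct channel *pick_channel(struct channel *chans, int nr_chans,
			     int current_cpu, unsigned int *rr_counter)
{
	int i;

	/* 1) Prefer a (sub)channel already bound to the current vCPU. */
	for (i = 0; i < nr_chans; i++)
		if (chans[i].target_cpu == current_cpu)
			return &chans[i];

	/* 2) No match - fall back to plain round robin across all channels. */
	return &chans[(*rr_counter)++ % nr_chans];
}

The idea is that a request keeps cache/NUMA locality whenever a channel is
already bound to the CPU it came in on, and we only pay the round-robin
spread when no such channel exists.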
>
> Regards,
>
> K. Y
--
Vitaly