RE: [Resend Patch 3/3] Storvsc: Select channel based on available percentage of ring buffer to write
From: Long Li
Date: Fri Apr 13 2018 - 20:26:03 EST
> Subject: RE: [Resend Patch 3/3] Storvsc: Select channel based on available
> percentage of ring buffer to write
>
> > -----Original Message-----
> > From: linux-kernel-owner@xxxxxxxxxxxxxxx
> > <linux-kernel-owner@xxxxxxxxxxxxxxx> On Behalf Of Long Li
> > Sent: Tuesday, March 27, 2018 5:49 PM
> > To: KY Srinivasan <kys@xxxxxxxxxxxxx>; Haiyang Zhang
> > <haiyangz@xxxxxxxxxxxxx>; Stephen Hemminger
> <sthemmin@xxxxxxxxxxxxx>;
> > James E . J . Bottomley <JBottomley@xxxxxxxx>; Martin K . Petersen
> > <martin.petersen@xxxxxxxxxx>; devel@xxxxxxxxxxxxxxxxxxxxxx; linux-
> > scsi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > netdev@xxxxxxxxxxxxxxx
> > Cc: Long Li <longli@xxxxxxxxxxxxx>
> > Subject: [Resend Patch 3/3] Storvsc: Select channel based on available
> > percentage of ring buffer to write
> >
> > From: Long Li <longli@xxxxxxxxxxxxx>
> >
> > This is a best effort for estimating on how busy the ring buffer is
> > for that channel, based on available buffer to write in percentage. It
> > is still possible that at the time of actual ring buffer write, the
> > space may not be available due to other processes may be writing at the
> time.
> >
> > Selecting a channel based on how full it is can reduce the possibility
> > that a ring buffer write will fail, and avoid the situation a channel
> > is over busy.
> >
> > Now it's possible that storvsc can use a smaller ring buffer size
> > (e.g. 40k bytes) to take advantage of cache locality.
> >
> > Signed-off-by: Long Li <longli@xxxxxxxxxxxxx>
> > ---
> > drivers/scsi/storvsc_drv.c | 62
> > +++++++++++++++++++++++++++++++++++++---------
> > 1 file changed, 50 insertions(+), 12 deletions(-)
> >
> > diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
> > index a2ec0bc9e9fa..b1a87072b3ab 100644
> > --- a/drivers/scsi/storvsc_drv.c
> > +++ b/drivers/scsi/storvsc_drv.c
> > @@ -395,6 +395,12 @@ MODULE_PARM_DESC(storvsc_ringbuffer_size,
> "Ring
> > buffer size (bytes)");
> >
> > module_param(storvsc_vcpus_per_sub_channel, int, S_IRUGO);
> > MODULE_PARM_DESC(storvsc_vcpus_per_sub_channel, "Ratio of VCPUs
> to
> > subchannels");
> > +
> > +static int ring_avail_percent_lowater = 10;
>
> Reserving 10% of each ring buffer by default seems like more than is needed
> in the storvsc driver. That would be about 4Kbytes for the 40K ring buffer
> you suggest, and even more for a ring buffer of 128K. Each outgoing record is
> only about 344 bytes (I'd have to check exactly). With the new channel
> selection algorithm below, the only time we use a channel that is already
> below the low water mark is when no channel could be found that is above
> the low water mark. There could be a case of two or more threads deciding
> that a channel is above the low water mark at the same time and both
> choosing it, but that's likely to be rare. So it seems like we could set the
It's not rare for two processes checking on the same channel at the same time, when running multiple processes I/O workload. The CPU to channel is not 1:1 mapping.
> default low water mark to 5 percent or even 3 percent, which will let more of
> the ring buffer be used, and let a channel be assigned according to the
> algorithm, rather than falling through to the default because all channels
> appear to be "full".
It seems it's not about how big ring buffer is, e.g. even you have a ring buffer of infinite size, it won't help with performance if it's getting queued all the time, while other ring buffers are near empty. It's more about how multiple ring buffers are getting utilized in a reasonable and balanced way. Testing shows 10 is a good choice, while 3 is prone to return BUSY and trigger block layer retry.
>
> > +module_param(ring_avail_percent_lowater, int, S_IRUGO);
> > +MODULE_PARM_DESC(ring_avail_percent_lowater,
> > + "Select a channel if available ring size > this in percent");
> > +
> > /*
> > * Timeout in seconds for all devices managed by this driver.
> > */
> > @@ -1285,9 +1291,9 @@ static int storvsc_do_io(struct hv_device
> > *device, {
> > struct storvsc_device *stor_device;
> > struct vstor_packet *vstor_packet;
> > - struct vmbus_channel *outgoing_channel;
> > + struct vmbus_channel *outgoing_channel, *channel;
> > int ret = 0;
> > - struct cpumask alloced_mask;
> > + struct cpumask alloced_mask, other_numa_mask;
> > int tgt_cpu;
> >
> > vstor_packet = &request->vstor_packet; @@ -1301,22 +1307,53 @@
> > static int storvsc_do_io(struct hv_device *device,
> > /*
> > * Select an an appropriate channel to send the request out.
> > */
> > -
> > if (stor_device->stor_chns[q_num] != NULL) {
> > outgoing_channel = stor_device->stor_chns[q_num];
> > - if (outgoing_channel->target_cpu == smp_processor_id()) {
> > + if (outgoing_channel->target_cpu == q_num) {
> > /*
> > * Ideally, we want to pick a different channel if
> > * available on the same NUMA node.
> > */
> > cpumask_and(&alloced_mask, &stor_device-
> >alloced_cpus,
> >
> cpumask_of_node(cpu_to_node(q_num)));
> > - for_each_cpu_wrap(tgt_cpu, &alloced_mask,
> > - outgoing_channel->target_cpu + 1) {
> > - if (tgt_cpu != outgoing_channel->target_cpu)
> {
> > - outgoing_channel =
> > - stor_device->stor_chns[tgt_cpu];
> > - break;
> > +
> > + for_each_cpu_wrap(tgt_cpu, &alloced_mask,
> q_num + 1) {
> > + if (tgt_cpu == q_num)
> > + continue;
> > + channel = stor_device->stor_chns[tgt_cpu];
> > + if (hv_get_avail_to_write_percent(
> > + &channel->outbound)
> > + > ring_avail_percent_lowater)
> {
> > + outgoing_channel = channel;
> > + goto found_channel;
> > + }
> > + }
> > +
> > + /*
> > + * All the other channels on the same NUMA node
> are
> > + * busy. Try to use the channel on the current CPU
> > + */
> > + if (hv_get_avail_to_write_percent(
> > + &outgoing_channel-
> >outbound)
> > + > ring_avail_percent_lowater)
> > + goto found_channel;
> > +
> > + /*
> > + * If we reach here, all the channels on the current
> > + * NUMA node are busy. Try to find a channel in
> > + * other NUMA nodes
> > + */
> > + cpumask_andnot(&other_numa_mask,
> > + &stor_device->alloced_cpus,
> > +
> cpumask_of_node(cpu_to_node(q_num)));
> > +
> > + for_each_cpu(tgt_cpu, &other_numa_mask) {
> > + channel = stor_device->stor_chns[tgt_cpu];
> > + if (hv_get_avail_to_write_percent(
> > + &channel->outbound)
> > + > ring_avail_percent_lowater)
> {
> > + outgoing_channel = channel;
> > + goto found_channel;
> > }
> > }
> > }
> > @@ -1324,7 +1361,7 @@ static int storvsc_do_io(struct hv_device *device,
> > outgoing_channel = get_og_chn(stor_device, q_num);
> > }
> >
> > -
> > +found_channel:
> > vstor_packet->flags |= REQUEST_COMPLETION_FLAG;
> >
> > vstor_packet->vm_srb.length = (sizeof(struct vmscsi_request) - @@
> > -1733,7 +1770,8 @@ static int storvsc_probe(struct hv_device *device,
> > }
> >
> > scsi_driver.can_queue = (max_outstanding_req_per_channel *
> > - (max_sub_channels + 1));
> > + (max_sub_channels + 1)) *
> > + (100 - ring_avail_percent_lowater) / 100;
>
> A minor nit, but the use of parentheses here is inconsistent. There's a set of
> parens around the first two expressions to explicitly code the associativity,
> but not a set to encompass the third term, which must be processed before
> the fourth one is. C does multiplication and division with left to right
> associativity, so the result is as intended.
> But if we're depending on C's default associativity, then that set of parens
> around the first two expression really isn't needed, and one wonders why
> they are there.
>
> Michael
>
> >
> > host = scsi_host_alloc(&scsi_driver,
> > sizeof(struct hv_host_device));
> > --
> > 2.14.1