Re: [PATCH] blk-mq: Improvements to the hybrid polling sleep time calculation
From: Stephen Bates
Date: Tue Aug 29 2017 - 11:33:23 EST
>> From: Stephen Bates <sbates@xxxxxxxxxxxx>
>>
>> Hybrid polling currently uses half the average completion time as an
>> estimate of how long to poll for. We can improve upon this by noting
>> that polling before the minimum completion time makes no sense. Add a
>> sysfs entry to use this fact to improve CPU utilization in certain
>> cases.
>>
>> At the same time the minimum is a bit too long to sleep for since we
>> must factor in OS wake time for the thread. For now allow the user to
>> set this via a second sysfs entry (in nanoseconds).
>>
>> Testing this patch on Intel Optane SSDs showed that using the minimum
>> rather than half reduced CPU utilization from 59% to 38%. Tuning
>> this via the wake time adjustment allowed us to trade CPU load for
>> latency. For example
>>
>> io_poll  delay  hyb_use_min  adjust  latency  CPU load
>> 1        -1     N/A          N/A     8.4      100%
>> 1        0      0            N/A     8.4      57%
>> 1        0      1            0       10.3     34%
>> 1        9      1            1000    9.9      37%
>> 1        0      1            2000    8.4      47%
>> 1        0      1            10000   8.4      100%
>>
>> Ideally we will extend this to auto-calculate the wake time rather
>> than have it set by the user.
>
> I don't like this, it's another weird knob that will exist but that
> no one will know how to use. For most of the testing I've done
> recently, hybrid is a win over busy polling - hence I think we should
> make that the default. 60% of mean has also, in testing, been shown
> to be a win. So that's an easy fix/change we can consider.
I do agree that this is a hard knob to tune. However, I am not happy that the current hybrid default may mean we start polling well before the minimum completion time. That just seems like a waste of CPU resources to me. I also agree that turning on hybrid polling by default, and perhaps bumping up the default sleep time, is a good idea.
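
To make the two policies concrete, here is a minimal sketch of the sleep-time choice under discussion. It is an illustration only, not the kernel code: the function name and parameters (`hybrid_sleep_ns`, `use_min`, `adjust_ns`) are hypothetical stand-ins for the tracked completion statistics and the two proposed sysfs entries.

```python
def hybrid_sleep_ns(mean_ns: int, min_ns: int,
                    use_min: bool = False, adjust_ns: int = 0) -> int:
    """Return how long to sleep before starting to poll, in nanoseconds.

    mean_ns / min_ns: tracked mean and minimum completion times (assumed inputs).
    use_min:          hypothetical switch mirroring the proposed sysfs entry.
    adjust_ns:        hypothetical wake-time allowance subtracted from the minimum.
    """
    if not use_min:
        # Current behaviour as described in the patch: half the mean.
        return mean_ns // 2
    # Proposed behaviour: sleep up to the minimum completion time, less the
    # wake-time allowance, and never return a negative sleep.
    return max(min_ns - adjust_ns, 0)
```

With this model, an adjustment larger than the minimum completion time yields a sleep of zero, i.e. plain busy polling. That would be consistent with the adjust=10000 row in the table going back to 100% CPU load, although I have not verified that this is the cause.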
> To go beyond that, I'd much rather see us tracking the time waste.
> If we consider the total completion time of an IO to be A+B+C, where:
>
> A Time needed to go to sleep
> B Sleep time
> C Time needed to wake up
>
> then we could feasibly track A+C. We already know how long the IO
> will take to complete, as we track that. At that point we'd have
> a full picture of how long we should sleep.
Yes, this is where I was thinking of taking this functionality in the long term. It seems like tracking C is something other parts of the kernel might need. Does anyone know of any existing code in this space?
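
As a rough sketch of what tracking A+C could look like: measure how much longer each sleep took than was requested, keep a running average of that overhead, and subtract it from the expected completion time to get B. Everything here is an assumption for illustration (the class name, the exponentially weighted average and its weight); it is not existing kernel code.

```python
class SleepOverheadTracker:
    """Track A+C (time to go to sleep plus time to wake up) as a running average."""

    def __init__(self, weight: float = 0.125):
        self.weight = weight      # assumed smoothing factor for the average
        self.overhead_ns = 0.0    # current estimate of A+C
        self.samples = 0

    def record(self, requested_ns: int, elapsed_ns: int) -> None:
        # Overhead sample: how much longer the sleep took than was asked for.
        sample = max(elapsed_ns - requested_ns, 0)
        if self.samples == 0:
            self.overhead_ns = float(sample)
        else:
            self.overhead_ns += self.weight * (sample - self.overhead_ns)
        self.samples += 1

    def sleep_for(self, expected_completion_ns: int) -> int:
        # B = expected completion time - (A + C), clamped at zero.
        return max(int(expected_completion_ns - self.overhead_ns), 0)
```

For example, after one sample where a 4000 ns sleep actually took 6000 ns, the estimated overhead is 2000 ns, so an IO expected to complete in 8400 ns would get a 6400 ns sleep.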
> Bonus points for informing the lower level scheduler of this as
> well. If the CPU is going idle, we'll enter some sort of power
> state in the processor. If we were able to pass in how long we
> expect to sleep, we could be making better decisions here.
Yup. Again, this seems like something more general than just the block layer. I will do some digging and see if anything is available to leverage here.
Cheers
Stephen