Re: [PATCH BUGFIX 1/3] null_blk: set a separate timer for each command

From: Paolo Valente
Date: Tue Nov 03 2015 - 04:02:28 EST



Il giorno 02/nov/2015, alle ore 17:14, Jens Axboe <axboe@xxxxxx> ha scritto:

> On 11/02/2015 07:31 AM, Paolo Valente wrote:
>> For the Timer IRQ mode (i.e., when command completions are delayed),
>> there is one timer for each CPU. Each of these timers
>> . has a completion queue associated with it, containing all the
>> command completions to be executed when the timer fires;
>> . is set, and a new completion-to-execute is inserted into its
>> completion queue, every time the dispatch code for a new command
>> happens to be executed on the CPU related to the timer.
>>
>> This implies that, if the dispatch of a new command happens to be
>> executed on a CPU whose timer has already been set, but has not yet
>> fired, then the timer is set again, to the completion time of the
>> newly arrived command. When the timer eventually fires, all its queued
>> completions are executed.
>>
>> This way of handling delayed command completions entails the following
>> problem: if more than one command completion is inserted into the
>> queue of a timer before the timer fires, then the expiration time for
>> the timer is moved forward every time each of these completions is
>> enqueued. As a consequence, only the last completion enqueued enjoys a
>> correct execution time, while all previous completions are unjustly
>> delayed until the last completion is executed (and at that time they
>> are executed all together).
>>
>> Specifically, if all the above completions are enqueued almost at the
>> same time, then the problem is negligible. On the opposite end, if
>> every completion is enqueued a while after the previous completion was
>> enqueued (in the extreme case, it is enqueued only right before the
>> timer would have expired), then every enqueued completion, except for
>> the last one, experiences an inflated delay, proportional to the number
>> of completions enqueued after it. In the end, commands, and thus I/O
>> requests, may be completed at an arbitrarily lower rate than the
>> desired one.
>>
>> This commit addresses this issue by replacing per-CPU timers with
>> per-command timers, i.e., by associating an individual timer with each
>> command.
>
> Functionally the patch looks fine. My only worry is that a timer per command would be an unnecessary slowdown compared to pushing one timer forward. The problem should be fixable by still doing that, just maintaining next-expire instead. Maybe something that would still roughly be precise enough, while still getting some completion batching going? Or maybe that would be slower, and the individual timers are still better.
>
> Comments?
>

I have tried to think about these questions. Unfortunately I was not able to go beyond the following general considerations.

Given that:
1) hrtimer_start is very efficient;
2) to implement a batch-by-batch execution of command completions, a more complex code would of course be needed in null_blk;
I think that, to get a perceptible improvement, command completions should probably be executed in batches of at least 3 or 4 commands each.

In this respect, commands can accumulate in a batch only if the time granularity with which delayed commands are guaranteed to be executed, i.e, the granularity with which timers eventually fire, happens to be coarser than completion_nsec. But this would lead to the exact uncontrolled throughput-loss problem addressed by this patch, although at a lesser extent. The reason is as follows.

Consider a batch of 3 or 4 commands, executed at a certain timer expiration. If the arrival time of these commands is close to the timer expiration, then the commands happen to be executed with a delay close to the expected one. If, on the other hand, their arrival time is far from the timer expiration, then they are completed with a larger delay. In the worst case, their delay is inflated by a factor proportional to the size of the batch, namely 3 or 4.

In the end, depending on the command arrival pattern, the throughput of the emulated device may vary by 3 or 4 times. This would be a reduced version of the issue addressed by this patch. In fact, with the current implementation of delayed completions (per-cpu timers), the throughput may vary, in the worst case, by a factor equal to the queue depth (64 by default).

Finally, a side note: as for whether the increased efficiency of batched delayed completions is worth additional code complexity as well as throughput losses/fluctuations, the low delays for which this efficiency becomes important match the latencies of new very fast devices (or are at least in the same order). But, exactly for the high speed of these devices, what typically matters (AFAIK) in tests with these devices is understanding whether the rest of the system can cope with their speed. And for this type of measures, the best/typical choice is (again, AFAIK) not to delay request completions at all in null_blk.

I hope these considerations may be somehow useful.

Thanks,
Paolo

> --
> Jens Axboe


--
Paolo Valente
Algogroup
Dipartimento di Fisica, Informatica e Matematica
Via Campi, 213/B
41125 Modena - Italy
homepage: http://algogroup.unimore.it/people/paolo/

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/