Re: [PATCH 5/7] RISC-V: arch/riscv/lib

From: Palmer Dabbelt
Date: Fri Jun 23 2017 - 19:24:54 EST

On Wed, 07 Jun 2017 00:35:04 PDT (-0700), Arnd Bergmann wrote:
> On Tue, Jun 6, 2017 at 10:53 PM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>> On Tue, 06 Jun 2017 02:31:02 PDT (-0700), Arnd Bergmann wrote:
>>> On Tue, Jun 6, 2017 at 6:56 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>> On Fri, 26 May 2017 02:06:58 PDT (-0700), Arnd Bergmann wrote:
>>>>> On Thu, May 25, 2017 at 3:59 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>>>> On Tue, 23 May 2017 04:19:42 PDT (-0700), Arnd Bergmann wrote:
>>>>>>> On Tue, May 23, 2017 at 2:41 AM, Palmer Dabbelt <palmer@xxxxxxxxxxx> wrote:
>>>>>>> Also, it would be good to replace the multiply+div64
>>>>>>> with a single multiplication here, see how x86 and arm do it
>>>>>>> (for the tsc/__timer_delay case).
>>>>>> Makes sense. I think this should do it
>>>>>> but I'm finding this hard to test as this only works for 2ms sleeps. It seems
>>>>>> at least in the right ballpark
>>>>> + if (usecs > MAX_UDELAY_US) {
>>>>> + __delay((u64)usecs * riscv_timebase / 1000000ULL);
>>>>> + return;
>>>>> + }
>>>>> You still do the 64-bit division here. What I meant is to completely
>>>>> avoid the division and use a multiply+shift.
>>>> The goal here was to avoid the error case that ARM has on overflow and instead
>>>> just delay for the requested time. This should only divide when the delay is
>>>>>=2ms, so the division won't cost much in comparison.
>>>> The normal case should have no division in it.
>>>> I can copy ARM's error handling if you think that's better, but it seemed more
>>>> complicated than just computing the correct answer.
>>> I think the intention originally was to avoid overflowing the 32-bit
>>> argument in
>>> void __delay(unsigned long cycles)
>>> If you need to delay for more than 4 billion clocksource cycles,
>>> your code is still broken.
>> Maybe I'm crazy, but I thought the goal was to avoid overflowing on the
>> multiply. Specifically, the code looks like
>> udelay(long input) {
>> long a = input * MUL_VAL;
>> long b = a >> SHIFT_VAL;
>> __delay(b);
>> }
>> so the place there's extra overflow is at computing a, not b (the input to
>> __delay). When I modified the ARM code I went and recalculated the point at
>> which the multiply would overflow and it matched the value from the ARM code,
>> which is 2000us.
> Ah, that's right. But then you might run into the next overflow at a larger
> delay interval.

Well, that requires a huge delay as there's a (u64) cast in there. The
timebase is a u32 which puts the delay at an hour, which seems unlikely :).

>> While I can buy the argument that 2000us is still too long, the real reason I
>> wrote the code this way is because I thought it was easier than having an error
>> case. If you think the error is better then I'll do it that way.
> It's probably ok as long as the complexity is not in the inline wrapper, and you
> avoid the division for small delays.

I put an unlikely() around the comparison and it looks like the complier does
well enough: there are no references to riscv_timebase outside of udelay (so
it's not inline anywhere), and there's just one branch on the hot path to test
for overflow (which is probably free anyway, as it's in the shadow of a mul and
never taken).