Re: [PATCH v2 3/4] x86, mwaitt: introduce mwaix delay with a configurable timer

From: Andy Lutomirski
Date: Tue Jun 09 2015 - 14:26:53 EST


On Tue, Jun 9, 2015 at 10:13 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> On Tue, 2015-06-09 at 09:46 -0700, Andy Lutomirski wrote:
>> On Jun 9, 2015 2:30 AM, "Peter Zijlstra" <peterz@xxxxxxxxxxxxx> wrote:
>
>> > How about you think instead and do something like:
>> >
>> > rdtsc(start);
>> > rdtsc_barrier();
>>
>> Other way around. We really need a function static inline u64
>> rdtsc_with_barrier().
>
> So admittedly I have not actually looked at how the tsc barrier stuff
> works, but what? We don't care if the rdtsc goes up, we just want to
> make sure its done before going further.
>

When I looked at the rdtsc ordering a couple years ago, I thought
about what it meant for rdtsc to be properly ordered. I decided that
proper rdtsc ordering meant that no one should ever be able to tell if
rdtsc ends up reordered. Concretely, I think that rdtsc should be
ordered like an x86 load from a shared memory location. The manuals
are vague but, after a decent amount of experimentation,
rdtsc_barrier(); rdtsc() seems to achieve that on all CPUs. With the
barrier, the rdtsc won't happen before a prior load in the same
thread, and no CPU seems to ever execute rdtsc after a subsequent
memory access.

For delay in particular, we care about I/O delays as well, so
presumably we want to guarantee that a prior load or store to be
visible for at least the requested amount of time before the next I/O.
This should be fine here, too, as MMIO stores aren't ordered anyway
(drivers need a dummy load afterwards) and I bet that the in and out
instructions are already ordered strongly enough.

I'll send a patch to improve the headers.

--Andy

>>
>> >
>> > for (;;) {
>> > delay = min(MWAIT_MAX_LOOPS, loops);
>> >
>> > __monitorx(&addr, 0, 0);
>> > mwaitx(delay, true);
>>
>> I don't like this hack. The compiler is entirely within is rights to
>> poke addr's cacheline (i.e. the stack) between the two instructions.
>> I'd suggest either making the thing a full cacheline long or using a
>> single asm statement.
>>
>> Also, "addr" is a bad name for a dummy variable that isn't an address
>> at all. How about "dummy"?
>
> Sure, and I like your question on why monitorx exists at all. But none
> of that was the point here, the main point being that if loops was too
> big, we should do multiple mwaitx invocations, not punt and busy loop.
>
>> > rdtsc_barrier();
>> > rdtsc(end);
>> > rdtsc_barrier();
>>
>> The second barrier is unnecessary.
>
> By virtue of the address dependency?

No, it's just that CPUs seem to work this way.

--Andy
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/