Re: [RFC PATCH v3 for 4.15 08/24] Provide cpu_opv system call

From: Mathieu Desnoyers
Date: Mon Nov 20 2017 - 13:38:18 EST


----- On Nov 20, 2017, at 12:48 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:

> On Mon, 20 Nov 2017, Mathieu Desnoyers wrote:
>> ----- On Nov 16, 2017, at 6:26 PM, Thomas Gleixner tglx@xxxxxxxxxxxxx wrote:
>> >> +#define NR_PINNED_PAGES_ON_STACK 8
>> >
>> > 8 pinned pages on stack? Which stack?
>>
>> The common cases need to touch few pages, and we can keep the
>> pointers in an array on the kernel stack within the cpu_opv system
>> call.
>>
>> Updating to:
>>
>> /*
>> * A typical invocation of cpu_opv needs only a few pages. Keep struct page
>> * pointers in an array on the stack of the cpu_opv system call up to
>> * this limit, beyond which the array is dynamically allocated.
>> */
>> #define NR_PIN_PAGES_ON_STACK 8
>
> That name still sucks. NR_PAGE_PTRS_ON_STACK would be immediately obvious.

fixed.

>
>> >> + * The operations available are: comparison, memcpy, add, or, and, xor,
>> >> + * left shift, and right shift. The system call receives a CPU number
>> >> + * from user-space as argument, which is the CPU on which those
>> >> + * operations need to be performed. All preparation steps such as
>> >> + * loading pointers, and applying offsets to arrays, need to be
>> >> + * performed by user-space before invoking the system call. The
>> >
>> > loading pointers and applying offsets? That makes no sense.
>>
>> Updating to:
>>
>> * All preparation steps such as
>> * loading base pointers, and adding offsets derived from the current
>> * CPU number, need to be performed by user-space before invoking the
>> * system call.
>
> This still does not explain anything, really.
>
> Which base pointer is loaded? I nowhere see a reference to a base
> pointer.
>
> And what are the offsets about?
>
> derived from current cpu number? What is current CPU number? The one on
> which the task executes now or the one which it should execute on?
>
> I assume what you want to say is:
>
> All pointers in the ops must have been set up to point to the per CPU
> memory of the CPU on which the operations should be executed.
>
> At least that's what I oracle in to that.

Exactly that. Will update to use this description instead.
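For the record, the user-space preparation step amounts to something like the
sketch below: compute the address of the target CPU's per-CPU slot (base
pointer plus an offset derived from the CPU number) before invoking cpu_opv.
All names here are illustrative, not taken from the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative only: user-space resolves the per-CPU pointer itself,
 * so the kernel receives fully-formed addresses pointing at the memory
 * of the CPU the operations should execute on.  "stride" is the size
 * of one per-CPU element. */
static void *percpu_ptr(void *base, int cpu, size_t stride)
{
	return (char *)base + (size_t)cpu * stride;
}
```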

>
>> >> + * "comparison" operation can be used to check that the data used in the
>> >> + * preparation step did not change between preparation of system call
>> >> + * inputs and operation execution within the preempt-off critical
>> >> + * section.
>> >> + *
>> >> + * The reason why we require all pointer offsets to be calculated by
>> >> + * user-space beforehand is because we need to use get_user_pages_fast()
>> >> + * to first pin all pages touched by each operation. This takes care of
>> >
>> > That doesnt explain it either.
>>
>> What kind of explanation are you looking for here? Perhaps being too close
>> to the implementation prevents me from understanding what is unclear from
>> your perspective.
>
> What the heck are pointer offsets?
>
> The ops have one or two pointer(s) to a lump of memory. So if a pointer
> points to the wrong lump of memory then you're screwed, but that's true for
> all pointers handed to the kernel.

I think the sentence you suggested above is clear enough. I'll simply use
it.

>
>> Sorry, that paragraph was unclear. Updated:
>>
>> * An overall maximum of 4216 bytes is enforced on the sum of operation
>> * lengths within an operation vector, so user-space cannot generate a
>> * too long preempt-off critical section (cache-cold critical section
>> * duration measured as 4.7 µs on x86-64). Each operation is also
>> * limited to a length of PAGE_SIZE bytes,
>
> Again PAGE_SIZE is the wrong unit here. PAGE_SIZE can vary. What you want
> is a hard limit of 4K. And because there is no alignment requirement the
> rest of the sentence is stating the obvious.

I can make that a 4K limit if you prefer. This presumes that no architecture
has pages smaller than 4K, which is true on Linux.

>
>> * meaning that an operation can touch a
>> * maximum of 4 pages (memcpy: 2 pages for source, 2 pages for
>> * destination if addresses are not aligned on page boundaries).
>
> I still have to understand why the 4K copy is necessary in the first place.
>
>> > What's the critical section duration for operations which go to the limits
>> > of this on a average x86 64 machine?
>>
>> When cache-cold, I measure 4.7 µs per critical section doing a
>> 4k memcpy and 15 * 8 bytes memcpy on an E5-2630 v3 @2.4GHz. Is that an
>> acceptable preempt-off latency for RT?
>
> Depends on the use case as always ....

The use-case for 4k memcpy operation is a per-cpu ring buffer where
the rseq fast-path does the following:

- ring buffer push: in the rseq asm instruction sequence, a memcpy of a
given structure (limited to 4k in size) into a ring buffer,
followed by the final commit instruction which increments the current
position offset by the number of bytes pushed.

- ring buffer pop: in the rseq asm instruction sequence, a memcpy of
a given structure (up to 4k) from the ring buffer, at "position" offset.
The final commit instruction decrements the current position offset by
the number of bytes popped.

Having cpu_opv do a 4k memcpy allows it to handle scenarios where
rseq fails to progress.
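A plain-C sketch of that push/pop logic, with the rseq/cpu_opv machinery
elided (the copy and the commit would sit inside the rseq critical section,
or be carried out by cpu_opv with preemption off on the slow path); buffer
layout and names are mine, not the actual implementation:

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-CPU ring buffer: a flat byte array plus a current
 * position offset updated by the "commit" step. */
struct ringbuf {
	char buf[16384];
	size_t pos;	/* current position offset */
};

/* Push: memcpy a structure (up to 4k) into the buffer, then commit by
 * incrementing the position offset by the number of bytes pushed. */
static int rb_push(struct ringbuf *rb, const void *src, size_t len)
{
	if (len > 4096 || rb->pos + len > sizeof(rb->buf))
		return -1;
	memcpy(rb->buf + rb->pos, src, len);	/* the memcpy operation */
	rb->pos += len;				/* the final commit */
	return 0;
}

/* Pop: memcpy a structure (up to 4k) out from "position", then commit
 * by decrementing the position offset by the number of bytes popped. */
static int rb_pop(struct ringbuf *rb, void *dst, size_t len)
{
	if (len > 4096 || len > rb->pos)
		return -1;
	memcpy(dst, rb->buf + rb->pos - len, len);
	rb->pos -= len;				/* the final commit */
	return 0;
}
```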

Thanks,

Mathieu



>
> Thanks,
>
> tglx

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com