Re: [RFC PATCH 0/4] Scheduler time slice extension
From: Prakash Sangappa
Date: Mon Feb 03 2025 - 22:06:24 EST
> On Dec 16, 2024, at 10:59 AM, Prakash Sangappa <prakash.sangappa@xxxxxxxxxx> wrote:
>
>
>
>> On Dec 9, 2024, at 1:17 PM, Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>
>> On 2024-12-09 15:36, Prakash Sangappa wrote:
>>>> On Nov 14, 2024, at 11:41 AM, Prakash Sangappa <prakash.sangappa@xxxxxxxxxx> wrote:
>>>>
>>>>
>>>>
>>>>> On Nov 14, 2024, at 2:28 AM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
>>>>>
>>>>> On Wed, Nov 13, 2024 at 08:10:52PM +0000, Prakash Sangappa wrote:
>>>>>>
>>>>>>
>>>>>>> On Nov 13, 2024, at 11:36 AM, Mathieu Desnoyers <mathieu.desnoyers@xxxxxxxxxxxx> wrote:
>>>>>>>
>>>>>>> On 2024-11-13 13:50, Peter Zijlstra wrote:
>>>>>>>> On Wed, Nov 13, 2024 at 12:01:22AM +0000, Prakash Sangappa wrote:
>>>>>>>>> This patch set implements the above mentioned 50us extension time as posted
>>>>>>>>> by Peter. But instead of using restartable sequences as API to set the flag
>>>>>>>>> to request the extension, this patch proposes a new API with use of a per
>>>>>>>>> thread shared structure implementation described below. This shared structure
>>>>>>>>> is accessible in both user space and kernel. The user thread will set the
>>>>>>>>> flag in this shared structure to request execution time extension.
>>>>>>>> But why -- we already have rseq, glibc uses it by default. Why add yet
>>>>>>>> another thing?
>>>>>>>
>>>>>>> Indeed, what I'm not seeing in this RFC patch series cover letter is an
>>>>>>> explanation that justifies adding yet another per-thread memory area
>>>>>>> shared between kernel and userspace when we have extensible rseq
>>>>>>> already.
>>>>>>
>>>>>> It mainly provides pinned memory, can be useful for future use cases
>>>>>> where updating user memory in kernel context can be fast or needs to
>>>>>> avoid pagefaults.
>>>>>
>>>>> 'might be useful' is not good enough a justification. Also, I don't
>>>>> think you actually need this.
>>>>
>>>> Will get back with database benchmark results using rseq API for scheduler time extension.
>>> Sorry about the delay in response.
>>> Here are the database swingbench numbers - includes results with use of rseq API.
>>> Test results:
>>> =========
>>> Test system 2 socket AMD Genoa
>>> Swingbench - standard database benchmark
>>> Cached(database files on tmpfs) run, with 1000 clients.
>>> Baseline(Without Sched time extension): 99K SQL exec/sec
>>> With Sched time extension:
>>> Shared structure API use: 153K SQL exec/sec (Previously reported)
>>> 55% improvement in throughput.
>>> Restartable sequences API use: 147K SQL exec/sec
>>> 48% improvement in throughput
>>> While both show good performance benefit with scheduler time extension,
>>> there is a 7% difference in throughput between Shared structure & Restartable sequences API.
>>> Use of shared structure is faster.
>>
>> Can you share the code for both test cases ? And do you have relevant
>> perf profile showing where time is spent ?
>>
>> Thanks,
>>
>> Mathieu
>>
>
> The changes are in the database(Oracle DB).
> The test is swingbench. https://www.dominicgiles.com/downloads/
>
> Our database team is running the benchmark. I have requested them to repeat the test and
> capture perf profile.
>
Here is an update on the database testing that compares the performance of the two APIs,
shared structure and rseq.
The workload shows variation between runs. Repeating the tests a number of times shows that
performance on average is similar with either API used to request the scheduler time extension.
Our database team shared the following results from the Locktable database workload, which
simulates SQL query execution. The table shows the min, max, median and average of 5 runs with
scheduler time extension enabled and disabled.
Locktable test (query executions per second):

Run                      Min          Max          Median       Average
===========================================================================
Disabled                 162,911.10   187,543.40   177,035.30   175,675.24 (Baseline)
Enabled (Shared struct)  167,896.20   237,709.50   191,269.50   195,113.80 (+11%)
Enabled (Rseq)           166,410.40   251,349.30   191,917.60   199,740.40 (+13%)
In this case Rseq shows better performance.
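
For reference, the percentages in the table are gains in the average over the
"Disabled" average. Below is a minimal sketch of that arithmetic. It is not part
of the benchmark; Python is used only for illustration, and the names AVERAGES,
gain_percent and gains_over_baseline are hypothetical. The only data are the
three averages copied from the table.

```python
# Averages (query executions per second) copied from the Locktable table.
AVERAGES = {
    "Disabled": 175_675.24,
    "Enabled (Shared struct)": 195_113.80,
    "Enabled (Rseq)": 199_740.40,
}


def gain_percent(value: float, baseline: float) -> float:
    """Relative change of `value` over `baseline`, in percent."""
    return (value / baseline - 1.0) * 100.0


def gains_over_baseline(averages: dict, baseline_key: str = "Disabled") -> dict:
    """Gain of every non-baseline row over the baseline row, in percent."""
    baseline = averages[baseline_key]
    return {
        name: gain_percent(avg, baseline)
        for name, avg in averages.items()
        if name != baseline_key
    }


if __name__ == "__main__":
    # Print one line per enabled configuration, with one decimal place.
    for name, gain in gains_over_baseline(AVERAGES).items():
        print(f"{name}: {gain:+.1f}%")
```

Computed this way, the two enabled averages are within roughly 2.4% of each
other, which is small compared to the min-to-max spread of the individual runs.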
The conclusion seems to be both APIs perform similarly.
How do we proceed?
Our database team wants the scheduler time extension feature integrated.
We are not particular about using the shared structure API for this.
I can submit the rseq-based patch that PeterZ proposed. The copy_from/to_user() calls
can be modified to use the nofault versions.
> I was trying to reproduce the performance difference using a microbenchmark.
> Used a modified version of the test (extend-sched.c) that Steven Rostedt posted here
> https://lore.kernel.org/lkml/20231025054219.1acaa3dd@xxxxxxxxxxxxxxxxxx/
...
>
> The test increments a count in the shared data object indicating the number of times it could complete
> the critical section. A higher number is better.
>
> Running this test on a 50-core VM shows a lot of variance in the results. Generally we see a
> performance improvement of 4 to 6% with use of either API in this test.
> Unfortunately, it does not show a consistent difference between the two APIs.
...
-Prakash