Re: [PATCHv4 26/28] x86/vdso: Align VDSO functions by CPU L1 cache line

From: Andrei Vagin
Date: Sun Jun 23 2019 - 01:35:05 EST


On Fri, Jun 14, 2019 at 04:13:31PM +0200, Thomas Gleixner wrote:
> On Wed, 12 Jun 2019, Dmitry Safonov wrote:
>
> > From: Andrei Vagin <avagin@xxxxxxxxx>
> >
> > After performance testing of the VDSO patches, a noticeable 20%
> > regression was found on the gettime_perf selftest with a cold cache.
> > As it turns out, before the introduction of time namespaces the VDSO
> > functions happened to be aligned to cache lines, but adding new code
> > to adjust the timens offset inside a namespace created a small shift,
> > and the vdso functions became unaligned on cache lines.
> >
> > Align the vdso functions with a gcc option to fix the performance
> > drop.
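
To make the mechanism concrete: the fix is only a compiler knob
(something along the lines of gcc's -falign-functions=<L1 line size>
in the vdso Makefile). A minimal standalone sketch of the effect,
assuming a 64-byte line; the per-function attribute below is just a
stand-in for the flag the patch actually uses:

#include <stdio.h>
#include <stdint.h>

/* Stand-in for a vdso function; the series aligns the real ones
 * with a compiler flag rather than per-function attributes. */
__attribute__((aligned(64)))
static uint64_t dummy_gettime(void)
{
	return 42;
}

int main(void)
{
	/* The entry point now sits on a cache-line boundary, so a
	 * cold-cache call touches the minimum number of lines. */
	printf("entry %% 64 = %lu\n",
	       (unsigned long)((uintptr_t)dummy_gettime % 64));
	return 0;
}
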
> >
> > Copying the resulting numbers from the cover letter:
> >
> > Hot CPU cache (more gettime_perf.c cycles - the better):
> >          |   before  | CONFIG_TIME_NS=n |    host   | inside timens
> > ---------|-----------|------------------|-----------|--------------
> > cycles   | 139887013 |    139453003     | 139899785 |   128792458
> > diff (%) |    100    |       99.7       |    100    |      92
>
> Why is CONFIG_TIME_NS=n behaving worse than current mainline and
> worse than 'host' mode?

We should have specified the precision of these numbers; it is larger
than this 0.3%, so at the time I decided there was nothing to worry
about. I did those measurements a few months ago for the second
version of this series. I have repeated the measurements for this set
of patches:

        |   before  | CONFIG_TIME_NS=n |    host   | inside timens
--------|-----------|------------------|-----------|--------------
        | 144645498 |    142916801     | 140364862 |   132378440
        | 143440633 |    141545739     | 140540053 |   132714190
        | 144876395 |    144650599     | 140026814 |   131843318
        | 143984551 |    144595770     | 140359260 |   131683544
        | 144875682 |    143799788     | 140692618 |   131300332
--------|-----------|------------------|-----------|--------------
avg     | 144364551 |    143501739     | 140396721 |   131983964
diff %  |    100    |       99.4       |    97.2   |     91.4
--------|-----------|------------------|-----------|--------------
stdev % |    0.4    |       0.9        |    0.1    |     0.4
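
The summary rows are plain arithmetic over the five runs. For
reference, a sketch of how avg, diff and stdev above are derived
(population stdev, diff relative to the average of the "before"
column; build with -lm):

#include <stdio.h>
#include <math.h>

/* Print avg, diff % (vs. base) and stdev % for one column. */
static void summarize(const char *name, const double *v, int n,
		      double base)
{
	double sum = 0, sq = 0, avg, stdev;
	int i;

	for (i = 0; i < n; i++)
		sum += v[i];
	avg = sum / n;
	for (i = 0; i < n; i++)
		sq += (v[i] - avg) * (v[i] - avg);
	stdev = sqrt(sq / n);
	printf("%-15s avg %.0f diff %.1f%% stdev %.1f%%\n",
	       name, avg, 100.0 * avg / base, 100.0 * stdev / avg);
}

int main(void)
{
	double before[] = { 144645498, 143440633, 144876395,
			    143984551, 144875682 };
	double timens[] = { 132378440, 132714190, 131843318,
			    131683544, 131300332 };
	double base = 144364551;	/* avg of the "before" column */

	summarize("before", before, 5, base);
	summarize("inside timens", timens, 5, base);
	return 0;
}
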

>
> > Cold cache (fewer tsc ticks per gettime_perf_cold.c iteration - the better):
> >          | before | CONFIG_TIME_NS=n |  host | inside timens
> > ---------|--------|------------------|-------|--------------
> > tsc      |  6748  |       6718       |  6862 |     12682
> > diff (%) |   100  |       99.6       | 101.7 |      188
>
> Weird, now CONFIG_TIME_NS=n is better than current mainline and 'host' mode
> drops.

The precision of these numbers is much lower than that of the previous
set. Those numbers were for the second version of this series, so I
decided to repeat the measurements for this version. When I ran the
test, I found that there was some degradation compared with v5.0. I
bisected it and found that the problem is in 2b539aefe9e4
("mm/resource: Let walk_system_ram_range() search child resources").
At this point, I realized that my test wasn't quite right. On each
iteration, the test starts a new process, then does start=rdtsc();
clock_gettime(); end=rdtsc(); and prints (end-start). The problem here
is that when clock_gettime() is called for the first time, the vdso
pages are not yet mapped into the process address space, so the test
actually measures how fast the vdso pages are mapped in. I modified
the test; it now uses the clflush instruction to drop the relevant
lines from the CPU caches. Here are the results:

           | before | CONFIG_TIME_NS=n |  host | inside timens
-----------|--------|------------------|-------|--------------
tsc        |   434  |        433       |   437 |      477
stdev(tsc) |    5   |         5        |     5 |        3
diff (%)   |   100  |       99.8       | 100.7 |    109.9

Here is the source code for the modified test:
https://github.com/avagin/linux-task-diag/blob/wip/timens-rfc-v4/tools/testing/selftests/timens/gettime_perf_cold.c
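
The core of the measurement looks roughly like the sketch below
(simplified, not the exact source: taking the address of
clock_gettime here goes through the libc wrapper, while the real test
flushes the vdso mapping itself, and the 4K flush span is an
assumption):

#include <stdio.h>
#include <time.h>
#include <x86intrin.h>		/* __rdtsc(), _mm_clflush(), _mm_mfence() */

#define CACHE_LINE 64
#define FLUSH_SPAN 4096		/* assumption: flush one page of code */

int main(void)
{
	struct timespec ts;
	unsigned long long start, end;
	char *p = (char *)clock_gettime;	/* libc entry, see above */
	int i;

	/* Warm-up call so the vdso pages are faulted in first;
	 * otherwise we measure page faults, not cache misses. */
	clock_gettime(CLOCK_MONOTONIC, &ts);

	/* Evict the code from the CPU caches. */
	for (i = 0; i < FLUSH_SPAN; i += CACHE_LINE)
		_mm_clflush(p + i);
	_mm_mfence();

	start = __rdtsc();
	clock_gettime(CLOCK_MONOTONIC, &ts);
	end = __rdtsc();

	printf("%llu\n", end - start);
	return 0;
}
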

This test does 10K iterations. At first glance, the numbers look
noisy, so I sorted them and kept only the 8K numbers in the middle:

$ ./gettime_perf_cold > raw
$ cat raw | sort -n | tail -n 9000 | head -n 8000 > results

>
> Either I'm misreading the numbers or missing something or I'm just confused
> as usual :)
>
> Thanks,
>    tglx