Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

From: Jan Beulich
Date: Fri Mar 25 2011 - 06:05:24 EST


>>> On 24.03.11 at 18:19, Ingo Molnar <mingo@xxxxxxx> wrote:
> * Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
>> Are you certain? Iirc the lock prefix implies minimally a read-for-
>> ownership (if CPUs are really smart enough to optimize away the
>> write - I wonder whether that would be correct at all when it
>> comes to locked operations), which means a cacheline can still be
>> bouncing heavily.
>
> Yeah. On what workload was this?
>
> Generally you use test_and_set_bit() if you expect it to be 'owned' by
> whoever calls it, and released by someone else.
>
> It would be really useful to run perf top on an affected box and see which
> kernel function causes this. It might be better to add a test_bit() to the
> affected codepath - instead of bloating all test_and_set_bit() users.

Indeed, I agree with you and Linus on this point.

> Note that the patch can also cause overhead: the test_bit() can miss the
> cache, it will bring in the cacheline shared, and the subsequent test_and_set()
> call will then dirty the cacheline - so the CPU might miss again and has to wait
> for other CPUs to first flush this cacheline.
>
> So we really need more details here.

The problem was observed with __lock_page() (in a variant that is
not upstream, for reasons unknown to me), and prefixing e.g.
trylock_page() with an extra PageLocked() check yielded the
improvements quoted below.
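
For illustration, the "check before the locked operation" idea can be sketched in user space with C11 atomics. This is only a sketch under stated assumptions, not the actual patch and not kernel code: the demo_* names and DEMO_LOCK_BIT are made up here, and atomic_fetch_or() merely stands in for the locked bit-test-and-set that test_and_set_bit_lock() performs on x86.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the lock bit in a page's flags word. */
#define DEMO_LOCK_BIT 0UL

/* Plain load: can be served from a cache line held in shared state. */
static inline bool demo_test_bit(unsigned long nr, _Atomic unsigned long *addr)
{
    return (atomic_load_explicit(addr, memory_order_relaxed) >> nr) & 1UL;
}

/* Atomic read-modify-write with acquire ordering; returns the old bit. */
static inline bool demo_test_and_set_bit_lock(unsigned long nr,
                                              _Atomic unsigned long *addr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or_explicit(addr, mask, memory_order_acquire) & mask) != 0;
}

/* Returns true if the caller obtained the lock. */
static inline bool demo_trylock(_Atomic unsigned long *flags)
{
    /* Already locked: give up without issuing the locked operation. */
    if (demo_test_bit(DEMO_LOCK_BIT, flags))
        return false;

    /* Bit looked clear: the atomic operation still decides who wins. */
    return !demo_test_and_set_bit_lock(DEMO_LOCK_BIT, flags);
}

/* Small self-check: the first attempt succeeds, the second one fails. */
static inline void demo_self_check(void)
{
    _Atomic unsigned long flags = 0;

    assert(demo_trylock(&flags));
    assert(!demo_trylock(&flags));
}
```

The plain read is only a hint, so the atomic operation remains the arbiter. Whether the extra read helps or hurts depends on how often the bit is found set, which is exactly the trade-off Ingo describes above.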

Jack - were there any similar measurements done on upstream
code?

Jan


**** Quoting Jack Steiner <steiner@xxxxxxx> ****

The following tests were run on UVSW:
768p Westmere
128 nodes


Boot times - greater than 2X reduction in boot time:
2286s PTF #8
1899s PTF #8
975s new algorithm
962s new algorithm

Boot messages referring to udev timeouts - eliminated:
(After the udevadm settle timeout, the events queue contains):

7174 PTF #8
9435 PTF #8
0 new algorithm
0 new algorithm

AIM7 results - no difference at low numbers of tasks. Improvements at high counts:
Jobs/Min at 2000 users
5100 PTF #8
17750 new algorithm

Wallclock seconds to run test at 2000 users
2250s PTF #8
650s new algorithm

CPU Seconds at 2000 users
1300000 PTF #8
14000 new algorithm


Test of large parallel app faulting for text.

Text resident in page cache (10000 pages):
REAL      USER         SYS
22.830s   23m5.567s    85m59.042s    PTF #8 run1
26.267s   34m3.536s    104m20.035s   PTF #8 run2
10.890s   19m27.305s   39m50.949s    new algorithm run1
10.860s   20m42.698s   40m48.889s    new algorithm run2

Text on Disk (1000 pages)
REAL      USER         SYS
31.658s   9m25.379s    71m11.967s    PTF #8
24.348s   6m15.323s    45m27.578s    new algorithm

_________________________________________________________________________________
The following tests were run on UV48:
4 racks
256 sockets
2452p Westmere

Boot time:
4562 sec PTF #8
1965 sec new

MPI "helloworld" with 1024 ranks
35 sec PTF #8
22 sec new


Test of large parallel app faulting for text.
Text resident in page cache (10000 pages):
REAL      USER       SYS
46.394s   141m19s    366m53s   PTF #8
38.986s   137m36s    264m52s   PTF #8
7.987s    34m50s     42m36s    new algorithm
10.550s   43m31s     59m45s    new algorithm


AIM7 Results (this is the original AIM7 - not the recent opensource version)
------------------------------
Jobs/Min
TASKS       PTF #8         new
    1        487.8       486.6
   10       4405.8      4940.6
  100      18570.5     18198.9
 1000      17262.3     17167.1
 2000       4879.3     18163.9
 4000           **     18846.2
------------------------------
Real Seconds
TASKS       PTF #8         new
    1         11.9        12.0
   10         13.2        11.8
  100         31.3        32.0
 1000        337.2       339.0
 2000       2385.6       640.8
 4000           **      1235.3
------------------------------
CPU Seconds
TASKS       PTF #8         new
    1          1.6         1.6
   10         11.5        12.9
  100        132.2       137.2
 1000       4486.5      6586.3
 2000    1758419.7     27845.7
 4000           **     65619.5

** Timed out

