Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible

From: Jack Steiner
Date: Fri Mar 25 2011 - 09:13:20 EST

In reply to: Nikanth Karthikesan: "Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible"
Next in thread: Linus Torvalds: "Re: [PATCH RFC] x86: avoid atomic operation in test_and_set_bit_lock if possible"

On Fri, Mar 25, 2011 at 10:06:10AM +0000, Jan Beulich wrote:
> >>> On 24.03.11 at 18:19, Ingo Molnar <mingo@xxxxxxx> wrote:
> > * Jan Beulich <JBeulich@xxxxxxxxxx> wrote:
> >> Are you certain? Iirc the lock prefix implies minimally a read-for-
> >> ownership (if CPUs are really smart enough to optimize away the
> >> write - I wonder whether that would be correct at all when it
> >> comes to locked operations), which means a cacheline can still be
> >> bouncing heavily.
> >
> > Yeah. On what workload was this?
> >
> > Generally you use test_and_set_bit() if you expect it to be 'owned' by
> > whoever calls it, and released by someone else.
> >
> > It would be really useful to run perf top on an affected box and see which
> > kernel function causes this. It might be better to add a test_bit() to the
> > affected codepath - instead of bloating all test_and_set_bit() users.
>
> Indeed, I agree with you and Linus in this aspect.
>
> > Note that the patch can also cause overhead: the test_bit() can miss the
> > cache, it will bring in the cacheline shared, and the subsequent test_and_set()
> > call will then dirty the cacheline - so the CPU might miss again and has to wait
> > for other CPUs to first flush this cacheline.
> >
> > So we really need more details here.
>
> The problem was observed with __lock_page() (in a variant not
> upstream for reasons not known to me), and prefixing e.g.
> trylock_page() with an extra PageLocked() check yielded the
> below quoted improvements.
>
> Jack - were there any similar measurements done on upstream
> code?

Not yet, but it is high on my list to test. I suspect a similar problem exists
in upstream code. I'll post the results as soon as I have them.
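
For readers following the thread, the idea under discussion (a plain read of
the lock bit before attempting the locked read-modify-write) can be sketched
in userspace C11 as below. This is only an illustration under stated
assumptions, not the kernel patch: demo_page, DEMO_PG_LOCKED and the demo_*
helpers are hypothetical names standing in for struct page, PG_locked,
test_bit(), test_and_set_bit_lock() and trylock_page().

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a flags word such as page->flags. */
typedef struct {
    atomic_ulong flags;
} demo_page;

/* Hypothetical bit number playing the role of PG_locked. */
#define DEMO_PG_LOCKED 0UL

/* Plain load, analogous to test_bit(): no atomic read-modify-write. */
static bool demo_test_bit(unsigned long nr, atomic_ulong *addr)
{
    return (atomic_load_explicit(addr, memory_order_relaxed) >> nr) & 1UL;
}

/*
 * Atomic fetch-or with acquire ordering, analogous to
 * test_and_set_bit_lock(). Returns the previous value of the bit.
 */
static bool demo_test_and_set_bit_lock(unsigned long nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << nr;

    return (atomic_fetch_or_explicit(addr, mask, memory_order_acquire) & mask) != 0;
}

/*
 * Trylock with the extra pre-check: if the bit is already set, give up
 * after the plain load and skip the atomic operation entirely.
 */
static bool demo_trylock_checked(demo_page *p)
{
    if (demo_test_bit(DEMO_PG_LOCKED, &p->flags))
        return false;
    return !demo_test_and_set_bit_lock(DEMO_PG_LOCKED, &p->flags);
}

/* Clear the bit with release ordering so the lock can be taken again. */
static void demo_unlock(demo_page *p)
{
    atomic_fetch_and_explicit(&p->flags, ~(1UL << DEMO_PG_LOCKED),
                              memory_order_release);
}
```

The pre-check only helps when the bit is usually found set. As noted earlier
in the thread, in the uncontended case the plain load can pull the cacheline
in shared state and the following atomic operation then has to upgrade it, so
the extra test is a trade-off rather than a guaranteed win.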

>
> Jan
>
>
> **** Quoting Jack Steiner <steiner@xxxxxxx> ****
>
> The following tests were run on UVSW :
> 768p Westmere
> 128 nodes
>
>
> Boot times - greater than 2X reduction in boot time:
> 2286s PTF #8
> 1899s PTF #8
> 975s new algorithm
> 962s new algorithm
>
> Boot messages referring to udev timeouts - eliminated:
> (After the udevadm settle timeout, the events queue contains):
>
> 7174 PTF #8
> 9435 PTF #8
> 0 new algorithm
> 0 new algorithm
>
> AIM7 results - no difference at low numbers of tasks. Improvements at high counts:
> Jobs/Min at 2000 users
> 5100 PTF #8
> 17750 new algorithm
>
> Wallclock seconds to run test at 2000 users
> 2250s PTF #8
> 650s new algorithm
>
> CPU Seconds at 2000 users
> 1300000 PTF #8
> 14000 new algorithm
>
>
> Test of large parallel app faulting for text.
>
> Text resident in page cache (10000 pages):
>     REAL       USER          SYS
>     22.830s    23m5.567s     85m59.042s     PTF #8 run1
>     26.267s    34m3.536s     104m20.035s    PTF #8 run2
>     10.890s    19m27.305s    39m50.949s     new algorithm run1
>     10.860s    20m42.698s    40m48.889s     new algorithm run2
>
> Text on Disk (1000 pages)
>     REAL       USER          SYS
>     31.658s    9m25.379s     71m11.967s     PTF #8
>     24.348s    6m15.323s     45m27.578s     new algorithm
>
> _________________________________________________________________________________
> The following tests were run on UV48:
> 4 racks
> 256 sockets
> 2452p westmere
>
> Boot time:
> 4562 sec PTF #8
> 1965 sec new
>
> MPI "helloworld" with 1024 ranks
> 35 sec PTF #8
> 22 sec new
>
>
> Test of large parallel app faulting for text.
> Text resident in page cache (10000 pages):
>     REAL       USER       SYS
>     46.394s    141m19s    366m53s    PTF #8
>     38.986s    137m36s    264m52s    PTF #8
>      7.987s    34m50s     42m36s     new algorithm
>     10.550s    43m31s     59m45s     new algorithm
>
>
> AIM7 Results (this is the original AIM7 - not the recent opensource version)
> ------------------------------
> Jobs/Min
> TASKS        PTF #8          new
>     1         487.8        486.6
>    10        4405.8       4940.6
>   100       18570.5      18198.9
>  1000       17262.3      17167.1
>  2000        4879.3      18163.9
>  4000            **      18846.2
> ------------------------------
> Real Seconds
> TASKS        PTF #8          new
>     1          11.9         12.0
>    10          13.2         11.8
>   100          31.3         32.0
>  1000         337.2        339.0
>  2000        2385.6        640.8
>  4000            **       1235.3
> ------------------------------
> CPU Seconds
> TASKS        PTF #8          new
>     1           1.6          1.6
>    10          11.5         12.9
>   100         132.2        137.2
>  1000        4486.5       6586.3
>  2000     1758419.7      27845.7
>  4000            **      65619.5
>
> ** Timed out
>
