Re: [Question]: try to fix contention between expire_timers and try_to_del_timer_sync

From: qiaozhou
Date: Tue Aug 01 2017 - 03:37:39 EST




On 2017-07-31 19:20, qiaozhou wrote:


On 2017-07-29 03:09, Vikram Mulukutla wrote:
On 2017-07-28 02:28, Will Deacon wrote:
On Thu, Jul 27, 2017 at 06:10:34PM -0700, Vikram Mulukutla wrote:

<snip>

Does bodging cpu_relax to back-off to wfe after a while help? The event
stream will wake it up if nothing else does. Nasty patch below, but I'd be
interested to know whether or not it helps.

Will
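
For readers without the snipped patch, the back-off idea described above (spin for a while, then fall back to a low-power wait that the event stream will eventually end) might look roughly like the userspace C sketch below. This is not Will's patch. `relax_state`, `relax_backoff()`, `RELAX_SPIN_LIMIT` and its value of 1000 are hypothetical names chosen for illustration, and on non-arm64 builds the `wfe` and `yield` instructions are replaced by a plain compiler barrier.

```c
/* Hypothetical threshold: how many spins before backing off. */
#define RELAX_SPIN_LIMIT 1000u

/* Per-waiter bookkeeping (hypothetical, not a kernel structure). */
struct relax_state {
	unsigned int spins;             /* spins since the last back-off */
	unsigned long low_power_waits;  /* how often we fell back to wfe */
};

/* Low-power wait: wfe on arm64, a compiler barrier elsewhere. */
static inline void low_power_wait(void)
{
#if defined(__aarch64__)
	__asm__ volatile("wfe" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* Ordinary spin hint: yield on arm64, a compiler barrier elsewhere. */
static inline void spin_hint(void)
{
#if defined(__aarch64__)
	__asm__ volatile("yield" ::: "memory");
#else
	__asm__ volatile("" ::: "memory");
#endif
}

/* Call once per failed attempt; backs off after RELAX_SPIN_LIMIT spins. */
static inline void relax_backoff(struct relax_state *st)
{
	if (++st->spins > RELAX_SPIN_LIMIT) {
		st->spins = 0;
		st->low_power_waits++;
		low_power_wait();
	} else {
		spin_hint();
	}
}
```

A waiter would call `relax_backoff(&st)` once per failed lock attempt and clear `st.spins` after it finally gets the lock.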

The patch also helps a lot on my platform. (It does cause a deadlock, related to udelay, in the UART driver during early boot. I'm not sure whether that is a UART driver issue, so I have just worked around it for now.)

Platform: 4 a53(832MHz) + 4 a73(1.8GHz)
Test condition #1:
a. core2: a53, while loop (spinlock, spin_unlock)
b. core7: a73, while loop (spinlock, spin_unlock, cpu_relax)

Test result: number of lock acquisitions (a53, a73) and max lock acquire time (a53), recorded over 20 seconds
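
The measurement described above can be approximated in userspace with the sketch below. It is only an analogue of the kernel test, under several assumptions: it uses a POSIX `pthread_spinlock_t` rather than the kernel spinlock, it cannot control core frequency, and `struct worker`, `worker_fn()` and `run_contention_test()` are hypothetical names, not the code that produced the numbers in this mail.

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

struct worker {
	int cpu;               /* CPU to pin to, or -1 to leave unpinned */
	uint64_t acquired;     /* number of times the lock was taken */
	uint64_t max_wait_ns;  /* longest wait for the lock, in ns */
	pthread_t tid;
};

static pthread_spinlock_t test_lock;
static atomic_bool stop_flag;

static uint64_t now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static void *worker_fn(void *arg)
{
	struct worker *w = arg;

	if (w->cpu >= 0) {
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(w->cpu, &set);
		/* Best effort: the result is ignored if pinning fails. */
		pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
	}

	while (!atomic_load_explicit(&stop_flag, memory_order_relaxed)) {
		uint64_t t0 = now_ns();

		pthread_spin_lock(&test_lock);
		uint64_t waited = now_ns() - t0;
		pthread_spin_unlock(&test_lock);

		w->acquired++;
		if (waited > w->max_wait_ns)
			w->max_wait_ns = waited;
	}
	return NULL;
}

/* Run two contending workers for duration_ms; returns 0 on success. */
static int run_contention_test(struct worker *a, struct worker *b,
			       unsigned int duration_ms)
{
	struct timespec d = {
		.tv_sec = duration_ms / 1000,
		.tv_nsec = (long)(duration_ms % 1000) * 1000000L,
	};

	a->acquired = b->acquired = 0;
	a->max_wait_ns = b->max_wait_ns = 0;
	atomic_store(&stop_flag, false);

	if (pthread_spin_init(&test_lock, PTHREAD_PROCESS_PRIVATE))
		return -1;
	if (pthread_create(&a->tid, NULL, worker_fn, a)) {
		pthread_spin_destroy(&test_lock);
		return -1;
	}
	if (pthread_create(&b->tid, NULL, worker_fn, b)) {
		atomic_store(&stop_flag, true);
		pthread_join(a->tid, NULL);
		pthread_spin_destroy(&test_lock);
		return -1;
	}

	nanosleep(&d, NULL);
	atomic_store(&stop_flag, true);
	pthread_join(a->tid, NULL);
	pthread_join(b->tid, NULL);
	pthread_spin_destroy(&test_lock);
	return 0;
}
```

To mirror test condition #1, one worker would be pinned to core2 and the other to core7, with a 20000 ms duration; afterwards `acquired` and `max_wait_ns` correspond to the "locked times" and "max locked time" columns.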

Without cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
|              182|          38371616|               1,951,954|
|              202|          38427652|               2,261,319|
|              210|          38477427|              15,309,597|
|              207|          38494479|               6,656,453|
|              220|          38422283|               2,064,155|
===============================================================

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
|          1849898|          37799379|                 131,255|
|          1574172|          38557653|                  38,410|
|          1924777|          37831725|                  42,999|
|          1477665|          38723741|                  52,087|
|          1865793|          38007741|                 783,965|
===============================================================

I also added some workload to the whole system to check the result.
Test condition #2: based on #1
c. core6: a73, 1.8GHz, run "while(1);" loop

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
|               20|          42563981|               2,317,070|
|               10|          42652793|               4,210,944|
|                9|          42651075|               5,691,834|
|               28|          42652591|               4,539,555|
|               10|          42652801|               5,850,639|
===============================================================

I also hotplugged out the other cores.
Test condition #3: based on #1
d. hotplug out core1/3/4/5/6, keep core0 for scheduling

With cpu_relax bodging patch:
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
|              447|          42652450|                 309,549|
|              515|          42650382|                 337,661|
|              415|          42646669|                 628,525|
|              431|          42651137|                 365,862|
|              464|          42648916|                 379,934|
===============================================================

The last two tests are the actual cases where the hard lockup is triggered on my platform. So I gathered some data, and it shows that the a53 needs much longer to acquire the lock.

All tests were done on Android, with the screen off and a USB cable attached. The data is not as clean as Vikram's. That might be related to CPU topology, core count, CCI frequency, etc. (I'll do another test with both a53 and a73 running at 1.2GHz, to check whether the core frequency is what causes the major difference.)

Next, I tested the contention with the a53 and a73 cores running at the same frequency.
Platform: 4 a53(1248MHz) + 4 a73(1248MHz)
Test condition #4:
a. core2: a53, while loop (spinlock, spin_unlock)
b. core7: a73, while loop (spinlock, spin_unlock)
===============================================================
|a53 locked times | a73 locked times | a53 max locked time(us)|
==================|==================|========================|
|         12945632|          13021576|                      14|
|         12934181|          13059230|                      16|
|         12987186|          13059016|                      49|
|         12958583|          13038884|                      24|
|         14637546|          14672522|                      14|
===============================================================

The lock acquisition counts are almost the same, and the max time to acquire the lock on the a53 also drops. On my platform, core frequency seems to be the key factor.

This does seem to help. Here's some data after 5 runs with and without the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
  117893us|       2349144|        2us|       6748236|
  571260us|       2125651|        2us|       7643264|
   19780us|       2392770|        2us|       5987203|
   19948us|       2395413|        2us|       5977286|
   19822us|       2429619|        2us|       5768252|
   19888us|       2444940|        2us|       5675657|
=====================================================

cpu0: little cpu @ 300MHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
       3us|       2737438|        2us|       6907147|
       2us|       2742478|        2us|       6902241|
     132us|       2745636|        2us|       6876485|
       3us|       2744554|        2us|       6898048|
       3us|       2741391|        2us|       6882901|
=====================================================

The patch also seems to have helped with fairness in general
allowing more work to be done if the CPU frequencies are more
closely matched (I don't know if this translates to real world
performance - probably not). The counter values are higher
with the patch.

time = max time taken to acquire lock
counter = number of times lock acquired

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
Without the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
       2us|       5240654|        1us|       5339009|
       2us|       5287797|       97us|       5327073|
       2us|       5237634|        1us|       5334694|
       2us|       5236676|       88us|       5333582|
      84us|       5285880|       84us|       5329489|
=====================================================

cpu0: little cpu @ 1.5GHz, cpu4: Big cpu @2.0GHz
With the cpu_relax() bodging patch:
=====================================================
cpu0 time | cpu0 counter | cpu4 time | cpu4 counter |
==========|==============|===========|==============|
     140us|      10449121|        1us|      11154596|
       1us|      10757081|        1us|      11479395|
      83us|      10237109|        1us|      10902557|
       2us|       9871101|        1us|      10514313|
       2us|       9758763|        1us|      10391849|
=====================================================

I also applied Vikram's patch and ran a test.

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
Without cpu_relax bodging patch
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
     16505|          5243|          2|      12487322|
     16494|          5619|          1|      12013291|
     16498|          5276|          2|      11706824|
     16494|          7123|          1|      12532355|
     16470|          7208|          2|      11784617|
=====================================================

cpu2: a53, 832MHz, cpu7: a73, 1.75GHz
With cpu_relax bodging patch:
=====================================================
cpu2 time | cpu2 counter | cpu7 time | cpu7 counter |
==========|==============|===========|==============|
      3991|        140714|          1|      11430528|
      4018|        144371|          1|      11430528|
      4034|        143250|          1|      11427011|
      4330|        147345|          1|      11423583|
      4752|        138273|          1|      11433241|
=====================================================

It shows some improvement, but the results are not as good as Vikram's data. The big core still has a much better chance of acquiring the lock.

Thanks,
Vikram