Re: [PATCH] cpufreq, store_scaling_governor requires policy->rwsem to be held for duration of changing governors [v2]

From: Prarit Bhargava
Date: Wed Jul 30 2014 - 10:18:36 EST




On 07/29/2014 08:03 PM, Rafael J. Wysocki wrote:
> On Tuesday, July 29, 2014 07:46:02 AM Prarit Bhargava wrote:
>> [v2]: Fixed wording of commit, and added additional testing information.
>>
>> A while ago we added a test to mimic some of our users' userspace governor
>> programs which monitor system behaviour and will switch governors on the
>> fly. The decision process for this includes looking at time of day,
>> expected system load, etc. For some time now we have had reports of
>> system panics in the cpufreq code when using the userspace governor
>> utility.
>>
>> The userspace utility writes
>> /sys/devices/system/cpu/cpuX/cpufreq/scaling_governor and sets the
>> governor. In some cases this can happen rapidly, and under heavy load
>> there are occasions where the changes are delayed. This can mean that
>> several governor changes may occur within a short period of time.
>>
>> This has exposed a bug in the store_scaling_governor path. When the sysfs
>> file is written to,
>>
>> store()
>> ->down_write(&policy->rwsem);
>> ->store_scaling_governor()
>> -> cpufreq_set_policy()
>> up_write(&policy->rwsem);
>> __cpufreq_governor(policy, CPUFREQ_GOV_POLICY_EXIT);
>> down_write(&policy->rwsem);
>>
>> The release of the policy->rwsem results in a situation where another
>> write to the scaling_governor file can now start and overwrite pointers
>> and cause corruption.
>>
>> ex)
>>
>> Oops: 0000 [#1] SMP
>> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
>> CPU: 7 PID: 565 Comm: kworker/7:2 Not tainted 3.10.0-121.el7.x86_64 #1
>> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
>> Workqueue: events od_dbs_timer
>> task: ffff88046544cfa0 ti: ffff880465438000 task.ti: ffff880465438000
>> RIP: 0010:[<ffffffff81489bed>] [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
>> RSP: 0018:ffff880465439de0 EFLAGS: 00010282
>> RAX: ffff88007f03d500 RBX: 0000000000010de0 RCX: 0000000000000006
>> RDX: 0000000000000006 RSI: 7040000000000000 RDI: ffff88046f2f0e08
>> e1000e 0000:06:00.0: irq 79 for MSI/MSI-X
>> IPv6: ADDRCONF(NETDEV_UP): enp6s0f0: link is not ready
>> RBP: ffff880465439e18 R08: ffff88046f2f0e10 R09: dfedcb48370f0e08
>> R10: dfedcb48370f0e08 R11: 0000000000000000 R12: ffff880465a69200
>> R13: 0000000000000000 R14: ffff88046f2f0e08 R15: 00000000000001c0
>> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
>> FS: 0000000000000000(0000) GS:ffff88046f2e0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000010 CR3: 0000000466be5000 CR4: 00000000000007e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Stack:
>> ffff88045ef8d140 ffff880468e1ad80 ffff88046f2f0e08 ffff880465a69200
>> ffff88046f2f3e00 ffff88046f2f7f00 00000000000001c0 ffff880465439e60
>> ffffffff8107e02b 000000006f2f3e18 0000000000000000 ffff88046f2f3e18
>> Call Trace:
>> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
>> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
>> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
>> [<ffffffff81085aef>] kthread+0xcf/0xe0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> e1000e 0000:06:00.1: irq 80 for MSI/MSI-X
>> IPv6: ADDRCONF(NETDEV_UP): enp6s0f1: link is not ready
>> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> Code: 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 8b 50 14 4c 8b 68 40 89 d1 <4d> 8b 7d 10 89 55 cc 48 03 1c cd 80 78 9e 81 48 8d 83 a8 00 00
>> RIP [<ffffffff81489bed>] od_dbs_timer+0x2d/0x160
>> RSP <ffff880465439de0>
>> CR2: 0000000000000010
>> BUG: unable to handle kernel ---[ end trace c9a1ca82e01a4294 ]---
>> Kernel panic - not syncing: Fatal exception
>> NULL pointer dereference at 0000000000000014
>> IP: [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> PGD 46661d067 PUD 465540067 PMD 0
>> Oops: 0000 [#2] SMP
>> Modules linked in: sg e1000e igb iTCO_wdt iTCO_vendor_support coretemp kvm_intel ptp kvm pps_core i7300_edac edac_core ses lpc_ich enclosure ioatdma mfd_core shpchp dca i2c_i801 serio_raw acpi_cpufreq pcspkr mperf nfsd auth_rpcgss nfs_acl lockd sunrpc xfs libcrc32c sr_mod cdrom sd_mod ata_generic crc_t10dif crct10dif_common pata_acpi radeon i2c_algo_bit drm_kms_helper ttm drm ata_piix libata i2c_core megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
>> CPU: 6 PID: 506 Comm: kworker/6:2 Tainted: G D -------------- 3.10.0-121.el7.x86_64 #1
>> Hardware name: Intel MP Server/S7000FC4UR, BIOS SFC4UR.86B.01.00.0031.021820102042 02/18/2010
>> Workqueue: events od_dbs_timer
>> task: ffff880465448b60 ti: ffff880463730000 task.ti: ffff880463730000
>> RIP: 0010:[<ffffffff81489be4>] [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> RSP: 0018:ffff880463731de0 EFLAGS: 00010282
>> RAX: 0000000000000000 RBX: 0000000000010de0 RCX: dead000000200200
>> RDX: 0000000000000001 RSI: 7040000000000000 RDI: ffff88046f2d0e08
>> RBP: ffff880463731e18 R08: ffff88046f2d0e10 R09: dfedcb50370d0e08
>> R10: dfedcb50370d0e08 R11: 0000000000000000 R12: ffff880465bd2b80
>> R13: ffff88046f2d3e00 R14: ffff88046f2d0e08 R15: 0000000000000180
>> FS: 0000000000000000(0000) GS:ffff88046f2c0000(0000) knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
>> CR2: 0000000000000014 CR3: 0000000466be5000 CR4: 00000000000007e0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
>> Stack:
>> ffff88045ef8d140 ffff880468dcf1c0 ffff88046f2d0e08 ffff880465bd2b80
>> ffff88046f2d3e00 ffff88046f2d7f00 0000000000000180 ffff880463731e60
>> ffffffff8107e02b 000000006f2d3e18 0000000000000000 ffff88046f2d3e18
>> Call Trace:
>> [<ffffffff8107e02b>] process_one_work+0x17b/0x460
>> [<ffffffff8107edfb>] worker_thread+0x11b/0x400
>> [<ffffffff8107ece0>] ? rescuer_thread+0x400/0x400
>> [<ffffffff81085aef>] kthread+0xcf/0xe0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> [<ffffffff815fc76c>] ret_from_fork+0x7c/0xb0
>> [<ffffffff81085a20>] ? kthread_create_on_node+0x140/0x140
>> Code: 1f 84 00 00 00 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 56 49 89 fe 41 55 41 54 53 48 c7 c3 e0 0d 01 00 48 83 ec 10 48 8b 47 f8 <8b> 50 14 4c 8b 68 40 89 d1 4d 8b 7d 10 89 55 cc 48 03 1c cd 80
>> RIP [<ffffffff81489be4>] od_dbs_timer+0x24/0x160
>> RSP <ffff880463731de0>
>> CR2: 0000000000000014
>> drm_kms_helper: panic occurred, switching back to text console
>>
>> There are other panics associated with this, including list corruption and
>> various flavors of panics in the od_dbs_timer & work code.
>>
>> As a test I introduced a separate mutex around the cpufreq_set_policy() call.
>> After doing so system behaves as expected, and the panics no longer happen.
>> I narrowed the mutex down to the "governor switch" code in
>> cpufreq_set_policy() and saw that everything stabilized.
>>
>> I then reverted commit 955ef4833574636819cd269cfbae12f79cbde63a, cpufreq: Drop
>> rwsem lock around CPUFREQ_GOV_POLICY_EXIT, which added the dropping of the
>> policy->rwsem and again, the panics no longer occurred.
>>
>> The closer I looked at commit 955ef483, the more questions I have. The first
>> thing is that it appears that the stacktrace includes function calls that
>> are not, and never have been, part of the linux.git tree, ie) the call trace
>> shows
>>
>> [<c0055a79>] lock_acquire+0x61/0xbc
>> [<c00fabf1>] sysfs_addrm_finish+0xc1/0x128
>> [<c00f9819>] sysfs_hash_and_remove+0x35/0x64
>> [<c00fbe6f>] remove_files.isra.0+0x1b/0x24
>> [<c00fbea5>] sysfs_remove_group+0x2d/0xa8
>> [<c02f9a0b>] cpufreq_governor_interactive+0x13b/0x35c
>> [<c02f61df>] __cpufreq_governor+0x2b/0x8c
>> [<c02f6579>] __cpufreq_set_policy+0xa9/0xf8
>> [<c02f6b75>] store_scaling_governor+0x61/0x100
>> [<c02f6f4d>] store+0x39/0x60
>> [<c00f9b81>] sysfs_write_file+0xed/0x114
>> [<c00b3fd1>] vfs_write+0x65/0xd8
>> [<c00b424b>] sys_write+0x2f/0x50
>> [<c000cdc1>] ret_fast_syscall+0x1/0x52
>>
>> and the top part of the stack from cpufreq_governor_interactive() appear to
>> be for a driver that is not in the linux.git tree. Google search does show
>> that it exists in various android trees, however, my concern is that the "core"
>> scaling governors are now broken because of an out tree driver.
>>
>> It seems like the patch fixed this problem incorrectly by breaking the existing
>> locking. The whole point of the policy->rwsem is to single thread updates
>> to the policy struct; this is no longer the case after that commit as we
>> are breaking the locking right in the middle of a policy->governor update.
>> The fix should have been to the cpufreq_governor_interactive out-of-tree
>> driver and fix the locking there in its release code. In any case, the current
>> linux.git code no longer can reproduce the original failure; the locking in the
>> sysfs release code has changed.
>>
>> In case anyone is interested in reproducing, the reproducer I used was
>>
>> i=0
>> while [ True ]; do
>> i=$((i+1))
>> echo "ondemand" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
>> echo "performance" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
>> echo "ondemand" > /sys/devices/system/cpu/cpu1/cpufreq/scaling_governor &
>> echo "performance" > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor &
>> if [ $((i % 100)) = 0 ]; then
>> num_ps=$(ps aux | grep doit | wc -l)
>> echo "$i running, $num_ps outstanding"
>> fi
>> done
>>
>> and the system usually panics within 500 iterations.
>>
>> I have tested this extensively across 15 systems of various configurations.
>> In the unpatched kernel I see oopses usually within 500 iterations. In the
>> patched kernel which fixes the broken locking, I can go several hours
>> without any issues.
>>
>> This patch effectively reverts commit 955ef483.
>
> OK, I'm convinced by this.
>
> I suppose we should push it for -stable from 3.10 through 3.15.x, right?

Rafael, I think that is a good idea. I'm not sure what the protocol is for
adding stable@xxxxxxxxxx though ...

P.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/