Re: Possible race at user mode helper in request_firmware
From: Ming Lei
Date: Wed Mar 16 2016 - 06:25:45 EST
On Wed, 16 Mar 2016 15:21:24 +0530
Chandra Sekhar Lingutla <clingutla@xxxxxxxxxxxxxx> wrote:
> Hi Ming,
>
> On 03/16/2016 02:34 PM, Chandra Sekhar Lingutla wrote:
> > On 03/16/2016 02:28 PM, Ming Lei wrote:
> >> On Wed, Mar 16, 2016 at 4:48 PM, Chandra Sekhar Lingutla
> >> <clingutla@xxxxxxxxxxxxxx> wrote:
> >>> On 03/16/2016 02:12 PM, Ming Lei wrote:
> >>>>
> >>>> On Wed, Mar 16, 2016 at 3:02 PM, Chandra Sekhar Lingutla
> >>>> <clingutla@xxxxxxxxxxxxxx> wrote:
> >>>>>
> >>>>> Hi Ming,
> >>>>>
> >>>>>
> >>>>> On 03/16/2016 07:36 AM, Ming Lei wrote:
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Mar 15, 2016 at 11:09 PM, Ming Lei <ming.lei@xxxxxxxxxxxxx>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On Mon, Mar 14, 2016 at 8:00 PM, <clingutla@xxxxxxxxxxxxxx> wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Hello,
> >>>>>>>>
> >>>>>>>> I see possible race in _request_firmware_load function, on which I
> >>>>>>>> wanted to
> >>>>>>>> take your opinion.
> >>>>>>>>
> >>>>>>>> When system is going to low power mode, device_cache_fw_images() is
> >>>>>>>> called
> >>>>>>>> from pm notifier which schedules asynchronous workers to cache devices
> >>>>>>>> firmware.
> >>>>>>>> If more than 2 async requests falls under user helper mode, then there
> >>>>>>>> is a
> >>>>>>>> use after free of "firmware" directory kobject.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> If two requests are for getting same firmware image, only one can be
> >>>>>>> sent
> >>>>>>> out and be completed with user helper, please see
> >>>>>>> fw_lookup_and_allocate_buf(),
> >>>>>>> so the race shouldn't have been triggered in this case.
> >>>>>>>
> >>>>>>> So I guess your test must be involved with at least two different
> >>>>>>> requests.
> >>>>>>>
> >>>>>>>>
> >>>>>>>> The race will hit, when one asynchronous worker calls device_del()
> >>>>>>>> while
> >>>>>>>> other asynchronous worker calls get_device_parent() in
> >>>>>>>> _request_firmware_load().
> >>>>>>>> After loading firmware, first worker calls device_del(), and
> >>>>>>>> "firmware"
> >>>>>>>> directory ref count is 1, so cleanup_glue_dir() calls kobject_put()
> >>>>>>>> with
> >>>>>>>> gdp_mutex lock held.
> >>>>>>>> Meanwhile, second async worker calls device_add(), and
> >>>>>>>> get_device_parent()
> >>>>>>>> is still able to find the firmware directory kobject in ksets after
> >>>>>>>> getting
> >>>>>>>> gdp_mutex.
> >>>>>>>> This kobject gets added as parent to the second device kobject.
> >>>>>>>> By the time of accessing the parent kobject, its ref count was 0 and
> >>>>>>>> prints
> >>>>>>>> below warning stack followed by crash.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> The trick here is just that we don't use 'struct device' in
> >>>>>>> device_cache_fw_images(),
> >>>>>>> then the firmware device's parent is null during calling device_add()
> >>>>>>> from
> >>>>>>> _request_firmware_load(), but this should be allowed by driver core.
> >>>>>>>
> >>>>>>> From firmware class view, both device_add()/device_del() is run with
> >>>>>>> pair,
> >>>>>>> and driver core still complains during get_device_parent(), so
> >>>>>>> something
> >>>>>>> inside driver core should be be wrong, IMO.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Thinking about the issue further, I believe it is a false positive which
> >>>>>> is
> >>>>>> caused by CONFIG_DEBUG_KOBJECT_RELEASE, follows the story:
> >>>>>>
> >>>>>> 1) device_add() path
> >>>>>>
> >>>>>> get_device_parent()
> >>>>>> ......
> >>>>>> mutex_lock(&gdp_mutex);
> >>>>>>
> >>>>>> /* find our class-directory at the parent and
> >>>>>> reference
> >>>>>> it */
> >>>>>> spin_lock(&dev->class->p->glue_dirs.list_lock);
> >>>>>> list_for_each_entry(k, &dev->class->p->glue_dirs.list,
> >>>>>> entry)
> >>>>>> if (k->parent == parent_kobj) {
> >>>>>> kobj = kobject_get(k);
> >>>>>> break;
> >>>>>> }
> >>>>>> spin_unlock(&dev->class->p->glue_dirs.list_lock);
> >>>>>> if (kobj) {
> >>>>>> mutex_unlock(&gdp_mutex);
> >>>>>> return kobj;
> >>>>>> }
> >>>>>> k = class_dir_create_and_add(dev->class, parent_kobj);
> >>>>>> mutex_unlock(&gdp_mutex);
> >>>>>> ......
> >>>>>>
> >>>>>> 2) device_del() path
> >>>>>>
> >>>>>> cleanup_device_parent()
> >>>>>> ->cleanup_glue_dir()
> >>>>>> ->mutex_lock(&gdp_mutex);
> >>>>>> ->kobject_put(glue_dir);
> >>>>>> ->kobject_cleanup()
> >>>>>> ->kobject_del()
> >>>>>> ->kobj_kset_leave()
> >>>>>> ->mutex_unlock(&gdp_mutex);
> >>>>>>
> >>>>>> So gdp_mutex is held wrt. both looking up/adding the parent kobject
> >>>>>> and removing the parent kobject. Actually with this lock, both glur_dir
> >>>>>> kobject's referece counting and joining/leaving kset are run atomically.
> >>>>>>
> >>>>>> The race you found is triggered by CONFIG_DEBUG_KOBJECT_RELEASE,
> >>>>>> which causes kobject_cleanup() to be run in workqueue context and the
> >>>>>> lock of gdp_mutex can't be hold in that situation anymore. That is the
> >>>>>> root
> >>>>>> cause for your race.
> >>>>>>
> >>>>>> That also said DEBUG_KOBJECT_RELEASE might cause trouble in
> >>>>>> some situations.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Ming
> >>>>>>
> >>>>> I agree that, CONFIG_DEBUG_KOBJECT_RELEASE can cause the race, as it
> >>>>> delays
> >>>>> call to kobject_cleanup() which deletes kobj entry in glue_dir list.
> >>>>>
> >>>>> I see the race even after disabling CONFIG_DEBUG_KOBJECT_RELEASE.
> >>>>>
> >>>>> The race is between kobject_put() and kobject_get() for the "firmware"
> >>>>> directory.
> >>>>>
> >>>>> In device_del() path, I see the race after the call of
> >>>>> cleanup_device_parent().
> >>>>> cleanup_device_parent() calls kobject_put(glue_dir) with gdp_mutex lock
> >>>>> holding,
> >>>>> So ideally, to call kobject_put/kobject_get(glue_dir) we need to acquire
> >>>>> gdp_mutex.
> >>>>>
> >>>>> but kobject_del(kobj) will not hold the gdp_mutex lock while calling
> >>>>> kobject_put(kobj->parent).
> >>>>>
> >>>>> Here is the race:
> >>>>>
> >>>>> Consider 2 async works calls request_firmware() for 2 different binaries
> >>>>> (fw1.bin, fw2.bin)
> >>>>> Async work1 got scheduled first, and it falls in user helper mode so it
> >>>>> creates "firmware" dir
> >>>>> and then creates "fw1.bin" directory to load data from user.
> >>>>> After load completion or timeout, it starts cleanup dir's, meantime async
> >>>>> work2 got scheduled and
> >>>>> it also falls in user helper mode, and story starts here:
> >>>>>
> >>>>> PATH1 req fw for "fw1.bin"
> >>>>> PATH2
> >>>>> req fw for "fw2.bin"
> >>>>> _request_firmware_load()
> >>>>> device_del(dev1)
> >>>>>
> >>>>> ->cleanup_device_parent()
> >>>>>
> >>>>> mutex_lock(&gdp_mutex);
> >>>>>
> >>>>> _request_firmware_load("fw2.bin")
> >>>>> kobject_put(kobj->parent); --> firmware ref ->1
> >>>>> ->device_add(dev2)
> >>>>> mutex_unlock(&gdp_mutex);
> >>>>> ->get_device_parent()
> >>>>> ->kobject_del()
> >>>>> mutex_lock(&gdp_mutex)
> >>>>> kobj_kset_leave(kobj); -->del "fw1.bin" kobj in kset
> >>>>> list_for_each_entry(k, &dev->class->p->glue_dirs.list, entry)
> >>>>> kobject_put(kobj->parent); /* No lock :)*/
> >>>>> if (k->parent == parent_kobj) {
> >>>>>
> >>>>> ->kref_put() --> firmware ref ->0
> >>>>> kobj = kobject_get(k); -> k is "firmware"
> >>>>> dir
> >>>>> ->kboject_release()
> >>>>> break;
> >>>>>
> >>>>> ->kobject_del()
> >>>>> }
> >>>>>
> >>>>> /*here deletes firmware entry in kset*/
> >>>>> mutex_unlock(&gdp_mutex)
> >>>>>
> >>>>> ->dev2->kobj.parent = kobj -> kobj is released already
> >>>>>
> >>>>>
> >>>>>
> >>>>> PATH 1 is started removing the kobject of "fw1.bin" and its parent
> >>>>> "firmware"
> >>>>> .
> >>>>> cleanup_device_parent(dev1) makes firmware ref count to 1, by calling
> >>>>> kobject_put(firmware) with gdp_mutex held.
> >>>>>
> >>>>> and kobject_del(dev1->kobj) will again calls kobject_put(kobj->parent)
> >>>>> without holding gdp_mutex.
> >>>>>
> >>>>> In PATH2, glue_dir is able to find "firmware" kobj, as firmware dir kobj
> >>>>> is
> >>>>> deleted from the glue_dir list at kboject_release().
> >>>>> and it calls kobject_get(firmware), by the time PATH1 calls
> >>>>> kobject_release(firmware).
> >>>>>
> >>>>> So PATH2 got reference of freed firmware kobj and added as parent to
> >>>>> "fw2.bin" kobj.
> >>>>>
> >>>>> I think, here is possible fix for this race:
> >>>>>
> >>>>> ------------------------8<-------------------------------------
> >>>>> Signed-off-by: Lingutla Chandrasekhar <clingutla@xxxxxxxxxxxxxx>
> >>>>>
> >>>>> diff --git a/drivers/base/core.c b/drivers/base/core.c
> >>>>> index 5d8b713..47bc082 100644
> >>>>> --- a/drivers/base/core.c
> >>>>> +++ b/drivers/base/core.c
> >>>>> @@ -1224,7 +1224,13 @@ void device_del(struct device *dev)
> >>>>> BUS_NOTIFY_REMOVED_DEVICE,
> >>>>> dev);
> >>>>> kobject_uevent(&dev->kobj, KOBJ_REMOVE);
> >>>>> cleanup_device_parent(dev);
> >>>>> - kobject_del(&dev->kobj);
> >>>>> +
> >>>>> + if (dev->kobj.parent->kset == &dev->class->p->glue_dirs) {
> >>>>> + mutex_lock(&gdp_mutex);
> >>>>> + kobject_del(&dev->kobj);
> >>>>> + mutex_unlock(&gdp_mutex);
> >>>>> + } else
> >>>>> + kobject_del(&dev->kobj);
> >>>>> put_device(parent);
> >>>>> }
> >>>>> EXPORT_SYMBOL_GPL(device_del);
> >>>>>
> >>>>> --------------------------->8-----------------------------------
> >>>>
> >>>>
> >>>> OK, one simpler way is to move cleanup_device_parent(dev)
> >>>> after kobject_del(&dev->kobj), which looks more clean too.
> >>>>
> >>> No, We can't move that, bcz kobject_del(&dev->kobj) will clear the parent.
> >>
> >> Yeah, but I think it is better to do that and just call
> >> cleanup_glue_dir() directly
> >> because cleanup_device_parent() should have released the glue_dir.
> >>
> Sorry, i am not able understand correctly, If we reorder kobje_del and cleanup_device_parent(),
> kobject_del() clears kobj->parent, and cleanup_device_parent() gets glue_dir with dev->kobj->parent(),
> cleanup_glue_dir() returns without calling kobject_put() as glue_dir is NULL, so firmware dir becomes dangling kobj.
> Am I missing something here?
I mean something like below, and what do you think about it?
---
diff --git a/drivers/base/core.c b/drivers/base/core.c
index 0a8bdad..3f2e1f8 100644
--- a/drivers/base/core.c
+++ b/drivers/base/core.c
@@ -836,11 +836,31 @@ static struct kobject *get_device_parent(struct device *dev,
return NULL;
}
+static inline bool live_in_glue_dir(struct kobject *kobj,
+ struct device *dev)
+{
+ if (!kobj || !dev->class ||
+ kobj->kset != &dev->class->p->glue_dirs)
+ return true;
+ return false;
+}
+
+static inline struct kobject *get_glue_dir(struct device *dev)
+{
+ if (live_in_glue_dir(&dev->kobj, dev))
+ return dev->kobj.parent;
+ return NULL;
+}
+
+/*
+ * make sure cleaning up dir as the last step, we need to make
+ * sure .release handler of kobject is run with holding the
+ * global lock
+ */
static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
{
/* see if we live in a "glue" directory */
- if (!glue_dir || !dev->class ||
- glue_dir->kset != &dev->class->p->glue_dirs)
+ if (!live_in_glue_dir(glue_dir, dev))
return;
mutex_lock(&gdp_mutex);
@@ -848,11 +868,6 @@ static void cleanup_glue_dir(struct device *dev, struct kobject *glue_dir)
mutex_unlock(&gdp_mutex);
}
-static void cleanup_device_parent(struct device *dev)
-{
- cleanup_glue_dir(dev, dev->kobj.parent);
-}
-
static int device_add_class_symlinks(struct device *dev)
{
struct device_node *of_node = dev_of_node(dev);
@@ -1028,6 +1043,7 @@ int device_add(struct device *dev)
struct kobject *kobj;
struct class_interface *class_intf;
int error = -EINVAL;
+ struct kobject *glue_dir = NULL;
dev = get_device(dev);
if (!dev)
@@ -1072,8 +1088,10 @@ int device_add(struct device *dev)
/* first, register with generic layer. */
/* we require the name to be set before, and pass NULL */
error = kobject_add(&dev->kobj, dev->kobj.parent, NULL);
- if (error)
+ if (error) {
+ glue_dir = get_glue_dir(dev);
goto Error;
+ }
/* notify platform of device entry */
if (platform_notify)
@@ -1154,9 +1172,10 @@ done:
device_remove_file(dev, &dev_attr_uevent);
attrError:
kobject_uevent(&dev->kobj, KOBJ_REMOVE);
+ glue_dir = get_glue_dir(dev);
kobject_del(&dev->kobj);
Error:
- cleanup_device_parent(dev);
+ cleanup_glue_dir(dev, glue_dir);
put_device(parent);
name_error:
kfree(dev->p);
@@ -1232,6 +1251,7 @@ EXPORT_SYMBOL_GPL(put_device);
void device_del(struct device *dev)
{
struct device *parent = dev->parent;
+ struct kobject *glue_dir = NULL;
struct class_interface *class_intf;
/* Notify clients of device removal. This call must come
@@ -1276,8 +1296,9 @@ void device_del(struct device *dev)
blocking_notifier_call_chain(&dev->bus->p->bus_notifier,
BUS_NOTIFY_REMOVED_DEVICE, dev);
kobject_uevent(&dev->kobj, KOBJ_REMOVE);
- cleanup_device_parent(dev);
+ glue_dir = get_glue_dir(dev);
kobject_del(&dev->kobj);
+ cleanup_glue_dir(dev, glue_dir);
put_device(parent);
}
EXPORT_SYMBOL_GPL(device_del);
>
> >> BTW, failure path of device_add() has been broken already.
> >>
> > Yah, forgot to mention that. we need to reorder like above fix, can we?.
> >
> >>>
> >>>
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>>
> >>>>>>>> [21147.455038] ------------[ cut here ]------------
> >>>>>>>> [21147.458910] WARNING: at <...>/kernel/include/linux/kref.h:47
> >>>>>>>> kobject_get+0x50/0x68()
> >>>>>>>> [21147.481199] Modules linked in: core_ctl(PO) qdrbg_module(O)
> >>>>>>>> qcrypto_module(O)
> >>>>>>>> [21147.481256] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O
> >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1
> >>>>>>>> [21147.481284] Workqueue: events_unbound async_run_entry_fn
> >>>>>>>> [21147.481296] Call trace:
> >>>>>>>> [21147.481315] [<ffffffc000206b60>] dump_backtrace+0x0/0x270
> >>>>>>>> [21147.481331] [<ffffffc000206de0>] show_stack+0x10/0x1c
> >>>>>>>> [21147.481353] [<ffffffc000d569ec>] dump_stack+0x1c/0x28
> >>>>>>>> [21147.481370] [<ffffffc00021cc18>] warn_slowpath_common+0x74/0x9c
> >>>>>>>> [21147.481383] [<ffffffc00021ce4c>] warn_slowpath_null+0x14/0x20
> >>>>>>>> [21147.481397] [<ffffffc000456b10>] kobject_get+0x4c/0x68
> >>>>>>>> [21147.481416] [<ffffffc000609f64>] get_device_parent+0xa0/0x194
> >>>>>>>> [21147.481430] [<ffffffc00060a3f4>] device_add+0x100/0x600
> >>>>>>>> [21147.481446] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80
> >>>>>>>> [21147.481459] [<ffffffc00061c58c>] request_firmware+0x30/0x3c
> >>>>>>>> [21147.481473] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0
> >>>>>>>> [21147.481490] [<ffffffc00061c65c>]
> >>>>>>>> __async_dev_cache_fw_image+0x14/0x54
> >>>>>>>> [21147.481505] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c
> >>>>>>>> [21147.481521] [<ffffffc000237e88>] process_one_work+0x264/0x3dc
> >>>>>>>> [21147.481535] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340
> >>>>>>>> [21147.481553] [<ffffffc00023e500>] kthread+0xac/0xb8
> >>>>>>>> [21147.481563] ---[ end trace dabc98ea48b8ba59 ]---
> >>>>>>>>
> >>>>>>>> [21147.486436] Unable to handle kernel paging request at virtual
> >>>>>>>> address
> >>>>>>>> 0x6b6b6b6b6b6b6bcb
> >>>>>>>> [21147.493816] pgd = ffffffc0a982d000
> >>>>>>>> [21147.498337] [6b6b6b6b6b6b6bcb] *pgd=0000000000000000
> >>>>>>>> [21147.502287] Internal error: Oops: 96000004 [#1] PREEMPT SMP
> >>>>>>>> [21147.507826] Modules linked in: core_ctl(PO) qdrbg_module(O)
> >>>>>>>> qcrypto_module(O)
> >>>>>>>> [21147.514951] CPU: 2 PID: 23935 Comm: kworker/u16:8 Tainted: P W O
> >>>>>>>> 3.10.84-g0957845-00427-g56a05c2 #1
> >>>>>>>> [21147.524686] Workqueue: events_unbound async_run_entry_fn
> >>>>>>>> [21147.529965] task: ffffffc008138000 ti: ffffffc007880000 task.ti:
> >>>>>>>> ffffffc007880000
> >>>>>>>> [21147.537438] PC is at sysfs_create_dir+0x34/0xd4
> >>>>>>>> [21147.541949] LR is at kobject_add_internal+0x120/0x24c
> >>>>>>>> [21147.546979] pc : [<ffffffc000361fa8>] lr : [<ffffffc000456f5c>]
> >>>>>>>> pstate:
> >>>>>>>> 80000145
> >>>>>>>> [21147.996295] Process kworker/u16:8 (pid: 23935, stack limit =
> >>>>>>>> 0xffffffc007880058)
> >>>>>>>> [21148.003673] Call trace:
> >>>>>>>> [21148.006117] [<ffffffc000361fa8>] sysfs_create_dir+0x34/0xd4
> >>>>>>>> [21148.011663] [<ffffffc000456f58>] kobject_add_internal+0x11c/0x24c
> >>>>>>>> [21148.017737] [<ffffffc0004573ec>] kobject_add+0xc8/0xe4
> >>>>>>>> [21148.022863] [<ffffffc00060a410>] device_add+0x11c/0x600
> >>>>>>>> [21148.028069] [<ffffffc00061c390>] _request_firmware+0x8b4/0xa80
> >>>>>>>> [21148.033886] [<ffffffc00061c58c>] request_firmware+0x30/0x3c
> >>>>>>>> [21148.039439] [<ffffffc00061c5ec>] cache_firmware+0x54/0xb0
> >>>>>>>> [21148.044822] [<ffffffc00061c65c>]
> >>>>>>>> __async_dev_cache_fw_image+0x14/0x54
> >>>>>>>> [21148.051249] [<ffffffc000243794>] async_run_entry_fn+0x6c/0x12c
> >>>>>>>> [21148.057064] [<ffffffc000237e88>] process_one_work+0x264/0x3dc
> >>>>>>>> [21148.062792] [<ffffffc0002392c0>] worker_thread+0x1f0/0x340
> >>>>>>>> [21148.068263] [<ffffffc00023e500>] kthread+0xac/0xb8
> >>>>>>>> [21148.073039] Code: b5000094 14000026 d000ac54 9125a294 (7940c281)
> >>>>>>>> [21148.080312] ---[ end trace dabc98ea48b8ba5a ]---
> >>>>>>>>
> >>>>>>>> Below is the callstack where the parent is freed from the slub_debug
> >>>>>>>> info.
> >>>>>>>> 0xFFFFFFC000608D74 \\vmlinux\base/core\class_dir_release+0x0C
> >>>>>>>> 0xFFFFFFC0002FE6BC \\vmlinux\slub\free_debug_processing\fail+0x114
> >>>>>>>> 0xFFFFFFC0002FE7A0 \\vmlinux\slub\__slab_free+0x44
> >>>>>>>> 0xFFFFFFC0002FEE60 \\vmlinux\slub\kfree+0x1F8
> >>>>>>>> 0xFFFFFFC000608D70 \\vmlinux\base/core\class_dir_release+0x8
> >>>>>>>> 0xFFFFFFC000456D70 \\vmlinux\kobject\kobject_release+0x134
> >>>>>>>> 0xFFFFFFC000456E10 \\vmlinux\kobject\kobject_put+0x58
> >>>>>>>> 0xFFFFFFC000456C28 \\vmlinux\kobject\kobject_del+0x64
> >>>>>>>> 0xFFFFFFC000609D78 \\vmlinux\base/core\device_del+0x150
> >>>>>>>> 0xFFFFFFC00061C524
> >>>>>>>> \\vmlinux\firmware_class\_request_firmware\out+0x71C
> >>>>>>>> 0xFFFFFFC00061C58C \\vmlinux\firmware_class\request_firmware+0x30
> >>>>>>>> 0xFFFFFFC00061C5EC \\vmlinux\firmware_class\cache_firmware+0x54
> >>>>>>>> 0xFFFFFFC00061C65C
> >>>>>>>> \\vmlinux\firmware_class\__async_dev_cache_fw_image+0x14
> >>>>>>>> 0xFFFFFFC000243794 \\vmlinux\async\async_run_entry_fn+0x6C
> >>>>>>>> 0xFFFFFFC000237E88 \\vmlinux\workqueue\process_one_work+0x264
> >>>>>>>> 0xFFFFFFC0002392C0 \\vmlinux\workqueue\worker_thread\recheck+0x1A0
> >>>>>>>> 0xFFFFFFC00023E500 \\vmlinux\kthread\kthread+0xAC
> >>>>>>>>
> >>>>>>>> For similar type of issue, I see there is an existing fix: "sysfs:
> >>>>>>>> driver
> >>>>>>>> core: Fix glue dir race condition by gdp_mutex".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/drivers/base/core.c?id=e4a60d139060975eb956717e4f63ae348d4d8cc5
> >>>>>>>>
> >>>>>>>> Still I am able to reproduce the issue. I have verified this race on
> >>>>>>>> kernels: 3.10, 3.18 and 4.4.
> >>>>>>>>
> >>>>>>>> I followed below procedure to reproduce the issue:
> >>>>>>>> 1. Enable "CONFIG_DEBUG_KOBJECT_RELEASE"
> >>>>>>>> 2. Use test_firmware, modified the test for async calls
> >>>>>>>> 3. Replace WARN_ON with BUG_ON in kref_get().
> >>>>>>>> I ran below script from shell:
> >>>>>>>> count=0
> >>>>>>>> while [ 1 ]
> >>>>>>>> do
> >>>>>>>> echo 3 >
> >>>>>>>> /sys/devices/virtual/misc/test_firmware/trigger_ufw
> >>>>>>>> if [ $? -ne 0 ]; then
> >>>>>>>> print "Exiting.. "
> >>>>>>>> exit 1
> >>>>>>>> fi
> >>>>>>>> count=$(($count +1))
> >>>>>>>> echo "count $count"
> >>>>>>>> done
> >>>>>>>>
> >>>>>>>> With below patch, I could reproduce the issue in the 2nd iteration.
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> I will run your test script to see if there is new findings.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -------------------------8<----------------------------------------------
> >>>>>>>>
> >>>>>>>> To reproduce the race condition, use firmware test and
> >>>>>>>> CONFIG_DEBUG_KOBJECT_RELEASE=y, which delays kobject
> >>>>>>>> releases. So that we can catch the bug easily.
> >>>>>>>>
> >>>>>>>> Enabled below flags in kernel config file:
> >>>>>>>>
> >>>>>>>> +CONFIG_DEBUG_KOBJECT_RELEASE=y
> >>>>>>>> +CONFIG_TEST_FIRMWARE=y
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> Looks I can't reproduce the issue with your approach.
> >>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> diff --git a/include/linux/kref.h b/include/linux/kref.h
> >>>>>>>> index 484604d..727fb24 100644
> >>>>>>>> --- a/include/linux/kref.h
> >>>>>>>> +++ b/include/linux/kref.h
> >>>>>>>> @@ -44,7 +44,7 @@ static inline void kref_get(struct kref *kref)
> >>>>>>>> * condition when this kref is freeing by some other thread
> >>>>>>>> right
> >>>>>>>> now.
> >>>>>>>> * In this case one should use kref_get_unless_zero()
> >>>>>>>> */
> >>>>>>>> - WARN_ON_ONCE(atomic_inc_return(&kref->refcount) < 2);
> >>>>>>>> + BUG_ON(atomic_inc_return(&kref->refcount) < 2);
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> /**
> >>>>>>>> diff --git a/lib/test_firmware.c b/lib/test_firmware.c
> >>>>>>>> old mode 100644
> >>>>>>>> new mode 100755
> >>>>>>>> index 86374c1..14c9598
> >>>>>>>> --- a/lib/test_firmware.c
> >>>>>>>> +++ b/lib/test_firmware.c
> >>>>>>>> @@ -18,6 +18,7 @@
> >>>>>>>> #include <linux/miscdevice.h>
> >>>>>>>> #include <linux/slab.h>
> >>>>>>>> #include <linux/uaccess.h>
> >>>>>>>> +#include <linux/async.h>
> >>>>>>>>
> >>>>>>>> +
> >>>>>>>> +static ASYNC_DOMAIN_EXCLUSIVE(ufw_domain);
> >>>>>>>> +
> >>>>>>>> +static void __async_ufw_work(void *name,
> >>>>>>>> + async_cookie_t cookie)
> >>>>>>>> +{
> >>>>>>>> + const struct firmware *fw;
> >>>>>>>> + const char *fw_name = name;
> >>>>>>>> + int ret;
> >>>>>>>> +
> >>>>>>>> + pr_err("requesting %s\n", fw_name);
> >>>>>>>> +
> >>>>>>>> + ret = request_firmware(&fw, fw_name, NULL);
> >>>>>>>> + pr_err("loaded: %zu\n", fw ? fw->size : 0);
> >>>>>>>> + if (!ret)
> >>>>>>>> + kfree(fw);
> >>>>>>>> +
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static char *name[] = { "fw1.bin", "fw2.bin"};
> >>>>>>>> +
> >>>>>>>> +static ssize_t trigger_ufw_store(struct device *dev,
> >>>>>>>> + struct device_attribute *attr,
> >>>>>>>> + const char *buf, size_t count)
> >>>>>>>> +{
> >>>>>>>> + int rc, i, c=0;
> >>>>>>>> + u32 start;
> >>>>>>>> + char *fw_name;
> >>>>>>>> +
> >>>>>>>> + rc = kstrtou32(buf, 0, &start);
> >>>>>>>> + if (rc){
> >>>>>>>> + pr_err("Invalid option\n");
> >>>>>>>> + return rc;
> >>>>>>>> + }
> >>>>>>>> + pr_err(" no of iterations %d\n", start);
> >>>>>>>> +
> >>>>>>>> + for (i=0; i< start; i++)
> >>>>>>>> + {
> >>>>>>>> + if (c >= 2)
> >>>>>>>> + c = 0;
> >>>>>>>> + fw_name = name[c++];
> >>>>>>>> + async_schedule_domain(__async_ufw_work, (void *)fw_name,
> >>>>>>>> &ufw_domain);
> >>>>>>>> + }
> >>>>>>>> + async_synchronize_full_domain(&ufw_domain);
> >>>>>>>> + return count;
> >>>>>>>> +}
> >>>>>>>>
> >>>>>>>> static DEVICE_ATTR_WO(trigger_request);
> >>>>>>>> +static DEVICE_ATTR_WO(trigger_ufw);
> >>>>>>>>
> >>>>>>>> static int __init test_firmware_init(void)
> >>>>>>>> {
> >>>>>>>> @@ -92,6 +140,12 @@ static int __init test_firmware_init(void)
> >>>>>>>> goto dereg;
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>> + rc = device_create_file(test_fw_misc_device.this_device,
> >>>>>>>> + &dev_attr_trigger_ufw);
> >>>>>>>> + if (rc) {
> >>>>>>>> + pr_err("could not create sysfs interface: %d\n", rc);
> >>>>>>>> + goto dereg;
> >>>>>>>> + }
> >>>>>>>> pr_warn("interface ready\n");
> >>>>>>>>
> >>>>>>>> return 0;
> >>>>>>>> @@ -106,7 +160,9 @@ static void __exit test_firmware_exit(void)
> >>>>>>>> {
> >>>>>>>> release_firmware(test_firmware);
> >>>>>>>> device_remove_file(test_fw_misc_device.this_device,
> >>>>>>>> - &dev_attr_trigger_request);
> >>>>>>>> + &dev_attr_trigger_request);
> >>>>>>>> + device_remove_file(test_fw_misc_device.this_device,
> >>>>>>>> + &dev_attr_trigger_ufw);
> >>>>>>>> misc_deregister(&test_fw_misc_device);
> >>>>>>>> pr_warn("removed interface\n");
> >>>>>>>> }
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> ----------------------------8<--------------------------------------------
> >>>>>>>>
> >>>>>>>> Thanks and regards,
> >>>>>>>> Chandrasekhar L.
> >>>>>>>> --
> >>>>>>>> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a
> >>>>>>>> member of
> >>>>>>>> Code Aurora Forum,
> >>>>>>>> a Linux Foundation Collaborative Project.
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-arm-msm" in
> > the body of a message to majordomo@xxxxxxxxxxxxxxx
> > More majordomo info at http://vger.kernel.org/majordomo-info.html