Re: [Letux-kernel] BUG: drivers/pinctrl/core: races in pinctrl_groups and deferred probing

From: H. Nikolaus Schaller
Date: Sat Jun 16 2018 - 07:07:55 EST


Hi Tony,

> Am 15.06.2018 um 13:18 schrieb H. Nikolaus Schaller <hns@xxxxxxxxxxxxx>:
>
> Hi Tony,
>
>> Am 15.06.2018 um 13:13 schrieb Tony Lindgren <tony@xxxxxxxxxxx>:
>>
>> * Tony Lindgren <tony@xxxxxxxxxxx> [180615 07:00]:
>>
>> OK I posted a series to fix these issues hopefully as thread
>> "[PATCH 0/5] pinctrl fixes for generic functions and groups".
>>
>> Can you please test and see if that is enough to fix the issues
>> you're seeing?
>
> Yes, I'll try asap.

Ok, I tried on OMAP3 (GTA04A5 board).

First boot was ok, but second did show again the strcmp(NULL) :(

Interestingly, adding my printk() to pinctrl_generic_add_group makes it disappear.
(I think the printk adds more delay to one of the code paths and avoids the race or
problematic concurrency).

There is also good news: the selectors are now assigned in strict sequence.
This is a big improvement:

root@letux:~# dmesg|fgrep generic_add
[ 0.739501] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_hsusb2_pins selector=0
[ 0.741638] pinctrl_generic_add_group: pctldev=ee481f80 name=pinmux_hsusb2_2_pins selector=0
[ 0.742401] pinctrl_generic_add_group: pctldev=ee481e00 name=pinmux_mcbsp1_devconf0_pins selector=0
[ 0.743072] pinctrl_generic_add_group: pctldev=ee481d00 name=pinmux_tv_acbias_devconf1_pins selector=0
[ 0.760437] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_uart1_pins selector=1
[ 0.762268] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_uart2_pins selector=2
[ 0.763610] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_uart3_pins selector=3
[ 1.790527] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_twl4030_pins selector=4
[ 1.799835] pinctrl_generic_add_group: pctldev=ee3cd000 name=pinmux_twl4030_vpins selector=0
[ 1.969635] pinctrl_generic_add_group: pctldev=ee481f80 name=spi_gpio_pinmux selector=1
[ 2.164093] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_mmc1_pins selector=5
[ 2.198242] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_wlan_irq_pin selector=6
[ 2.374328] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_gpmc_pins selector=7
[ 2.769989] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_wlan_pins selector=8
[ 2.785888] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_bt_pins selector=9
[ 2.802154] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_wlan_irq_pin selector=10
[ 5.415222] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_dss_dpi_pins selector=11
[ 6.080566] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_mcbsp1_pins selector=12
[ 6.182617] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_mcbsp2_pins selector=13
[ 6.271240] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_mcbsp3_pins selector=14
[ 6.330718] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_penirq_pins selector=15
[ 6.368927] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_mcbsp4_pins selector=16
[ 6.616546] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=17
[ 6.835418] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=18
[ 6.874694] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=19
[ 6.900299] pinctrl_generic_add_group: pctldev=ee3cd180 name=hdq_pins selector=20
[ 6.937042] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=21
[ 7.062744] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=22
[ 7.108032] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=23
[ 7.145080] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=24
[ 7.197723] pinctrl_generic_add_group: pctldev=ee3cd180 name=backlight_pins_pinmux selector=25
[ 7.427398] pinctrl_generic_add_group: pctldev=ee3cd180 name=modem_pins selector=26
[ 7.476562] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=27
[ 7.505981] pinctrl_generic_add_group: pctldev=ee3cd180 name=backlight_pins_pinmux selector=28
[ 7.564910] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=29
[ 7.894439] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=30
[ 8.207489] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=31
[ 8.311096] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_pps_pins selector=32
[ 8.340667] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=33
[ 8.414428] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=34
[ 9.231445] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=35
[ 9.446716] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=36
root@letux:~#

But it looks as if we still have duplicate assignments by deferred probing, i.e. some cleanup is
missing (or is this intended behaviour?).

Regarding the strcmp(NULL) faults, it seems as we still have the situation that there are
invalid selector numbers stored somewhere and used by generic driver probing code (i.e. before
the driver is modprobed):

[ 0.000000] Booting Linux on physical CPU 0x0
[ 0.000000] Linux version 4.17.0-letux+ (hns@xxxxxxxxxxxxxx) (gcc version 4.9.2 (GCC)) #2453 SMP PREEMPT Fri Jun 15 20:34:03 CEST 2018
[ 0.000000] CPU: ARMv7 Processor [413fc082] revision 2 (ARMv7), cr=10c5387d
[ 0.000000] CPU: PIPT / VIPT nonaliasing data cache, VIPT aliasing instruction cache
[ 0.000000] OF: fdt: Machine model: Goldelico GTA04A5/Letux 2804

...

[ 6.884918] pinctrl_generic_get_group_name: group>name is NULL
[ 6.933441] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@xxxxxxxx>
[ 6.947204] pinctrl_get_group_selector: strcmp: (null) modem_pins
[ 6.980468] get_group_name = pinctrl_generic_get_group_name+0x0/0x5c
[ 7.009735] pctldev = 0xee396180
[ 7.013336] Unable to handle kernel NULL pointer dereference at virtual address 00000000
[ 7.099395] pgd = (ptrval)
[ 7.102233] [00000000] *pgd=be11a831
[ 7.106018] Internal error: Oops: 17 [#1] PREEMPT SMP ARM
[ 7.111663] Modules linked in: pps_gpio(+) panel_tpo_td028ttec1 snd_soc_simple_card(+) snd_soc_simple_card_utils snd_soc_omap_twl4030(+) pps_core encoder_opa362 wwan_on_off(+) snd_soc_gtm601 pwm_omap_dmtimer connector_analog_tv generic_adc_battery pwm_bl bq27xxx_battery_hdq bq27xxx_battery wlcore_sdio bmp280_spi omap_hdq omap2430 ov9655 v4l2_fwnode v4l2_common snd_soc_omap_mcbsp snd_soc_omap snd_pcm_dmaengine bmp280_i2c bmp280 videodev at24 leds_tca6507 tsc2007 phy_twl4030_usb bmc150_accel_i2c bmc150_magn_i2c media bmc150_accel_core bmc150_magn industrialio_triggered_buffer kfifo_buf musb_hdrc gpio_twl4030 snd_soc_twl4030 twl4030_pwrbutton twl4030_charger twl4030_vibra twl4030_madc industrialio gnss_w2sg0004 ehci_omap w2cbw003_bluetooth gnss omapdss omapdss_base cec
[ 7.182952] CPU: 0 PID: 1198 Comm: udevd Not tainted 4.17.0-letux+ #2453
[ 7.189971] Hardware name: Generic OMAP36xx (Flattened Device Tree)
[ 7.196563] PC is at strcmp+0x0/0x34
[ 7.200317] LR is at pinctrl_get_group_selector+0x84/0xc8
[ 7.205993] pc : [<c0758ec0>] lr : [<c042d9bc>] psr: 600f0013
[ 7.212554] sp : ed019cf8 ip : 00000000 fp : 0000001d
[ 7.218017] r10: 0000001c r9 : 0000001d r8 : 00000000
[ 7.223480] r7 : c0799fac r6 : ee396180 r5 : ef7c4b74 r4 : 0000001a
[ 7.230316] r3 : 00000000 r2 : 00000000 r1 : ef7c4b74 r0 : 00000000
[ 7.237182] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
[ 7.244659] Control: 10c5387d Table: ad01c019 DAC: 00000051
[ 7.250671] Process udevd (pid: 1198, stack limit = 0x(ptrval))
[ 7.256866] Stack: (0xed019cf8 to 0xed01a000)
[ 7.261444] 9ce0: ed019d1c ee396180
[ 7.270019] 9d00: ef7c4b74 00000000 ed0fc810 00000000 ed20d740 c042e990 ed20d990 00000001
[ 7.278594] 9d20: 00000002 ed20d9c0 ed20d780 ed20d7c0 ed0fc810 00000000 00000000 ed20d740
[ 7.287139] 9d40: 00000000 c042d1a4 00000014 ee23aac0 c0b59e30 c08e9f64 ee23ca10 00000000
[ 7.295715] 9d60: ed20dad0 ee23ca10 c0bd1df0 fffffdfb bf248020 00000030 00000000 c042d36c
[ 7.304290] 9d80: 00000000 ee23ca10 ed20db50 c04f2e70 ee23ca10 00000000 c0bd1df4 c04d3e8c
[ 7.312866] 9da0: ee23ca10 ee23ca44 bf248020 c0b666c8 bf248180 bf2481b0 00000000 c04d413c
[ 7.321441] 9dc0: ee23ca10 bf248020 c04d40bc c04d2620 ee01be58 ee25e034 bf248020 00000000
[ 7.330017] 9de0: ed0fc000 c04d34b8 bf247727 bf247728 00000000 bf248020 c0b98de0 bf24b000
[ 7.338592] 9e00: 00000000 c04d4ce0 c04d5500 bf248080 c0b98de0 c0102d74 014000c0 c07502e0
[ 7.347167] 9e20: bf2480c8 ed22e2c8 ed22e200 c08e85a2 ed1fb080 bf248080 00000001 ed22e2e4
[ 7.355743] 9e40: ed1fb0c0 a0070113 ed1fb100 ed1fb080 ee000000 014000c0 bf248080 ed019f50
[ 7.364318] 9e60: bf24808c bf248080 ed019f50 ed1fb100 00000000 c01a7174 bf248080 ed1fb100
[ 7.372894] 9e80: bf248080 ed019f50 bf24808c c01a5fa4 ffff8000 00007fff bf248080 c01a32d4
[ 7.381469] 9ea0: 00000075 c076a678 f0bb090c b6e119f8 00000000 00000003 ee76e780 0000295c
[ 7.390045] 9ec0: 00000000 c024cf6c 00000003 00000000 00000000 00000000 00000000 00000000
[ 7.398620] 9ee0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
[ 7.407196] 9f00: 00000000 00000000 7fffffff 00000000 b6e119f8 00000006 0000017b c01011e4
[ 7.415771] 9f20: ed018000 00000000 00058378 c01a6328 7fffffff 00000000 00000003 c021707c
[ 7.424316] 9f40: 00000002 f0bae000 0000295c 00000000 f0baebbe f0bae000 0000295c f0bb0394
[ 7.432891] 9f60: f0baf1b7 f0bafa98 00003000 000031b0 00000000 00000000 00000000 00001c1c
[ 7.441467] 9f80: 00000023 00000024 0000001b 00000000 00000017 00000000 b6e127d4 00051f70
[ 7.450042] 9fa0: 17964f00 c0101000 b6e127d4 00051f70 00000006 b6e119f8 00000000 b6e1231c
[ 7.458618] 9fc0: b6e127d4 00051f70 17964f00 0000017b 00020000 00037f78 00050048 00058378
[ 7.467193] 9fe0: bee21ee0 bee21ed0 b6e0bc4b b6f16a42 60070030 00000006 00000000 00000000
[ 7.475769] [<c0758ec0>] (strcmp) from [<c042d9bc>] (pinctrl_get_group_selector+0x84/0xc8)
[ 7.484436] [<c042d9bc>] (pinctrl_get_group_selector) from [<c042e990>] (pinmux_map_to_setting+0x158/0x1a0)
[ 7.494659] [<c042e990>] (pinmux_map_to_setting) from [<c042d1a4>] (create_pinctrl+0x1f0/0x2f8)
[ 7.503784] [<c042d1a4>] (create_pinctrl) from [<c042d36c>] (devm_pinctrl_get+0x2c/0x6c)
[ 7.512268] [<c042d36c>] (devm_pinctrl_get) from [<c04f2e70>] (pinctrl_bind_pins+0x3c/0x138)
[ 7.521118] [<c04f2e70>] (pinctrl_bind_pins) from [<c04d3e8c>] (driver_probe_device+0xe8/0x318)
[ 7.530242] [<c04d3e8c>] (driver_probe_device) from [<c04d413c>] (__driver_attach+0x80/0xa4)
[ 7.539093] [<c04d413c>] (__driver_attach) from [<c04d2620>] (bus_for_each_dev+0x58/0x7c)
[ 7.547668] [<c04d2620>] (bus_for_each_dev) from [<c04d34b8>] (bus_add_driver+0xcc/0x1e0)
[ 7.556243] [<c04d34b8>] (bus_add_driver) from [<c04d4ce0>] (driver_register+0x9c/0xe0)
[ 7.564636] [<c04d4ce0>] (driver_register) from [<c0102d74>] (do_one_initcall+0xb4/0x248)
[ 7.573211] [<c0102d74>] (do_one_initcall) from [<c01a7174>] (do_init_module+0x58/0x1d0)
[ 7.581695] [<c01a7174>] (do_init_module) from [<c01a5fa4>] (load_module+0xe04/0xfb0)
[ 7.589904] [<c01a5fa4>] (load_module) from [<c01a6328>] (sys_finit_module+0x88/0x90)
[ 7.598114] [<c01a6328>] (sys_finit_module) from [<c0101000>] (ret_fast_syscall+0x0/0x54)
[ 7.606689] Exception stack(0xed019fa8 to 0xed019ff0)
[ 7.611999] 9fa0: b6e127d4 00051f70 00000006 b6e119f8 00000000 b6e1231c
[ 7.620574] 9fc0: b6e127d4 00051f70 17964f00 0000017b 00020000 00037f78 00050048 00058378
[ 7.629119] 9fe0: bee21ee0 bee21ed0 b6e0bc4b b6f16a42
[ 7.634429] Code: e3520000 e5e32001 1afffffb e12fff1e (e4d03001)
[ 8.755401] cfg80211: Loading compiled-in X.509 certificates for regulatory database
[ 8.764160] ---[ end trace e60e3511cff37b8a ]---

Can this be a side-effect of the duplicate selector assigments?

Let's assume a first probe attempt does:

[ 6.616546] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=17

This fails and is probed again doing:

[ 6.835418] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=18
[ 6.874694] pinctrl_generic_add_group: pctldev=ee3cd180 name=pinmux_camera_pins selector=19

Then, a different driver does

[ 6.900299] pinctrl_generic_add_group: pctldev=ee3cd180 name=hdq_pins selector=20

And calls pinctrl_get_group_selector() to find itself in the list by pinmux_map_to_setting().

For this the code in pinctrl_get_group_selector() loops over all 21 known selector numbers,
calling pctlops->get_group_name(pctldev, group_selector) for each one.

If slots 17 and 18 do not really exist (or are deleted after failed probing), there is
a NULL return. Which triggers the strcmp(NULL), because the strcmp is not guarded against
non-existing slots in the radix tree.

So a simple workaround could be:

if (gname && !strcmp(gname, pin_group)) {

But I think the fundamental problem is that the same driver assigns multiple slots if
probing is deferred.

Therefore, I tried to find out more why this happens. Here are some more observations by
studying the code and adding a dump_stack() inside pinctrl_generic_add_group():

1. the records stored in the radix tree are allocated through devm_kzalloc() within
pinctrl_generic_add_group().
2. I have not found a mechanism that removes them from the radix tree if they are
devm freed which IMHO happens if the probe fails (and all devm objects are
rewound).
3. I could not observe calls to pinctrl_generic_remove_group()
4. this means a stale pointer to the group_desc is still stored in the radix tree
if driver probing fails
5. scanning through the radix tree for a matching gname in pinctrl_get_group_selector()
accesses this stale radix tree entry.
It has either been overwritten by something else - or contains a dangling pointer for
group->name. This explains the randomness of the problem but that it is also repeatable
to some extent. For a pure race of some 100 instructions it happens IMHO too often.
6. the pinctrl_generic_add_group() is called from create_pinctrl() through
pinctrl_dt_to_map() and pcs_dt_node_to_map(), i.e. before the pinctrl_maps_mutex
is locked in create_pinctrl(). This means that pinctrl_generic_add_group() is
not locked in this case.

I hope this helps to pinpoint and solve the remaining bugs.

BR and thanks,
Nikolaus