Re: [REGRESSION] boot regression on linux-next on sc7180 platforms - null pointer dereference on iommu_dma_sync_sg_for_device

From: Neil Armstrong
Date: Wed May 22 2024 - 06:00:42 EST


Hi,

On 14/05/2024 18:41, Nícolas F. R. A. Prado wrote:
Hi,

KernelCI has identified a new boot regression on linux-next. It affects the
following platforms:
* sc7180-trogdor-kingoftown
* sc7180-trogdor-lazor-limozeen

I also see the regression on:
- SM8550-QRD
- SM8650-QRD

Reverting commit 8cc3bad9d9d6 ("spi: Remove unneded check for orig_nents") removes the issue.

Thanks for reporting this,
Neil

[ 6.404623] Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
[ 6.413685] Mem abort info:
[ 6.416574] ESR = 0x0000000096000006
[ 6.420436] EC = 0x25: DABT (current EL), IL = 32 bits
[ 6.425901] SET = 0, FnV = 0
[ 6.429046] EA = 0, S1PTW = 0
[ 6.432293] FSC = 0x06: level 2 translation fault
[ 6.437320] Data abort info:
[ 6.440289] ISV = 0, ISS = 0x00000006, ISS2 = 0x00000000
[ 6.445927] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 6.451121] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 6.456585] user pgtable: 4k pages, 48-bit VAs, pgdp=000000088f68b000
[ 6.463208] [000000000000001c] pgd=080000088f68d003, p4d=080000088f68d003, pud=080000088f68e003, pmd=0000000000000000
[ 6.474108] Internal error: Oops: 0000000096000006 [#1] PREEMPT SMP
[ 6.480542] Modules linked in: ucsi_glink pmic_glink_altmode goodix_berlin_spi(+) nb7vpq904m wcd939x_usbss qcom_battmgr typec_ucsi aux_hpd_bridge goodix_berlin_core crct10dif_ce hci_uart rtc_pm8xxx leds_qcom_lpg led_class_multicolor qcom_pon nvmem_qcom_spmi_sdam sm3_ce qcom_pbs btqca snd_soc_wcd939x snd_soc_sc8280xp snd_soc_wcd939x_sdw phy_qcom_eusb2_repeater snd_soc_qcom_sdw regmap_sdw qcom_spmi_temp_alarm snd_soc_qcom_common btbcm snd_soc_wcd_mbhc sm3 qcom_stats snd_soc_wcd_classh drm_dp_aux_bus sha3_ce gpu_sched sha512_ce sha512_arm64 drm_exec bluetooth qcom_q6v5_pas phy_qcom_qmp_combo qcrypto soundwire_qcom qcom_pil_info snd_soc_lpass_va_macro pinctrl_sm8650_lpass_lpi authenc snd_soc_lpass_tx_macro aux_bridge cfg80211 spi_geni_qcom i2c_qcom_geni snd_soc_lpass_rx_macro rfkill phy_qcom_snps_eusb2 dispcc_sm8650 drm_display_helper pinctrl_lpass_lpi gpi snd_soc_lpass_wsa_macro snd_soc_lpass_macro_common slimbus drm_kms_helper gpucc_sm8650 ipa qcom_q6v5 qrtr libdes phy_qcom_qmp_ufs qcom_sysmon qcom_common
[ 6.480602] qcom_glink_smem
[ 6.571649] soundwire_bus mdt_loader pmic_glink qcom_rng phy_qcom_qmp_pcie llcc_qcom ufs_qcom icc_bwmon typec rmtfs_mem pdr_interface qmi_helpers nvmem_reboot_mode socinfo fuse drm backlight ipv6
[ 6.597201] CPU: 4 PID: 241 Comm: (udev-worker) Tainted: G S 6.9.0-next-20240521 #1
[ 6.606488] Hardware name: Qualcomm Technologies, Inc. SM8650 QRD (DT)
[ 6.613189] pstate: 63400005 (nZCv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 6.641597] lr : __dma_sync_sg_for_device+0x3c/0x40
[ 6.646632] sp : ffff800081bf3260
[ 6.660650] x26: ffff59520fbd1c80 x25: 0000000000000000 x24: ffffb46fccd24988
[ 6.660653] x23: ffff595201628410 x22: 0000000000000002 x21: 0000000000000000
[ 6.660655] x20: ffff800081bf33f0 x19: 0000000000000000 x18: 0000000000000001
[ 6.660656] x17: 0000000000000018 x16: 0000000000000100 x15: 0000000000000002
[ 6.688275] x14: 0000000000000001 x13: ffff595200995180 x12: 000000000025a5c8
[ 6.688277] x11: 0000000000000820 x10: 0000000000000001 x9 : ffff59520fbd1c69
[ 6.688279] x8 : ffff595202169704 x7 : 00000000ffffffff x6 : 0000000000000001
[ 6.688281] x5 : fffffdffbf7a8cc0 x4 : ffffb46fcc0232a4 x3 : 0000000000000002
[ 6.688283] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff595201628410
[ 6.688286] Call trace:
[ 6.688287] iommu_dma_sync_sg_for_device+0x28/0x100
[ 6.717582] __dma_sync_sg_for_device+0x3c/0x40
[ 6.717585] spi_transfer_one_message+0x358/0x680
[ 6.732229] __spi_pump_transfer_message+0x188/0x494
[ 6.732232] __spi_sync+0x2a8/0x3c4
[ 6.732234] spi_sync+0x30/0x54
[ 6.732236] goodix_berlin_spi_write+0xf8/0x164 [goodix_berlin_spi]
[ 6.739854] _regmap_raw_write_impl+0x538/0x674
[ 6.750053] _regmap_raw_write+0xb4/0x144
[ 6.750056] regmap_raw_write+0x7c/0xc0
[ 6.750058] goodix_berlin_power_on+0xb0/0x1b0 [goodix_berlin_core]
[ 6.765520] goodix_berlin_probe+0xc0/0x660 [goodix_berlin_core]
[ 6.765522] goodix_berlin_spi_probe+0x12c/0x14c [goodix_berlin_spi]
[ 6.772339] spi_probe+0x84/0xe4
[ 6.772342] really_probe+0xbc/0x29c
[ 6.784313] __driver_probe_device+0x78/0x12c
[ 6.784316] driver_probe_device+0x3c/0x15c
[ 6.784319] __driver_attach+0x90/0x19c
[ 6.784322] bus_for_each_dev+0x7c/0xdc
[ 6.794520] driver_attach+0x24/0x30
[ 6.794523] bus_add_driver+0xe4/0x208
[ 6.794526] driver_register+0x5c/0x124
[ 6.802586] __spi_register_driver+0xa4/0xe4
[ 6.802589] goodix_berlin_spi_driver_init+0x20/0x1000 [goodix_berlin_spi]
[ 6.802591] do_one_initcall+0x80/0x1c8
[ 6.902310] do_init_module+0x60/0x218
[ 6.921988] load_module+0x1bcc/0x1d8c
[ 6.925847] init_module_from_file+0x88/0xcc
[ 6.930238] __arm64_sys_finit_module+0x1dc/0x2e4
[ 6.935074] invoke_syscall+0x48/0x114
[ 6.938944] el0_svc_common.constprop.0+0xc0/0xe0
[ 6.943781] do_el0_svc+0x1c/0x28
[ 6.947195] el0_svc+0x34/0xd8
[ 6.950348] el0t_64_sync_handler+0x120/0x12c
[ 6.954833] el0t_64_sync+0x190/0x194
[ 6.958600] Code: 2a0203f5 2a0303f6 a90363f7 aa0003f7 (b9401c20)
[ 6.964859] ---[ end trace 0000000000000000 ]---
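
As a reading aid for the trace above: the word in parentheses on the "Code:" line is the instruction that faulted. The sketch below decodes it, assuming the standard A64 encoding for a 32-bit LDR (immediate, unsigned offset). The helper name decode_ldr_w_uimm is hypothetical and not part of any kernel tooling.

```python
def decode_ldr_w_uimm(word: int) -> str:
    """Decode an A64 32-bit LDR (immediate, unsigned offset) instruction word."""
    # For this encoding the top ten bits are fixed: 0b1011100101.
    if (word >> 22) != 0b1011100101:
        raise ValueError("not a 32-bit LDR (immediate, unsigned offset)")
    imm12 = (word >> 10) & 0xFFF  # offset field, scaled by the 4-byte access size
    rn = (word >> 5) & 0x1F       # base register (register 31 would mean sp; not handled here)
    rt = word & 0x1F              # destination register
    return f"ldr w{rt}, [x{rn}, #{imm12 * 4}]"


# The faulting word from the "Code:" line of the oops.
print(decode_ldr_w_uimm(0xB9401C20))
```

The decoded load reads 4 bytes at x1 + 28. The register dump shows x1 = 0, so the access lands on 0x1c, which matches the reported faulting address. That is consistent with a NULL pointer being passed in the second argument register, although the trace alone does not show where the NULL came from.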


The regression was introduced in next-20240509, and still affects today's
(next-20240514) release.

The config used was the upstream arm64 defconfig with a config fragment on top
[1].

The following stack traces are produced during boot and a usable shell is never
reached:

[ 0.381981] Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
[ 0.381989] Mem abort info:
[ 0.381991] ESR = 0x0000000096000004
[ 0.381994] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.381997] SET = 0, FnV = 0
[ 0.382000] EA = 0, S1PTW = 0
[ 0.382003] FSC = 0x04: level 0 translation fault
[ 0.382006] Data abort info:
[ 0.382008] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 0.382011] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 0.382014] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 0.382017] [000000000000001c] user address but active_mm is swapper
[ 0.382021] Internal error: Oops: 0000000096000004 [#1] PREEMPT SMP
[ 0.382025] Modules linked in:
[ 0.382032] CPU: 4 PID: 68 Comm: kworker/u32:2 Not tainted 6.9.0-next-20240514-dirty #380
[ 0.382038] Hardware name: Google Kingoftown (DT)
[ 0.382042] Workqueue: async async_run_entry_fn
[ 0.382055] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.382061] pc : iommu_dma_sync_sg_for_device+0x28/0x100
[ 0.382070] lr : __dma_sync_sg_for_device+0x28/0x4c
[ 0.382080] sp : ffff800080943740
[ 0.382082] x29: ffff800080943740 x28: ffff36ee44280000 x27: ffff36ee40bd7810
[ 0.382092] x26: ffff800080943998 x25: ffff36ee44280480 x24: ffffb54600bcf0e8
[ 0.382101] x23: ffff36ee40bd7810 x22: 0000000000000001 x21: 0000000000000000
[ 0.382110] x20: ffffb54600f3d098 x19: 0000000000000000 x18: ffffb54601c1a210
[ 0.382118] x17: 000000040044ffff x16: 0000000000000000 x15: ffff36efb6d95580
[ 0.382126] x14: ffff36ee409156c0 x13: 0000000000001797 x12: 0000000000000002
[ 0.382134] x11: 0000000000000004 x10: ffff36ee4308b3d8 x9 : ffff36ee44280469
[ 0.382143] x8 : ffff36ee4308b304 x7 : 00000000ffffffff x6 : 0000000000000001
[ 0.382151] x5 : ffffb5460033a740 x4 : ffffb545ff50375c x3 : 0000000000000001
[ 0.382159] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff36ee40bd7810
[ 0.382167] Call trace:
[ 0.382170] iommu_dma_sync_sg_for_device+0x28/0x100
[ 0.382176] __dma_sync_sg_for_device+0x28/0x4c
[ 0.382183] spi_transfer_one_message+0x378/0x6e4
[ 0.382193] __spi_pump_transfer_message+0x190/0x4a4
[ 0.382199] __spi_sync+0x2a0/0x3c4
[ 0.382205] spi_sync_locked+0x10/0x1c
[ 0.382211] tpm_tis_spi_transfer_full+0x160/0x2fc
[ 0.382217] tpm_tis_spi_transfer+0x34/0x40
[ 0.382221] tpm_tis_spi_cr50_read_bytes+0x5c/0x90
[ 0.382226] tpm_tis_core_init+0xfc/0x7e0
[ 0.382231] tpm_tis_spi_init+0x54/0x70
[ 0.382236] cr50_spi_probe+0xf4/0x27c
[ 0.382241] tpm_tis_spi_driver_probe+0x34/0x64
[ 0.382245] spi_probe+0x84/0xe4
[ 0.382251] really_probe+0xbc/0x2a0
[ 0.382258] __driver_probe_device+0x78/0x12c
[ 0.382264] driver_probe_device+0x40/0x160
[ 0.382269] __device_attach_driver+0xb8/0x134
[ 0.382275] bus_for_each_drv+0x84/0xe0
[ 0.382280] __device_attach_async_helper+0xac/0xd0
[ 0.382286] async_run_entry_fn+0x34/0xe0
[ 0.382291] process_one_work+0x154/0x298
[ 0.382300] worker_thread+0x304/0x408
[ 0.382307] kthread+0x118/0x11c
[ 0.382313] ret_from_fork+0x10/0x20
[ 0.382324] Code: 2a0203f5 2a0303f6 a90363f7 aa0003f7 (b9401c20)
[ 0.382328] ---[ end trace 0000000000000000 ]---

[ 0.393379] spi_master spi6: will run message pump with realtime priority
[ 0.393896] Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
[ 0.393903] Mem abort info:
[ 0.393905] ESR = 0x0000000096000004
[ 0.393908] EC = 0x25: DABT (current EL), IL = 32 bits
[ 0.393912] SET = 0, FnV = 0
[ 0.393915] EA = 0, S1PTW = 0
[ 0.393917] FSC = 0x04: level 0 translation fault
[ 0.393920] Data abort info:
[ 0.393922] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 0.393925] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 0.393928] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 0.393931] [000000000000001c] user address but active_mm is swapper
[ 0.393935] Internal error: Oops: 0000000096000004 [#2] PREEMPT SMP
[ 0.393939] Modules linked in:
[ 0.393946] CPU: 2 PID: 103 Comm: cros_ec_spi_hig Tainted: G D 6.9.0-next-20240514-dirty #380
[ 0.393953] Hardware name: Google Kingoftown (DT)
[ 0.393956] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 0.393962] pc : iommu_dma_sync_sg_for_device+0x28/0x100
[ 0.393975] lr : __dma_sync_sg_for_device+0x28/0x4c
[ 0.393985] sp : ffff800080de3aa0
[ 0.393988] x29: ffff800080de3aa0 x28: ffff36ee44281800 x27: ffff36ee40ff8010
[ 0.393997] x26: ffff800080de3cf8 x25: ffff36ee44281c80 x24: ffffb54600bcf0e8
[ 0.394006] x23: ffff36ee40ff8010 x22: 0000000000000001 x21: 0000000000000000
[ 0.394014] x20: ffffb54600f3d3d8 x19: 0000000000000000 x18: ffffb54601c1a210
[ 0.394023] x17: 0000000000010108 x16: 0000000000000000 x15: 000000000000000c
[ 0.394031] x14: 0000000000000000 x13: ffff36ee40b962b0 x12: 0000000000000000
[ 0.394039] x11: 0000000000000000 x10: 0000000000003fff x9 : ffff36ee44281c69
[ 0.394047] x8 : ffff36ee4103e704 x7 : 00000000ffffffff x6 : 0000000000000001
[ 0.394055] x5 : ffffb5460033a740 x4 : ffffb545ff50375c x3 : 0000000000000001
[ 0.394063] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff36ee40ff8010
[ 0.394071] Call trace:
[ 0.394074] iommu_dma_sync_sg_for_device+0x28/0x100
[ 0.394081] __dma_sync_sg_for_device+0x28/0x4c
[ 0.394088] spi_transfer_one_message+0x378/0x6e4
[ 0.394096] __spi_pump_transfer_message+0x190/0x4a4
[ 0.394103] __spi_sync+0x2a0/0x3c4
[ 0.394109] spi_sync_locked+0x10/0x1c
[ 0.394115] do_cros_ec_pkt_xfer_spi+0x108/0x530
[ 0.394122] cros_ec_xfer_high_pri_work+0x20/0x34
[ 0.394127] kthread_worker_fn+0xcc/0x184
[ 0.394134] kthread+0x118/0x11c
[ 0.394140] ret_from_fork+0x10/0x20
[ 0.394150] Code: 2a0203f5 2a0303f6 a90363f7 aa0003f7 (b9401c20)
[ 0.394154] ---[ end trace 0000000000000000 ]---

[ 3.654117] Unable to handle kernel NULL pointer dereference at virtual address 000000000000001c
[ 3.663154] Mem abort info:
[ 3.666032] ESR = 0x0000000096000004
[ 3.669943] EC = 0x25: DABT (current EL), IL = 32 bits
[ 3.675417] SET = 0, FnV = 0
[ 3.678563] EA = 0, S1PTW = 0
[ 3.681792] FSC = 0x04: level 0 translation fault
[ 3.686808] Data abort info:
[ 3.689765] ISV = 0, ISS = 0x00000004, ISS2 = 0x00000000
[ 3.695399] CM = 0, WnR = 0, TnD = 0, TagAccess = 0
[ 3.700592] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[ 3.706050] [000000000000001c] user address but active_mm is swapper
[ 3.712576] Internal error: Oops: 0000000096000004 [#3] PREEMPT SMP
[ 3.719017] Modules linked in:
[ 3.722162] CPU: 6 PID: 11 Comm: kworker/u32:0 Tainted: G D 6.9.0-next-20240514-dirty #380
[ 3.732067] Hardware name: Google Kingoftown (DT)
[ 3.736904] Workqueue: events_unbound deferred_probe_work_func
[ 3.742907] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[ 3.750052] pc : iommu_dma_sync_sg_for_device+0x28/0x100
[ 3.755526] lr : __dma_sync_sg_for_device+0x28/0x4c
[ 3.760548] sp : ffff8000800ab0b0
[ 3.763953] x29: ffff8000800ab0b0 x28: ffff36ee43a6a000 x27: ffff36ee41012010
[ 3.771279] x26: ffff8000800ab2e8 x25: ffff36ee43a6a480 x24: ffffb54600bcf0e8
[ 3.778604] x23: ffff36ee41012010 x22: 0000000000000001 x21: 0000000000000000
[ 3.785928] x20: ffffb54600f3d718 x19: 0000000000000000 x18: ffffb54601c19c48
[ 3.793258] x17: 0000000000010108 x16: 0000000000000000 x15: 000000000000000c
[ 3.800589] x14: 0000000000000000 x13: ffff36ee40b962b0 x12: 0000000000000000
[ 3.807921] x11: 071c71c71c71c71c x10: 0000000000003fff x9 : ffff36ee43a6a469
[ 3.815254] x8 : ffff36ee4101cf04 x7 : 00000000ffffffff x6 : 0000000000000001
[ 3.822587] x5 : ffffb5460033a740 x4 : ffffb545ff50375c x3 : 0000000000000001
[ 3.829910] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff36ee41012010
[ 3.837234] Call trace:
[ 3.839750] iommu_dma_sync_sg_for_device+0x28/0x100
[ 3.844853] __dma_sync_sg_for_device+0x28/0x4c
[ 3.849517] spi_transfer_one_message+0x378/0x6e4
[ 3.854360] __spi_pump_transfer_message+0x190/0x4a4
[ 3.859462] __spi_sync+0x2a0/0x3c4
[ 3.863048] spi_sync+0x30/0x54
[ 3.866283] spi_mem_exec_op+0x26c/0x41c
[ 3.870321] spi_nor_read_id+0x7c/0xc4
[ 3.874180] spi_nor_detect+0x34/0x158
[ 3.878039] spi_nor_scan+0x1f0/0xef8
[ 3.881813] spi_nor_probe+0x94/0x2ec
[ 3.885587] spi_mem_probe+0x6c/0xac
[ 3.889262] spi_probe+0x84/0xe4
[ 3.892579] really_probe+0xbc/0x2a0
[ 3.896262] __driver_probe_device+0x78/0x12c
[ 3.900747] driver_probe_device+0x40/0x160
[ 3.905046] __device_attach_driver+0xb8/0x134
[ 3.909619] bus_for_each_drv+0x84/0xe0
[ 3.913568] __device_attach+0xa8/0x1b0
[ 3.917515] device_initial_probe+0x14/0x20
[ 3.921814] bus_probe_device+0xa8/0xac
[ 3.925761] device_add+0x590/0x750
[ 3.929351] __spi_add_device+0x138/0x208
[ 3.933476] of_register_spi_device+0x394/0x57c
[ 3.938139] spi_register_controller+0x394/0x760
[ 3.942888] qcom_qspi_probe+0x328/0x390
[ 3.946928] platform_probe+0x68/0xd8
[ 3.950701] really_probe+0xbc/0x2a0
[ 3.954384] __driver_probe_device+0x78/0x12c
[ 3.958869] driver_probe_device+0x40/0x160
[ 3.963169] __device_attach_driver+0xb8/0x134
[ 3.967734] bus_for_each_drv+0x84/0xe0
[ 3.971682] __device_attach+0xa8/0x1b0
[ 3.975628] device_initial_probe+0x14/0x20
[ 3.979927] bus_probe_device+0xa8/0xac
[ 3.983873] deferred_probe_work_func+0x88/0xc0
[ 3.988536] process_one_work+0x154/0x298
[ 3.992663] worker_thread+0x304/0x408
[ 3.996525] kthread+0x118/0x11c
[ 3.999847] ret_from_fork+0x10/0x20
[ 4.003534] Code: 2a0203f5 2a0303f6 a90363f7 aa0003f7 (b9401c20)
[ 4.009788] ---[ end trace 0000000000000000 ]---
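
For readers cross-checking the "Mem abort info" blocks above, here is a minimal sketch of how the ESR value splits into the fields the kernel prints. It assumes the Armv8-A ESR_ELx layout for data aborts and only names the translation-fault status codes. The function and table names are hypothetical.

```python
# Fault status codes for translation faults only; other codes are reported as "other".
FSC_NAMES = {
    0x04: "level 0 translation fault",
    0x05: "level 1 translation fault",
    0x06: "level 2 translation fault",
    0x07: "level 3 translation fault",
}


def decode_esr(esr: int) -> dict:
    """Split an ESR_ELx value for a data abort into the fields shown in the oops."""
    iss = esr & 0x1FFFFFF  # instruction specific syndrome, bits 24:0
    return {
        "EC": (esr >> 26) & 0x3F,   # exception class, bits 31:26
        "IL": (esr >> 25) & 1,      # 1 means a 32-bit instruction
        "ISS": iss,
        "WnR": (iss >> 6) & 1,      # 0 means the abort was caused by a read
        "FSC": FSC_NAMES.get(iss & 0x3F, "other"),
    }


for value in (0x96000004, 0x96000006):
    print(hex(value), decode_esr(value))
```

Both ESR values in this thread decode to EC = 0x25 (data abort from the current EL) with WnR = 0, i.e. a faulting read. They differ only in the translation level at which the page table walk failed.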

Searching on lore I could only find the following series that caused another
regression, and its subsequent fix:
https://lore.kernel.org/lkml/20240507112026.1803778-1-aleksander.lobakin@xxxxxxxxx/
https://lore.kernel.org/all/20240509144616.938519-1-aleksander.lobakin@xxxxxxxxx/

But even after reverting both, the issue was still there, so I've concluded
they are unrelated.

Thanks,
Nícolas

#regzbot introduced: next-20240509

[1] https://pastebin.com/raw/sx4bPAa6