在 2024/9/8 3:28, Mårten Lindahl 写道:Hi Zhihao Cheng!
If a reboot/shutdown signal with double force (-ff) is triggered when
the erase worker or wear-leveling worker function runs we may end up in
a race condition since the MTD device gets a reboot notification and
suspends the nand flash before the erase or wear-leveling is done. This
will reject all accesses to the flash with -EBUSY.
Sequence for the erase worker function:
systemctl reboot -ff ubi_thread
do_work
__do_sys_reboot
blocking_notifier_call_chain
mtd_reboot_notifier
nand_shutdown
nand_suspend
__erase_worker
ubi_sync_erase
mtd_erase
nand_erase_nand
# Blocked by suspended chip
nand_get_device
=> EBUSY
Similar sequence for the wear-leveling function:
systemctl reboot -ff ubi_thread
do_work
__do_sys_reboot
blocking_notifier_call_chain
mtd_reboot_notifier
nand_shutdown
nand_suspend
wear_leveling_worker
ubi_eba_copy_leb
ubi_io_write
mtd_write
nand_write_oob
# Blocked by suspended chip
nand_get_device
=> EBUSY
systemd-shutdown[1]: Rebooting.
ubi0 error: ubi_io_write: error -16 while writing 2048 bytes to PEB
CPU: 1 PID: 82 Comm: ubi_bgt0d Kdump: loaded Tainted: G O
(unwind_backtrace) from [<80107b9f>] (show_stack+0xb/0xc)
(show_stack) from [<8033641f>] (dump_stack_lvl+0x2b/0x34)
(dump_stack_lvl) from [<803b7f3f>] (ubi_io_write+0x3ab/0x4a8)
(ubi_io_write) from [<803b817d>] (ubi_io_write_vid_hdr+0x71/0xb4)
(ubi_io_write_vid_hdr) from [<803b6971>] (ubi_eba_copy_leb+0x195/0x2f0)
(ubi_eba_copy_leb) from [<803b939b>] (wear_leveling_worker+0x2ff/0x738)
(wear_leveling_worker) from [<803b86ef>] (do_work+0x5b/0xb0)
(do_work) from [<803b9ee1>] (ubi_thread+0xb1/0x11c)
(ubi_thread) from [<8012c113>] (kthread+0x11b/0x134)
(kthread) from [<80100139>] (ret_from_fork+0x11/0x38)
Exception stack(0x80c43fb0 to 0x80c43ff8)
...
ubi0 error: ubi_dump_flash: err -16 while reading 2048 bytes from PEB
ubi0 error: wear_leveling_worker: error -16 while moving PEB 246 to PEB
ubi0 warning: ubi_ro_mode.part.0: switch to read-only mode
...
ubi0 error: do_work: work failed with error code -16
ubi0 error: ubi_thread: ubi_bgt0d: work failed with error code -16
Yes, I noticed these types of messages too before kernel v5.18. Since commit 013e6292aaf5e4b0("mtd: rawnand: Simplify the locking"), the behavior of nand_get_device() is changed. A process who is invoking nand_get_device() during rebooting won't be stucked, it will get an EBUSY error code, that's why we see the above messages from UBI module.Thanks for identifying this! Yes, the device I'm testing runs v5.15. I can't upgrade it to newer kernels so I backported a lot of ubi patches from mainline, but it seems I missed commit 8cba323437a49a4 which indeed seems to solve the problem.
After commit 8cba323437a49a4("mtd: rawnand: protect access to rawnand devices while in suspend"), the behavior of nand_get_device() is changed back. A process who is invoking nand_get_device() during rebooting will be stucked again, so there should be no error messages in UBI layer.
So, is your kernel version lower than v5.18?