Re: [PATCH] ubi: wl: Close down wear-leveling before nand is suspended

From: Mårten Lindahl
Date: Mon Sep 09 2024 - 07:31:26 EST



On 9/9/24 05:20, Zhihao Cheng wrote:
On 2024/9/8 3:28, Mårten Lindahl wrote:
If a reboot/shutdown signal with double force (-ff) is triggered while
the erase worker or wear-leveling worker function runs, we may end up in
a race condition: the MTD device gets a reboot notification and
suspends the NAND flash before the erase or wear-leveling is done. All
subsequent accesses to the flash are then rejected with -EBUSY.

Sequence for the erase worker function:

    systemctl reboot -ff           ubi_thread

                                 do_work
  __do_sys_reboot
    blocking_notifier_call_chain
      mtd_reboot_notifier
        nand_shutdown
          nand_suspend
                                   __erase_worker
                                     ubi_sync_erase
                                       mtd_erase
                                         nand_erase_nand

                                           # Blocked by suspended chip
                                           nand_get_device
                                             => EBUSY

Similar sequence for the wear-leveling function:

    systemctl reboot -ff           ubi_thread

                                 do_work
  __do_sys_reboot
    blocking_notifier_call_chain
      mtd_reboot_notifier
        nand_shutdown
          nand_suspend
                                   wear_leveling_worker
                                     ubi_eba_copy_leb
                                       ubi_io_write
                                         mtd_write
                                           nand_write_oob

                                             # Blocked by suspended chip
                                             nand_get_device
                                               => EBUSY

  systemd-shutdown[1]: Rebooting.
  ubi0 error: ubi_io_write: error -16 while writing 2048 bytes to PEB
  CPU: 1 PID: 82 Comm: ubi_bgt0d Kdump: loaded Tainted: G           O
  (unwind_backtrace) from [<80107b9f>] (show_stack+0xb/0xc)
  (show_stack) from [<8033641f>] (dump_stack_lvl+0x2b/0x34)
  (dump_stack_lvl) from [<803b7f3f>] (ubi_io_write+0x3ab/0x4a8)
  (ubi_io_write) from [<803b817d>] (ubi_io_write_vid_hdr+0x71/0xb4)
  (ubi_io_write_vid_hdr) from [<803b6971>] (ubi_eba_copy_leb+0x195/0x2f0)
  (ubi_eba_copy_leb) from [<803b939b>] (wear_leveling_worker+0x2ff/0x738)
  (wear_leveling_worker) from [<803b86ef>] (do_work+0x5b/0xb0)
  (do_work) from [<803b9ee1>] (ubi_thread+0xb1/0x11c)
  (ubi_thread) from [<8012c113>] (kthread+0x11b/0x134)
  (kthread) from [<80100139>] (ret_from_fork+0x11/0x38)
  Exception stack(0x80c43fb0 to 0x80c43ff8)
  ...
  ubi0 error: ubi_dump_flash: err -16 while reading 2048 bytes from PEB
  ubi0 error: wear_leveling_worker: error -16 while moving PEB 246 to PEB
  ubi0 warning: ubi_ro_mode.part.0: switch to read-only mode
  ...
  ubi0 error: do_work: work failed with error code -16
  ubi0 error: ubi_thread: ubi_bgt0d: work failed with error code -16

Hi Zhihao Cheng!
Yes, I noticed these types of messages too before kernel v5.18. Since commit 013e6292aaf5e4b0 ("mtd: rawnand: Simplify the locking"), the behavior of nand_get_device() changed: a process that calls nand_get_device() during reboot no longer blocks; it gets an -EBUSY error code instead, which is why we see the above messages from the UBI module.
After commit 8cba323437a49a4 ("mtd: rawnand: protect access to rawnand devices while in suspend"), the behavior of nand_get_device() was changed back: a process that calls nand_get_device() during reboot blocks again, so there should be no error messages in the UBI layer.
So, is your kernel version lower than v5.18?
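
The two nand_get_device() behaviors described above can be modeled with a minimal userspace sketch. It is an illustration only, not the kernel code: struct model_chip and all model_* functions are hypothetical names, and pthread primitives stand in for the kernel's locking. The fail-fast variant corresponds to the -EBUSY behavior, the blocking variant to the behavior where the caller waits until the chip is resumed.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a NAND chip; not the kernel's struct nand_chip. */
struct model_chip {
	pthread_mutex_t lock;
	pthread_cond_t resumed;
	bool suspended;
};

static void model_chip_init(struct model_chip *c)
{
	pthread_mutex_init(&c->lock, NULL);
	pthread_cond_init(&c->resumed, NULL);
	c->suspended = false;
}

/* Fail-fast model: a suspended chip rejects the caller with -EBUSY.
 * On success (return 0) the lock is held until model_release_device(). */
static int model_get_device_failfast(struct model_chip *c)
{
	pthread_mutex_lock(&c->lock);
	if (c->suspended) {
		pthread_mutex_unlock(&c->lock);
		return -EBUSY;
	}
	return 0;
}

/* Blocking model: the caller sleeps until the chip is resumed.
 * Returns with the lock held. */
static void model_get_device_blocking(struct model_chip *c)
{
	pthread_mutex_lock(&c->lock);
	while (c->suspended)
		pthread_cond_wait(&c->resumed, &c->lock);
}

static void model_release_device(struct model_chip *c)
{
	pthread_mutex_unlock(&c->lock);
}

/* Marks the chip suspended, as the reboot path does in the traces above. */
static void model_suspend(struct model_chip *c)
{
	pthread_mutex_lock(&c->lock);
	c->suspended = true;
	pthread_mutex_unlock(&c->lock);
}

/* Clears the suspended flag and wakes any blocked callers. */
static void model_resume(struct model_chip *c)
{
	pthread_mutex_lock(&c->lock);
	c->suspended = false;
	pthread_cond_broadcast(&c->resumed);
	pthread_mutex_unlock(&c->lock);
}
```

In this model, a worker that calls model_get_device_failfast() after model_suspend() gets -EBUSY and has to handle the error, whereas a worker that calls model_get_device_blocking() simply never proceeds if no resume follows, which would match the absence of error messages during reboot.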

Thanks for identifying this! Yes, the device I'm testing runs v5.15. I can't upgrade it to a newer kernel, so I backported a lot of UBI patches from mainline, but it seems I missed commit 8cba323437a49a4, which indeed seems to solve the problem.

Again, thank you very much!

Kind regards

Mårten