Re: [Regression] File corruptions on SSD in 1st M.2 socket of AsRock X600M-STX
From: Keith Busch
Date: Wed Jan 08 2025 - 10:08:08 EST
On Wed, Jan 08, 2025 at 03:38:53PM +0100, Thorsten Leemhuis wrote:
> [side note TWIMC: regression tracking is sadly kinda dormant temporarily
> (hopefully this will change again soon), but this was brought to my
> attention and looked kinda important]
>
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> Adrian, Christoph I noticed a report about a regression in
> bugzilla.kernel.org that appears to be caused by a change you two
> handled a while ago -- or it exposed an earlier problem:
>
> 3710e2b056cb92 ("nvme-pci: clamp max_hw_sectors based on DMA optimized
> limitation") [v6.4-rc3]
...
> > The bug is triggered by the patch "nvme-pci: clamp max_hw_sectors
> > based on DMA optimized limitation" (see
> > https://lore.kernel.org/linux-iommu/20230503161759.GA1614@xxxxxx/ )
> > introduced in 6.3.7
> >
> > To examine the situation, I added this debug info (all files are
> > located in `drivers/nvme/host`):
> >
> >> --- core.c.orig 2025-01-03 14:27:38.220428482 +0100
> >> +++ core.c 2025-01-03 12:56:34.503259774 +0100
> >> @@ -3306,6 +3306,7 @@
> >> max_hw_sectors = nvme_mps_to_sectors(ctrl, id->mdts);
> >> else
> >> max_hw_sectors = UINT_MAX;
> >> + dev_warn(ctrl->device, "id->mdts=%d, max_hw_sectors=%d, ctrl->max_hw_sectors=%d\n", id->mdts, max_hw_sectors, ctrl->max_hw_sectors);
> >> ctrl->max_hw_sectors =
> >> min_not_zero(ctrl->max_hw_sectors, max_hw_sectors);
> >
> > 6.3.6 (last version w/o mentioned patch and w/o data corruption) says:
> >
> >> [ 127.196212] nvme nvme0: id->mdts=7, max_hw_sectors=1024, ctrl->max_hw_sectors=16384
> >> [ 127.203530] nvme nvme0: allocated 40 MiB host memory buffer.
> >
> > 6.3.7 (first version w/ mentioned patch and w/ data corruption) says:
> >
> >> [ 46.436384] nvme nvme0: id->mdts=7, max_hw_sectors=1024, ctrl->max_hw_sectors=256
> >> [ 46.443562] nvme nvme0: allocated 40 MiB host memory buffer.
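
For anyone following along, the numbers in those two logs can be
reproduced with a rough userspace sketch of the limit calculation. This
is not the driver code: mdts_to_sectors(), min_not_zero_u32() and
effective_max_hw_sectors() are stand-in names made up for illustration,
and the page shift of 12 (CAP.MPSMIN = 0) is an assumption that happens
to match the reported mdts=7 -> 1024 sectors.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the MDTS-to-sectors conversion.
 * MDTS is a power-of-two multiple of the minimum page size; the result
 * is in 512-byte sectors. page_shift is assumed to be >= 12.
 * mdts == 0 means the controller reports no transfer size limit.
 */
static uint32_t mdts_to_sectors(uint32_t mdts, uint32_t page_shift)
{
	uint32_t shift;

	if (mdts == 0)
		return UINT32_MAX;

	shift = mdts + page_shift - 9;
	if (shift >= 32)
		return UINT32_MAX;	/* would overflow, treat as unlimited */

	return UINT32_C(1) << shift;
}

/* Smaller of two limits, where 0 means "no limit set". */
static uint32_t min_not_zero_u32(uint32_t a, uint32_t b)
{
	if (a == 0)
		return b;
	if (b == 0)
		return a;
	return a < b ? a : b;
}

/* Limit in effect after combining the transport limit with MDTS. */
static uint32_t effective_max_hw_sectors(uint32_t ctrl_limit, uint32_t mdts)
{
	/* page_shift = 12 is an assumption (CAP.MPSMIN = 0). */
	return min_not_zero_u32(ctrl_limit, mdts_to_sectors(mdts, 12));
}
```

With the values from the logs, effective_max_hw_sectors(16384, 7) gives
1024 sectors (512 KiB) on 6.3.6, and effective_max_hw_sectors(256, 7)
gives 256 sectors (128 KiB) on 6.3.7. So the only visible difference is
that the largest command shrinks by a factor of four.
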
It should always be okay to do smaller transfers as long as everything
stays aligned to the logical block size. I'm guessing the dma opt change
has exposed some other flaw in the nvme controller. For example, two
consecutive smaller writes may be hitting some controller-side caching bug
that a single larger transfer would have handled correctly. The host
could have sent such a sequence even with the patch reverted, but
happens to not be doing that in this particular test.
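
To make the "smaller transfers" point concrete, here is a minimal
sketch of how the per-command cap changes the number of commands for
one large write. It only models the max_hw_sectors limit; the real
block layer splits on other limits too, and nr_commands() and
splits_stay_aligned() are hypothetical helpers, not kernel functions.

```c
#include <stdint.h>

/*
 * Number of commands needed to transfer nr_sectors when each command
 * may carry at most max_hw_sectors (assumed non-zero).
 */
static uint32_t nr_commands(uint32_t nr_sectors, uint32_t max_hw_sectors)
{
	return nr_sectors / max_hw_sectors +
	       (nr_sectors % max_hw_sectors != 0);
}

/*
 * A split at multiples of max_hw_sectors keeps every piece aligned to
 * the logical block size if the cap itself is a multiple of the
 * logical block size (both given in 512-byte sectors).
 */
static int splits_stay_aligned(uint32_t max_hw_sectors, uint32_t lba_sectors)
{
	return max_hw_sectors % lba_sectors == 0;
}
```

For a 512 KiB write (1024 sectors), nr_commands(1024, 1024) is 1 with
the old limit and nr_commands(1024, 256) is 4 with the new one. Both
caps are multiples of a 4 KiB logical block (8 sectors), so the split
pieces stay aligned either way, which is why the smaller limit should
be harmless on a well-behaved controller.
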