Re: [PATCH] mmc: dw_mmc: Fix IDMAC operation with pages bigger than 4K
From: Sam Protsenko
Date: Tue Apr 02 2024 - 18:43:53 EST
On Thu, Mar 7, 2024 at 1:52 AM Arnd Bergmann <arnd@xxxxxxxx> wrote:
>
> On Thu, Mar 7, 2024, at 00:20, Sam Protsenko wrote:
> > Commit 616f87661792 ("mmc: pass queue_limits to blk_mq_alloc_disk") [1]
> > revealed the long living issue in dw_mmc.c driver, existing since the
> > time when it was first introduced in commit f95f3850f7a9 ("mmc: dw_mmc:
> > Add Synopsys DesignWare mmc host driver."), also making kernel boot
> > broken on platforms using dw_mmc driver with 16K or 64K pages enabled,
> > with this message in dmesg:
> >
> > mmcblk: probe of mmc0:0001 failed with error -22
> >
> > That's happening because mmc_blk_probe() fails when it calls
> > blk_validate_limits() consequently, which returns the error due to
> > failed max_segment_size check in this code:
> >
> > /*
> > * The maximum segment size has an odd historic 64k default that
> > * drivers probably should override. Just like the I/O size we
> > * require drivers to at least handle a full page per segment.
> > */
> > ...
> > if (WARN_ON_ONCE(lim->max_segment_size < PAGE_SIZE))
> > return -EINVAL;
> >
> > In case when IDMAC (Internal DMA Controller) is used, dw_mmc.c always
> > sets .max_seg_size to 4 KiB:
> >
> > mmc->max_seg_size = 0x1000;
> >
> > The comment in the code above explains why it's incorrect. Arnd
> > suggested setting .max_seg_size to .max_req_size to fix it, which is
> > also what some other drivers are doing:
> >
> > $ grep -rl 'max_seg_size.*=.*max_req_size' drivers/mmc/host/ | \
> > wc -l
> > 18
>
> Nice summary!
>
> > This change is not only fixing the boot with 16K/64K pages, but also
> > leads to a better MMC performance. The linear write performance was
> > tested on E850-96 board (eMMC only), before commit [1] (where it's
> > possible to boot with 16K/64K pages without this fix, to be able to do
> > a comparison). It was tested with this command:
> >
> > # dd if=/dev/zero of=somefile bs=1M count=500 oflag=sync
> >
> > Test results are as follows:
> >
> > - 4K pages, .max_seg_size = 4 KiB: 94.2 MB/s
> > - 4K pages, .max_seg_size = .max_req_size = 512 KiB: 96.9 MB/s
> > - 16K pages, .max_seg_size = 4 KiB: 126 MB/s
> > - 16K pages, .max_seg_size = .max_req_size = 2 MiB: 128 MB/s
> > - 64K pages, .max_seg_size = 4 KiB: 138 MB/s
> > - 64K pages, .max_seg_size = .max_req_size = 8 MiB: 138 MB/s
>
> Thanks for sharing these results. From what I can see here, the
> performance changes significantly with the page size, but barely
> with the max_seg_size, so this does not have the effect I was
> hoping for. On a more positive note this likely means that we
> don't have to urgently backport your fix.
>
> This could mean that either there is not much coalescing across
> pages after all, or that the bottleneck is somewhere else.
>
> > diff --git a/drivers/mmc/host/dw_mmc.c b/drivers/mmc/host/dw_mmc.c
> > index 8e2d676b9239..cccd5633ff40 100644
> > --- a/drivers/mmc/host/dw_mmc.c
> > +++ b/drivers/mmc/host/dw_mmc.c
> > @@ -2951,8 +2951,8 @@ static int dw_mci_init_slot(struct dw_mci *host)
> > if (host->use_dma == TRANS_MODE_IDMAC) {
> > mmc->max_segs = host->ring_size;
> > mmc->max_blk_size = 65535;
> > - mmc->max_seg_size = 0x1000;
> > - mmc->max_req_size = mmc->max_seg_size * host->ring_size;
> > + mmc->max_req_size = DW_MCI_DESC_DATA_LENGTH * host->ring_size;
> > + mmc->max_seg_size = mmc->max_req_size;
>
> The change looks good to me.
>
> I see that the host->ring_size depends on PAGE_SIZE as well:
>
> #define DESC_RING_BUF_SZ PAGE_SIZE
> host->ring_size = DESC_RING_BUF_SZ / sizeof(struct idmac_desc_64addr);
> host->sg_cpu = dmam_alloc_coherent(host->dev,
> DESC_RING_BUF_SZ, &host->sg_dma, GFP_KERNEL);
>
> I don't see any reason for the ring buffer size to be tied to
> PAGE_SIZE at all, it was probably picked as a reasonable
> default in the initial driver but isn't necessarily ideal.
>
> From what I can see, the number of 4KB elements in the
> ring can be as small as 128 (4KB pages, 64-bit addresses)
> or as big as 4096 (64KB pages, 32-bit addresses), which is
> quite a difference. If you are still motivated to drill
> down into this, could you try changing DESC_RING_BUF_SZ
> to a fixed size of either 4KB or 64KB and test again
> with the opposite page size, to see if that changes the
> throughput?
>
Hi Arnd,
Sorry for the late reply. I'm a bit busy with something else right
now (trying to enable this same driver for Exynos850 in U-Boot, hehe),
but I'll try to carve out some time later and tinker with
DESC_RING_BUF_SZ. But for now, can we just apply this patch as is? As
I understand it, it's fixing quite a major issue (at least from what I
heard), so it would be nice to have it in -next and -stable. Does that
sound reasonable?
Thanks!
> If a larger ring buffer gives us significantly better
> throughput, we may want to always use a higher number
> independent of page size. On the other hand, if the
> 64KB number (the 138MB/s) does not change with a smaller
> ring, we may as well reduce that in order to limit the
> maximum latency that is caused by a single I/O operation.
>
> Arnd
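
For reference, the sizes discussed in this thread can be reproduced with a
small user-space sketch. This is not driver code: the helper names
(ring_entries, max_req_size, seg_size_ok) are hypothetical, and the
descriptor sizes of 32 and 16 bytes are assumptions inferred from the
128 and 4096 entry counts quoted above, not values read from dw_mmc.c.
The 0x1000 bytes per descriptor matches the old max_seg_size value and is
assumed to equal DW_MCI_DESC_DATA_LENGTH.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values, inferred from the numbers quoted in this thread. */
#define DESC_DATA_LENGTH 0x1000u /* bytes of data covered by one descriptor */
#define DESC_SIZE_64ADDR 32u     /* assumed size of a 64-bit address descriptor */
#define DESC_SIZE_32ADDR 16u     /* assumed size of a 32-bit address descriptor */

/* Number of descriptors that fit in a ring buffer of ring_buf_sz bytes. */
static size_t ring_entries(size_t ring_buf_sz, size_t desc_size)
{
	assert(desc_size != 0);
	return ring_buf_sz / desc_size;
}

/* Largest request the ring can describe, as in the patched driver code. */
static size_t max_req_size(size_t ring_buf_sz, size_t desc_size)
{
	return DESC_DATA_LENGTH * ring_entries(ring_buf_sz, desc_size);
}

/*
 * Simplified form of the block layer check quoted above: a segment
 * must be able to hold at least one full page.
 */
static int seg_size_ok(size_t max_seg_size, size_t page_size)
{
	return max_seg_size >= page_size;
}
```

With a ring buffer of one page and 64-bit descriptors, max_req_size()
gives 512 KiB, 2 MiB and 8 MiB for 4K, 16K and 64K pages, which matches
the .max_req_size values in the test results. seg_size_ok(0x1000, 16384)
returns 0, which corresponds to the -EINVAL seen with 16K pages before
the fix.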