Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

From: Ming Lei
Date: Mon Dec 23 2019 - 08:09:03 EST


On Mon, Dec 23, 2019 at 12:22:45PM +0100, Andrea Vai wrote:
> Il giorno mer, 18/12/2019 alle 17.48 +0800, Ming Lei ha scritto:
> > On Wed, Dec 18, 2019 at 09:25:02AM +0100, Andrea Vai wrote:
> > > Il giorno gio, 12/12/2019 alle 05.33 +0800, Ming Lei ha scritto:
> > > > On Wed, Dec 11, 2019 at 11:07:45AM -0500, Theodore Y. Ts'o
> > wrote:
> > > > > On Wed, Dec 11, 2019 at 12:00:58PM +0800, Ming Lei wrote:
> > > > > > I didn't reproduce the issue in my test environment, and
> > follows
> > > > > > Andrea's test commands[1]:
> > > > > >
> > > > > > mount UUID=$uuid /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > SECONDS=0
> > > > > > cp $testfile /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > > umount /mnt/pendrive 2>&1 |tee -a $logfile
> > > > > >
> > > > > > The 'cp' command supposes to open/close the file just once,
> > > > however
> > > > > > ext4_release_file() & write pages is observed to run for
> > 4358
> > > > times
> > > > > > when executing the above 'cp' test.
> > > > >
> > > > > Why are we sure the ext4_release_file() / _fput() is coming
> > from
> > > > the
> > > > > cp command, as opposed to something else that might be running
> > on
> > > > the
> > > > > system under test? _fput() is called by the kernel when the
> > last
> > > >
> > > > Please see the log:
> > > >
> > > >
> > https://lore.kernel.org/linux-scsi/3af3666920e7d46f8f0c6d88612f143ffabc743c.camel@xxxxxxxx/2-log_ming.zip
> > > >
> > > > Which is collected by:
> > > >
> > > > #!/bin/sh
> > > > MAJ=$1
> > > > MIN=$2
> > > > MAJ=$(( $MAJ << 20 ))
> > > > DEV=$(( $MAJ | $MIN ))
> > > >
> > > > /usr/share/bcc/tools/trace -t -C \
> > > > 't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d",
> > args-
> > > > >rwbs, args->sector, args->nr_sector' \
> > > > 't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d",
> > args-
> > > > >rwbs, args->sector, args->nr_sector'
> > > >
> > > > $MAJ:$MIN points to the USB storage disk.
> > > >
> > > > From the above IO trace, there are two write paths, one is from
> > cp,
> > > > another is from writeback wq.
> > > >
> > > > The stackcount trace[1] is consistent with the IO trace log
> > since it
> > > > only shows two IO paths, that is why I concluded that the write
> > done
> > > > via
> > > > ext4_release_file() is from 'cp'.
> > > >
> > > > [1]
> > > >
> > https://lore.kernel.org/linux-scsi/320b315b9c87543d4fb919ecbdf841596c8fbcea.camel@xxxxxxxx/2-log_ming_20191129_150609.zip
> > > >
> > > > > reference to a struct file is released. (Specifically, if you
> > > > have a
> > > > > fd which is dup'ed, it's only when the last fd corresponding
> > to
> > > > the
> > > > > struct file is closed, and the struct file is about to be
> > > > released,
> > > > > does the file system's f_ops->release function get called.)
> > > > >
> > > > > So the first question I'd ask is whether there is anything
> > else
> > > > going
> > > > > on the system, and whether the writes are happening to the USB
> > > > thumb
> > > > > drive, or to some other storage device. And if there is
> > something
> > > > > else which is writing to the pendrive, maybe that's why no one
> > > > else
> > > > > has been able to reproduce the OP's complaint....
> > > >
> > > > OK, we can ask Andrea to confirm that via the following trace,
> > which
> > > > will add pid/comm info in the stack trace:
> > > >
> > > > /usr/share/bcc/tools/stackcount blk_mq_sched_request_inserted
> > > >
> > > > Andrew, could you collect the above log again when running
> > new/bad
> > > > kernel for confirming if the write done by ext4_release_file()
> > is
> > > > from
> > > > the 'cp' process?
> > >
> > > You can find the stackcount log attached. It has been produced by:
> > >
> > > - /usr/share/bcc/tools/stackcount blk_mq_sched_request_inserted >
> > trace.log
> > > - wait some seconds
> > > - run the test (1 copy trial), wait for the test to finish, wait
> > some seconds
> > > - stop the trace (ctrl+C)
> >
> > Thanks for collecting the log, looks your 'stackcount' doesn't
> > include
> > comm/pid info, seems there is difference between your bcc and
> > my bcc in fedora 30.
> >
> > Could you collect above log again via the following command?
> >
> > /usr/share/bcc/tools/stackcount -P -K t:block:block_rq_insert
> >
> > which will show the comm/pid info.
>
> ok, attached (trace_20191219.txt), the test (1 trial) took 3684
> seconds.

>From the above trace:

b'blk_mq_sched_request_inserted'
b'blk_mq_sched_request_inserted'
b'dd_insert_requests'
b'blk_mq_sched_insert_requests'
b'blk_mq_flush_plug_list'
b'blk_flush_plug_list'
b'io_schedule_prepare'
b'io_schedule'
b'rq_qos_wait'
b'wbt_wait'
b'__rq_qos_throttle'
b'blk_mq_make_request'
b'generic_make_request'
b'submit_bio'
b'ext4_io_submit'
b'ext4_writepages'
b'do_writepages'
b'__filemap_fdatawrite_range'
b'ext4_release_file'
b'__fput'
b'task_work_run'
b'exit_to_usermode_loop'
b'do_syscall_64'
b'entry_SYSCALL_64_after_hwframe'
b'cp' [19863]
4400

So this write is clearly from 'cp' process, and it should be one
ext4 fs issue.

Ted, can you take a look at this issue?

>
> > > I also tried the usual test with btrfs and xfs. Btrfs behavior
> > looks
> > > "good". xfs seems sometimes better, sometimes worse, I would say.
> > I
> > > don't know if it matters, anyway you can also find the results of
> > the
> > > two tests (100 trials each). Basically, btrfs is always between 68
> > and
> > > 89 seconds, with a cyclicity (?) with "period=2 trials". xfs looks
> > > almost always very good (63-65s), but sometimes "bad" (>300s).
> >
> > If you are interested in digging into this one, the following trace
> > should be helpful:
> >
> > https://lore.kernel.org/linux-scsi/f38db337cf26390f7c7488a0bc2076633737775b.camel@xxxxxxxx/T/#m5aa008626e07913172ad40e1eb8e5f2ffd560fc6
> >
>
> Attached:
> - trace_xfs_20191223.txt (7 trials, then aborted while doing the 8th),
> times to complete:
> 64s
> 63s
> 64s
> 833s
> 1105s
> 63s
> 64s

oops, looks we have to collect io insert trace with the following bcc script
on xfs for confirming if there is similar issue with ext4, could you run
it again on xfs? And only post the trace done in case of slow 'cp'.


#!/bin/sh

MAJ=$1
MIN=$2
MAJ=$(( $MAJ << 20 ))
DEV=$(( $MAJ | $MIN ))

/usr/share/bcc/tools/trace -t -C \
't:block:block_rq_issue (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector' \
't:block:block_rq_insert (args->dev == '$DEV') "%s %d %d", args->rwbs, args->sector, args->nr_sector'


Thanks,
Ming