Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6
From: Ming Lei
Date: Wed Dec 25 2019 - 21:27:37 EST
On Wed, Dec 25, 2019 at 12:17:22AM -0500, Theodore Y. Ts'o wrote:
> On Tue, Dec 24, 2019 at 09:27:07AM +0800, Ming Lei wrote:
> > The ext4_release_file() should be run from read() or write() syscall if
> > Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected behavior
> > for ext4_release_file() to be run thousands of times when just
> > running 'cp' once, see comment of ext4_release_file():
>
> What's your evidence of that? As opposed to the writeback taking a
> long time, leading to the *one* call of ext4_release_file taking a
> long time? If it's a big file, we might very well be calliing
> ext4_writepages multiple times, from a single call to
> __filemap_fdatawrite_range().
>
> You confused mightily from that assertion, and that caused me to make
> assumptions that cp was doing something crazy. But I'm quite conviced
> now that this is almost certainly not what is happening.
>
> > > I suspect the next step is use a blktrace, to see what kind of I/O is
> > > being sent to the USB drive, and how long it takes for the I/O to
> > > complete. You might also try to capture the output of "iostat -x 1"
> > > while the script is running, and see what the difference might be
> > > between a kernel version that has the problem and one that doesn't,
> > > and see if that gives us a clue.
> >
> > That isn't necessary, given we have concluded that the bad write
> > performance is caused by broken write order.
>
> I didn't see any evidence of that from what I had in my inbox, so I
> went back to the mailing list archives to figure out what you were
> talking about. Part of the problem is this has been a very
> long-spanning thread, and I had deleted from my inbox all of the parts
> relating to the MQ scheduler since that was clearly Not My Problem. :-)
>
> So, summarizing the most of the thread. The problem started when we
> removed the legacy I/O scheduler, since we are now only using the MQ
> scheduler. What the kernel is sending is long writes (240 sectors),
> but it is being sent as an interleaved stream of two sequential
> writes. This particular pendrive can't handle this workload, because
> it has a very simplistic Flash Translation Layer. Now, this is not
> *broken*, from a storage perspective; it's just that it's more than
> the simple little brain of this particular pen drive can handle.
>
> Previously, with a single queue, and specially since the queue depth
> supported by this pen drive is 1, the elevator algorithm would sort
> the I/O requests so that it would be mostly sequential, and this
> wouldn't be much of a problem. However, once the legacy I/O stack was
> removed, the MQ stack is designed so that we don't have to take a
> global lock in order to submit an I/O request. That also means that
> we can't do a full elevator sort since that would require locking all
> of the queues.
>
> This is not a problem, since HDD's generally have a 16 deep queue, and
> SSD's have a super-deep queue depth since they get their speed via
> parallel writes to different flash chips. Unfortunately, it *is* a
> problem for super primitive USB sticks.
>
> > So far, the reason points to the extra writeback path from exit_to_usermode_loop().
> > If it is not from close() syscall, the issue should be related with file reference
> > count. If it is from close() syscall, the issue might be in 'cp''s
> > implementation.
>
> Oh, it's probably from the close system call; and it's *only* from a
> single close system call. Because there is the auto delayed
Right. Looks I mis-interpreted the stackcount log, IOs are submitted
from single close syscall.
> allocation resolution to protect against buggy userspace, under
> certain circumstances, as I explained earlier, we force a full
> writeout on a close for a file decsriptor which was opened with an
> O_TRUNC. This is by *design*, since we are trying to protect against
> buggy userspace (application programmers vastly outnumber file system
> programmers, and far too many of them want O_PONY). This is Working
> As Intended.
>
> You can disable it by deleting the test file before the cp:
>
> rm -f /mnt/pendrive/$testfile
>
> Or you can disable the protection against stupid userspace by using
> the noauto_da_alloc mount option. (But then if you have a buggy game
> program which writes the top-ten score file by using open(2) w/
> O_TRUNC, and then said program closes the OpenGL library, and the
> proprietary 3rd party binary-only video driver wedges the X server
> requiring a hard reset to recover, and the top-ten score file becomes
> a zero-length file, don't come crying to me... Or if a graphical text
> editor forgets to use fsync(2) before saving a source file you spent
> hours working on, and then the system crashes at exactly the wrong
> moment and your source file becomes zero-length, against, don't come
> crying to me. Blame the stupid application programmer which wrote
> your text editor who decided to skip the fsync(2), or who decided that
> copying the ACL's and xattrs was Too Hard(tm), and so opening the file
> with O_TRUNC and rewriting the file in place was easier for the
> application programmer.)
>
> In any case, I think this is all working all as intended. The MQ I/O
> stack is optimized for modern HDD and SSD's, and especially SSD's.
> And the file system assumes that parallel sequential writes,
> especially if they are large, is really not a big deal, since that's
> what NCQ or massive parallelism of pretty much all SSD's want.
> (Again, ignoring the legacy of crappy flash drives.
>
> You can argue with storage stack folks about whether we need to have
> super-dumb mode for slow, crappy flash which uses a global lock and a
> global elevator scheduler for super-crappy flash if you want. I'm
> going to stay out of that argument.
As I mentioned in the following link:
https://lore.kernel.org/linux-scsi/20191224084721.GA27248@xxxxxxxxxx/
The reason is that ioc_batching and BDI congestion is removed by blk-mq.
Then after queue is congested, multiple sequential writes can be done
concurrently at the same time. Before ioc_batching and BDI congestion is
removed, writes are done serialized from multiple processes actually, so
IOs are dispatched to drive in strict sequential order.
This way can't be an issue for SSD.
Maybe we need to be careful for HDD., since the request count in scheduler
queue is double of in-flight request count, and in theory NCQ should only
cover all in-flight 32 requests. I will find a sata HDD., and see if
performance drop can be observed in the similar 'cp' test.
Thanks,
Ming