Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

From: Theodore Y. Ts'o
Date: Wed Dec 25 2019 - 00:18:14 EST


On Tue, Dec 24, 2019 at 09:27:07AM +0800, Ming Lei wrote:
> The ext4_release_file() should be run from read() or write() syscall if
> Fedora 30's 'cp' is implemented correctly. IMO, it isn't expected behavior
> for ext4_release_file() to be run thousands of times when just
> running 'cp' once, see comment of ext4_release_file():

What's your evidence of that? As opposed to the writeback taking a
long time, leading to the *one* call of ext4_release_file taking a
long time? If it's a big file, we might very well be calliing
ext4_writepages multiple times, from a single call to
__filemap_fdatawrite_range().

You confused mightily from that assertion, and that caused me to make
assumptions that cp was doing something crazy. But I'm quite conviced
now that this is almost certainly not what is happening.

> > I suspect the next step is use a blktrace, to see what kind of I/O is
> > being sent to the USB drive, and how long it takes for the I/O to
> > complete. You might also try to capture the output of "iostat -x 1"
> > while the script is running, and see what the difference might be
> > between a kernel version that has the problem and one that doesn't,
> > and see if that gives us a clue.
>
> That isn't necessary, given we have concluded that the bad write
> performance is caused by broken write order.

I didn't see any evidence of that from what I had in my inbox, so I
went back to the mailing list archives to figure out what you were
talking about. Part of the problem is this has been a very
long-spanning thread, and I had deleted from my inbox all of the parts
relating to the MQ scheduler since that was clearly Not My Problem. :-)

So, summarizing the most of the thread. The problem started when we
removed the legacy I/O scheduler, since we are now only using the MQ
scheduler. What the kernel is sending is long writes (240 sectors),
but it is being sent as an interleaved stream of two sequential
writes. This particular pendrive can't handle this workload, because
it has a very simplistic Flash Translation Layer. Now, this is not
*broken*, from a storage perspective; it's just that it's more than
the simple little brain of this particular pen drive can handle.

Previously, with a single queue, and specially since the queue depth
supported by this pen drive is 1, the elevator algorithm would sort
the I/O requests so that it would be mostly sequential, and this
wouldn't be much of a problem. However, once the legacy I/O stack was
removed, the MQ stack is designed so that we don't have to take a
global lock in order to submit an I/O request. That also means that
we can't do a full elevator sort since that would require locking all
of the queues.

This is not a problem, since HDD's generally have a 16 deep queue, and
SSD's have a super-deep queue depth since they get their speed via
parallel writes to different flash chips. Unfortunately, it *is* a
problem for super primitive USB sticks.

> So far, the reason points to the extra writeback path from exit_to_usermode_loop().
> If it is not from close() syscall, the issue should be related with file reference
> count. If it is from close() syscall, the issue might be in 'cp''s
> implementation.

Oh, it's probably from the close system call; and it's *only* from a
single close system call. Because there is the auto delayed
allocation resolution to protect against buggy userspace, under
certain circumstances, as I explained earlier, we force a full
writeout on a close for a file decsriptor which was opened with an
O_TRUNC. This is by *design*, since we are trying to protect against
buggy userspace (application programmers vastly outnumber file system
programmers, and far too many of them want O_PONY). This is Working
As Intended.

You can disable it by deleting the test file before the cp:

rm -f /mnt/pendrive/$testfile

Or you can disable the protection against stupid userspace by using
the noauto_da_alloc mount option. (But then if you have a buggy game
program which writes the top-ten score file by using open(2) w/
O_TRUNC, and then said program closes the OpenGL library, and the
proprietary 3rd party binary-only video driver wedges the X server
requiring a hard reset to recover, and the top-ten score file becomes
a zero-length file, don't come crying to me... Or if a graphical text
editor forgets to use fsync(2) before saving a source file you spent
hours working on, and then the system crashes at exactly the wrong
moment and your source file becomes zero-length, against, don't come
crying to me. Blame the stupid application programmer which wrote
your text editor who decided to skip the fsync(2), or who decided that
copying the ACL's and xattrs was Too Hard(tm), and so opening the file
with O_TRUNC and rewriting the file in place was easier for the
application programmer.)

In any case, I think this is all working all as intended. The MQ I/O
stack is optimized for modern HDD and SSD's, and especially SSD's.
And the file system assumes that parallel sequential writes,
especially if they are large, is really not a big deal, since that's
what NCQ or massive parallelism of pretty much all SSD's want.
(Again, ignoring the legacy of crappy flash drives.

You can argue with storage stack folks about whether we need to have
super-dumb mode for slow, crappy flash which uses a global lock and a
global elevator scheduler for super-crappy flash if you want. I'm
going to stay out of that argument.

- Ted