regression: data corruption with ext4 on LUKS on nvme with torvalds master

From: Alex Xu (Hello71)
Date: Sat May 08 2021 - 13:54:59 EST


Hi all,

Using torvalds master, I recently encountered data corruption on my ext4
volume on LUKS on NVMe. Specifically, during heavy writes, the system
partially hangs; SysRq-W shows that processes are blocked in the kernel
on I/O. After forcibly rebooting, chunks of files are replaced with
other, unrelated data. I'm not sure exactly what the data is; some of it
is unknown binary data, but in at least one case, a list of file paths
was inserted into a file, indicating that the data is misdirected after
encryption.

This issue appears to affect files that were being written around the
time of the hang, but it affects both new and old data in those files:
for example, my shell history file was corrupted in entries dating back
many months.
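
One way to find out which files were altered would be to compare
checksums recorded under a known-good kernel against checksums taken
after a hang and reboot. The following is only a minimal sketch in
Python under that assumption; the function names (sha256_of, snapshot,
changed_files) are hypothetical, and it only flags files whose content
differs, without telling legitimate modifications from corruption.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def snapshot(root):
    """Map each regular file under `root` (by relative path) to its digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.isfile(full) and not os.path.islink(full):
                result[os.path.relpath(full, root)] = sha256_of(full)
    return result


def changed_files(before, after):
    """Return sorted paths present in both snapshots whose digests differ."""
    return sorted(p for p in before if p in after and before[p] != after[p])
```

The idea would be to call snapshot() on a directory of files that are
not expected to change, store the result somewhere off the affected
volume, and pass it to changed_files() together with a fresh snapshot
after the reboot.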

The drive reports no SMART issues.

I believe this is a regression in the kernel related to something merged
in the last few days, as it consistently occurs with my most recent
kernel versions, but disappears when reverting to an older kernel.

I haven't investigated further, such as by bisecting. I hope this is
sufficient information to give someone a lead on the issue and, if it is
a bug, to nail it down before anybody else loses data.

Regards,
Alex.