> Hello!
>
> On Fri 20-12-24 21:39:39, Baokun Li wrote:
> > On 2024/12/20 18:36, Jan Kara wrote:
> > > On Fri 20-12-24 14:07:55, libaokun@xxxxxxxxxxxxxxx wrote:
> > > > From: Baokun Li <libaokun1@xxxxxxxxxx>
> > > >
> > > > If we mount an ext4 fs with the data_err=abort option, it should
> > > > abort on file data write errors. But if the extent is unwritten, we
> > > > don't add the JI_WAIT_DATA bit to the inode, so jbd2 won't wait for
> > > > the inode's data to be written back and won't check the inode
> > > > mapping for errors. Data writeback failures therefore go unnoticed
> > > > unless the log is watched or fsync is called.
> > > >
> > > > Therefore, when data_err=abort is enabled, abort the journal when
> > > > an I/O error is detected in ext4_end_io_end(), to make users who
> > > > are concerned about the contents of the file happy.
> > > >
> > > > Signed-off-by: Baokun Li <libaokun1@xxxxxxxxxx>
> > > I'm not opposed to this change but I think we should better define
> > > the expectations around data_err=abort.
> >
> > Thank you for your review and feedback!
> >
> > Totally agree, the definition of this option is a bit vague right now.
> > Its semantics have changed implicitly over the kernel versions.
> >
> > Originally, v2.6.28-rc1 commit 5bf5683a33f3 ("ext4: add an option to
> > control error handling on file data") introduced "data_err=abort". The
> > implementation of this mount option relies on
> > JBD2_ABORT_ON_SYNCDATA_ERR, and this flag takes effect when the
> > journal_finish_inode_data_buffers() function returns an error. At that
> > point, in ordered mode, ext4_write_end() added the inode to the
> > ordered data list whether the write was an append or an overwrite.
> > Therefore all write failures in ordered mode would abort the journal.
> > This is also the semantics in the documentation: "Abort the journal if
> > an error occurs in a file data buffer in ordered mode."
> Well, that is not quite true. Normally, we run in delalloc mode and use
> ext4_da_write_end() to finish writes. Thus normally the inode was not
> added to the transaction's list of inodes to flush (since 3.8, where
> this behavior got implemented by commit f3b59291a69d ("ext4: remove
> calls to ext4_jbd2_file_inode() from delalloc write path")). Then commit
> 06bd3c36a733 ("ext4: fix data exposure after a crash") in 4.7 realized
> this is broken and fixed things to properly flush blocks when needed.

Yes, we inadvertently changed the behavior of "data_err=abort" when we
stopped adding inodes to the ordered list in the delalloc write path.

> Actually the data=ordered mode always guaranteed we will not expose
> stale data but never guaranteed all the written data will be flushed.

Yes, compared to the data=writeback mode, data=ordered can only guarantee
that stale data is never exposed.

> Thus data_err=abort always controlled "what should jbd2 do when it
> spots an error when flushing data" rather than any kind of guarantee
> that an IO error on any data writeback results in a filesystem abort.

I think this is the initial design problem of data_err=abort.

> After all, page writeback can easily try to flush the data before a
> transaction commit and hit an IO error; jbd2 then won't notice the
> problem (the page will already be clean), and it was always like that.

Good point! "data_err=abort" did have this problem before. If the data
has already been flushed before the transaction commit, the error is
never sensed.
> > > For example, the dependency on data=ordered is kind of strange, and
> > > the current semantics of data_err=abort are hard to understand for
> > > admins (since they are mostly implementation defined). For example,
> > > if an IO error happens on data overwrites, the filesystem will not
> > > be aborted because we don't bother tracking such data as ordered
> > > (for performance reasons). Since you've apparently talked to people
> > > using this option: what are their expectations about the option?
> >
> > As was the original intent of introducing "data_err=abort", users who
> > use this option are concerned about corruption of critical data
> > spreading silently; that is, they are concerned that the data
> > actually read does not match the data written.
>
> OK, so you really want any write IO error to result in a filesystem
> abort? Both page writeback and direct IO writes?

Direct I/O writes are okay because the inode size is updated after all
the data has been written.
> > But as you said, we don't track overwrite writes for performance
> > reasons. But compared to the poor performance of data=journal and the
> > risk of drop_caches exposing stale data, not being able to sense data
> > errors on overwrites is acceptable.
> >
> > After enabling "data_err=abort" in dioread_nolock mode, after
> > drop_caches or a remount the user will not see unexpected all-zero
> > data in the unwritten area, but rather the earlier consistent data,
> > and the data in the file is trustworthy, at the cost of losing some
> > trailing data.
> >
> > On the other hand, adding a new written extent and converting an
> > unwritten extent to written both expose data to the user, so the user
> > is concerned about whether the data is correct at that point.
> >
> > In general, I think we can update the semantics of "data_err=abort"
> > to "Abort the journal if the file fails to write back data on
> > extended writes in ordered mode". Do you have any thoughts on this?
> I agree it makes sense to make the semantics of data_err=abort more
> obvious. Based on the usecase you've described - i.e., rather take the
> filesystem down on a write IO error than risk returning old data later
> - it would make sense to me to also do this on direct IO writes. Also,
> I would do this regardless of data=writeback/ordered/journalled mode
> because, although users wanting data_err=abort behavior will also
> likely want the guarantees of data=ordered mode, these are two
> different things, and I can imagine use cases for setups with
> data=writeback and data_err=abort as well (e.g. for scratch filesystems
> which get recreated on each system startup).

For data=journal mode, the journal itself will abort when the data is
abnormal.
> Honza