Re: True fsync() in Linux (on IDE)

From: Peter Zaitsev
Date: Fri Mar 19 2004 - 15:08:26 EST


On Fri, 2004-03-19 at 11:36, Hans Reiser wrote:

> mysql fsync()'s a file, which it thinks guarantees that all of a mysql
> transaction has reached disk. The disk write caches it. You let fsync
> return. It is not on disk. mysql performs its mysql commit, and writes
> a mysql commit record which reaches disk, but not all of the transaction
> is on disk. The system crashes. mysql plays the log. mysql has
> internal corruption. User calls Peter. Peter asks, what do you expect
> when you use a piece of shit like reiserfs? User doesn't care about our
> internal squabbling and goes back to using windows which does proper
> commits.

This is right.

We have had some unexplained data corruption in InnoDB which could be
explained by a broken fsync(), but in most cases the scenario is less
gloomy: users just do not see some of the last committed transactions
when they test durability by cutting the power. That is still not good
enough for critical applications, however.
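
To make the ordering in the quoted scenario concrete, here is a minimal
sketch of a commit that relies on fsync(). The function name, the log
layout and the "COMMIT" marker are made up for illustration and are not
InnoDB's actual log format.

```python
import os


def commit(log_path: str, payload: bytes) -> None:
    """Append a transaction body, then a commit marker, syncing after each.

    Hypothetical log layout for illustration only.
    """
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, payload)       # transaction body
        os.fsync(fd)                # body is assumed durable once this returns
        os.write(fd, b"COMMIT\n")   # commit record
        os.fsync(fd)                # only now is the commit acknowledged
    finally:
        os.close(fd)
```

If the drive's write cache acknowledges the first fsync() before the
data is really on the platter, the commit record can reach the disk
while part of the body does not, which is exactly the case described
above.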

However, this is only thanks to an extra precaution InnoDB takes. It uses
a "double write buffer", which basically means each page is first written
to a small page-based log file, and only afterwards written to its
proper place on the disk. We have to do this even with a proper fsync()
implementation, as it is still possible to crash in the middle of an
fsync() (or a synchronous write), which results in a partial page write.
Think, for example, of the case where a page crosses a stripe boundary
on RAID.
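
A rough sketch of the idea, not InnoDB's real code: the page size, the
single-slot layout, the trailing CRC32 and all function names below are
assumptions chosen to keep the example short. It uses os.pread and
os.pwrite, so it is Unix only.

```python
import os
import zlib

PAGE_SIZE = 16384  # assumed page size for this sketch


def seal(body: bytes) -> bytes:
    """Pad body to a full page and append a CRC32 of the padded body."""
    if len(body) > PAGE_SIZE - 4:
        raise ValueError("body does not fit in one page")
    body = body.ljust(PAGE_SIZE - 4, b"\0")
    return body + zlib.crc32(body).to_bytes(4, "big")


def checksum_ok(page: bytes) -> bool:
    """True if the trailing CRC32 matches the rest of the page."""
    return zlib.crc32(page[:-4]).to_bytes(4, "big") == page[-4:]


def write_page(dw_fd: int, data_fd: int, page_no: int, page: bytes) -> None:
    # 1. Write the page (tagged with its number) to the doublewrite slot.
    os.pwrite(dw_fd, page_no.to_bytes(8, "big") + page, 0)
    os.fsync(dw_fd)
    # 2. Only afterwards write it to its proper place in the data file.
    os.pwrite(data_fd, page, page_no * PAGE_SIZE)
    os.fsync(data_fd)


def recover(dw_fd: int, data_fd: int) -> bool:
    """Repair a torn page from the doublewrite slot; True if repaired."""
    slot = os.pread(dw_fd, 8 + PAGE_SIZE, 0)
    if len(slot) < 8 + PAGE_SIZE:
        return False
    page_no, copy = int.from_bytes(slot[:8], "big"), slot[8:]
    if not checksum_ok(copy):
        # The crash hit step 1, so the data file was never touched.
        return False
    current = os.pread(data_fd, PAGE_SIZE, page_no * PAGE_SIZE)
    if len(current) == PAGE_SIZE and checksum_ok(current):
        return False  # page in the data file is intact
    os.pwrite(data_fd, copy, page_no * PAGE_SIZE)
    os.fsync(data_fd)
    return True
```

At any moment at least one of the two copies is whole, so a crash in
the middle of either write can be repaired at startup. The price is
that every page is written twice.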


If the file system guaranteed atomicity of write() calls (synchronous
ones would be enough), we could disable the double write buffer and gain
a good deal of performance.



--
Peter Zaitsev, Senior Support Engineer
MySQL AB, www.mysql.com

Meet the MySQL Team at User Conference 2004! (April 14-16, Orlando,FL)
http://www.mysql.com/uc2004/
