Re: [PATCH] Write the inode itself in block_fsync()

From: OGAWA Hirofumi
Date: Fri Mar 10 2006 - 10:16:21 EST


Bart Samwel <bart@xxxxxxxxx> writes:

> Andrew Morton wrote:
>> Sam Vilain <sam@xxxxxxxxxx> wrote:
>>> OGAWA Hirofumi wrote:
> >>
>>> >For block device's inode, we don't write a inode's meta data
>>> >itself. But, I think we should write inode's meta data for fsync().
>>>
>>> Ouch... won't that halve performance of database transaction logs?
>>
>> Yes, it could well cause a lot more seeking to do atime and/or mtime
>> writes. Which aren't terribly important, really.
>>
>> Unless I'm missing something, I suspect we'd be better off without this,
>> even though it's a correctness fix :(
>
> Maybe atime/mtime aren't important, but I would be unhappy if a file
> size change wasn't written to disk on fsync.

Please don't worry, we should be doing a right thing for normal files
already. This patch is just for block device file.

> Anyway, shouldn't databases be using a combination of fixed-size files
> and fdatasync? fsync doesn't perform well by definition, and I guess the
> only reason databases still use it is because the kernel failed to
> implement the sucky part of the behaviour.

Yes, I agree. The changes of atime/mtime only sets I_DIRTY_SYNC, so,
usually this patch doesn't change fdatasync() at all.

Umm... however, I also can understand what akpm says.... check some databases.

berkeley db 4.4: use fdatasync() if available
mysql 5.0: use fdatasync() if available (innobase)
use fsync() (bdb)
postgresql: use fdatasync() if available
sqlite: use fsync

After all, I don't know.

> A complex but perhaps viable suggestion: as atime/mtime are stored
> on disk in second granularity (on ext3 at least, don't know about
> other fss), wouldn't it somehow be possible to only regard
> atime/mtime changes as real changes when i_(a|c)time.tv_sec changes?
> This would enable fsync to write the inode once every second instead
> of on every fsync. The performance drop would be much less dramatic
> than writing the inode on every fsync, and it would at least yield
> correct behaviour.

Yes, and we are already doing it.

Thanks.
--
OGAWA Hirofumi <hirofumi@xxxxxxxxxxxxxxxxxx>
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/