Re: intermediate summary of ext3-2.4-0.9.4 thread

From: Anton Altaparmakov (aia21@cam.ac.uk)
Date: Thu Aug 02 2001 - 18:55:06 EST


At 00:11 03/08/2001, Matthias Andree wrote:
>On Thu, 02 Aug 2001, Andreas Dilger wrote:
> > > open -> asynchronous, but filename synched on fsync()
> > > rename/link/unlink(/symlink) -> synchronous
> > >
> > > This way, you never need to fsync() the directory, so you never sync()
> > > entries of temporary files. You never lose important files (because the
> > > application uses fsync() and the OS synchs rename/link etc.).
> >
> > Do you read what you are writing? How can a "synchronous" operation for
> > rename/link/unlink/symlink NOT also write out "temporary" files in the
> > same directory? How does calling fsync() on the directory IF YOU REQUIRE
> > SYNCHRONOUS DIRECTORY OPERATIONS differ from making the specific operations
> > synchronous from within the kernel???
>
>Can people please try to understand? Can people please start to THINK
>before flaming?

But we DO understand. I think you should calm down, take a deep breath,
count to 10 and then continue...

>Thus, if the kernel does rename/link synchronously, you'd never ever
>fsync() a directory. To synch a filename to disk, you'd just fsync() the
>filedescriptor (with a SUS compliant system, that is, i. e. ext3 or
>reiserfs, but not ext2).

>Now, if someone opens a temporary file, and nukes it later -- unlink()
>--, and doesn't want it visible, he never calls fsync() for the file.

Unfortunately, your argument contains a fallacy, which I believe comes from
overlooking the fact that file names are stored in their parent directories
and _not_ in the files themselves. So if you fsync() a filedescriptor and
want the name belonging to that filedescriptor synced to disk as well, the
ONLY possible way to do this is to sync the parent directory, which is what
commits the file name to disk. On some file systems it may be possible to
optimize this so the sync affects only part of the directory rather than
all of it (e.g. NTFS treats directories as lots of 4 KiB records, so you
could sync just the necessary record and nothing else in the simple case
where you modified a record without side effects), but on other file
systems this may well not be possible.

To summarize: basically you want the directory to be synced inside the
fsync() of the filedescriptor (with the advantage of being able to optimize
the directory sync to be a partial directory sync), while others want you
to explicitly sync the directory filedescriptor afterwards.
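
The second position (explicitly sync the directory afterwards) can be
sketched in C roughly as follows. This is only a minimal sketch, assuming a
POSIX system where fsync() on a directory filedescriptor is supported;
create_durably() is a hypothetical helper name, not something from this
thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper (not from this thread): create `name` inside `dir`,
 * write `data` to it, and push both the contents and the directory entry
 * to disk. Returns 0 on success, -1 on any error. */
int create_durably(const char *dir, const char *name, const char *data)
{
    int rc = -1;
    int fd = -1;
    size_t len = strlen(data);
    size_t off = 0;

    /* A directory can be opened read-only and handed to fsync(). */
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        goto out;

    while (off < len) {
        ssize_t n = write(fd, data + off, len - off);
        if (n < 0)
            goto out;
        off += (size_t)n;
    }

    /* Step 1: file contents and inode. */
    if (fsync(fd) != 0)
        goto out;

    /* Step 2: the name lives in the parent directory, so sync that too. */
    if (fsync(dfd) != 0)
        goto out;

    rc = 0;
out:
    if (fd >= 0)
        close(fd);
    close(dfd);
    return rc;
}
```

Under the first position, step 2 would happen inside the kernel as part of
step 1; either way the directory has to reach the disk.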

>In case you haven't noticed, this is about reliability without need to
>fsync() the directory that doesn't all belong to your single, stupid
>process but may have lots of asynchronous data of other processes -
>temporary files for instance. You synch() that as well, which is
>unnecessary and brings down other processes' performance.

That is impossible, see my explanation above, but to emphasize again: if
you write out the file name of one file in a directory, you have to write
all of them (unless you can optimize, but even then you will be writing more
than one file name), including the temporary files that you would like
not to be synced to disk. Remember that you have to write out whole blocks
at once to the device; you can't selectively sync part of a device block.
That is _physically_ impossible with today's hardware, never mind that it
would leave your on-disk directory structure in an inconsistent state.
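
To make the block-granularity point concrete, here is a toy model in C. It
is not how any real file system lays out directories: the 512-byte block,
the fixed 32-byte entries and all the names (dir_block, dir_add,
dir_sync_entry, dir_on_disk) are assumptions made up for illustration.

```c
#include <string.h>

#define BLOCK_SIZE 512                      /* toy device block size (assumption) */
#define NAME_LEN   32                       /* toy fixed-size entry (assumption) */
#define ENTRIES    (BLOCK_SIZE / NAME_LEN)  /* 16 names per block */

/* One directory block; an empty slot has names[i][0] == '\0'. */
struct dir_block {
    char names[ENTRIES][NAME_LEN];
};

/* Put `name` into the first free slot of the in-memory block.
 * Returns the slot index, or -1 if the name does not fit or the block is full. */
int dir_add(struct dir_block *mem, const char *name)
{
    size_t len = strlen(name);
    if (len == 0 || len >= NAME_LEN)
        return -1;
    for (int i = 0; i < ENTRIES; i++) {
        if (mem->names[i][0] == '\0') {
            memcpy(mem->names[i], name, len + 1);
            return i;
        }
    }
    return -1;
}

/* "Sync" the entry in `slot`. The device only accepts whole blocks, so the
 * entire block is copied. Returns how many names ended up on disk, or -1
 * if `slot` is out of range or empty. */
int dir_sync_entry(const struct dir_block *mem, struct dir_block *disk, int slot)
{
    int written = 0;
    if (slot < 0 || slot >= ENTRIES || mem->names[slot][0] == '\0')
        return -1;
    memcpy(disk, mem, sizeof *disk);
    for (int i = 0; i < ENTRIES; i++)
        if (disk->names[i][0] != '\0')
            written++;
    return written;
}

/* Returns 1 if `name` is present in the on-disk copy, 0 otherwise. */
int dir_on_disk(const struct dir_block *disk, const char *name)
{
    for (int i = 0; i < ENTRIES; i++)
        if (strcmp(disk->names[i], name) == 0)
            return 1;
    return 0;
}
```

If a temporary name and an important name share a block, asking for only
the important one to be synced still puts the temporary name on disk.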

As for temp files not being written out: the only way to do that is to
introduce a new open() flag such as O_TEMP (or something), which
causes the files to be treated specially by the file system so that they
never get committed to disk at all and exist in memory only. This would
optimize those away from the directory sync, but it would make the file
system code more complex. It would in fact probably be better to implement
this within the VFS itself and hide the files from the FS altogether
(haven't thought this through...just an idea).
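
For comparison, the closest thing an application can do without any such
flag is the usual create-then-unlink idiom, sketched below. To be clear,
this is NOT the O_TEMP idea: the name still goes into the directory for a
moment, so a directory sync in that window would still write it out. The
helper name open_scratch() is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical helper: create a scratch file from a mkstemp() template
 * (a writable string ending in "XXXXXX") and remove its name right away.
 * Returns an open filedescriptor, or -1 on error. The data stays reachable
 * through the descriptor until it is closed. */
int open_scratch(char *path_template)
{
    int fd = mkstemp(path_template);
    if (fd < 0)
        return -1;

    /* Drop the directory entry; the open descriptor keeps the file alive. */
    if (unlink(path_template) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

This narrows the time during which the temporary name exists in the
directory, but it does not hide the file from the file system the way the
flag described above would.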

Best regards,

Anton

-- 
   "Nothing succeeds like success." - Alexandre Dumas
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Linux NTFS Maintainer / WWW: http://linux-ntfs.sf.net/
ICQ: 8561279 / WWW: http://www-stu.christs.cam.ac.uk/~aia21/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/


