Re: [RFC] Heads up on sys_fallocate()

From: Anton Altaparmakov
Date: Sun Mar 04 2007 - 18:22:45 EST

Next message: Robert Hancock: "Re: CK804 SATA Errors (still got them)"
Previous message: Willy Tarreau: "Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler"
In reply to: Ulrich Drepper: "Re: [RFC] Heads up on sys_fallocate()"
Next in thread: Theodore Tso: "Re: [RFC] Heads up on sys_fallocate()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

On 4 Mar 2007, at 22:38, Ulrich Drepper wrote:

Anton Altaparmakov wrote:
And that is it. No zeroing needs to happen at all because we
have not updated the initialized size of the inode!

When you do it like this, who can the kernel/filesystem *guarantee* that
when the data is written there actually is room on the harddrive?

The blocks are allocated so of course it is guaranteed. Subsequent writes to this file will not generate any allocations thus allocations cannot fail. (-:

What you described seems like using truncate/ftruncate to increase the
file's size. That is not at all what posix_fallocate is for.
posix_fallocate must make sure that the requested blocks on the disk are
reserved (allocated) for the file's use and that at no point in the
future will, say, a msync() fail because a mmap(MAP_SHARED) page has
been written to.

No that is different. I described performing the allocations in the volume bitmap, i.e. for each allocated block the corresponding "in use" bit is set in the bitmap (NTFS uses a linear bitmap where byte 0, bit 0 == physical block 0 of volume, byte 0, bit 1 == physical block 1 of volume, ... byte 1, bit 0 == block 8 of volume, ...).

Also I described updating the extent map of the inode such that it describes the physical blocks as belonging to the file, thus you would have "logical file block X corresponds to physical block Y on volume" entries entered into the extent map of the inode and they would describe the just allocated blocks.

Finally I described updating the allocated size in the inode which basically says "there are that many bytes worth of blocks allocated to this inode".

And optionally I described updating the data size in the inode which basically says "this file has size Z bytes".

And I specifically did NOT update the initialized size in the inode thus it will remain at its old value thus all new allocated blocks will be considered as present but not initialized thus a read will always return zero whilst a write will do the right thing and pad with zeroes as necessary (if the write is smaller than the block size, etc).

Note that you are right that this is like truncate in NTFS for non- sparse enabled inodes/volumes.

But for sparse ones, instead of doing any allocations in the bitmap and entering them in the extent map, you would simply add a single entry to the extent map that says "X blocks allocated starting at logical block Y corresponding to no physical blocks, i.e. they are sparse". You would then also update the allocated size and data size as above and now you can even (but do not have to) update the initialized size to be equal to the data size as the file can be considered fully initialized because it is sparse. As an implementation detail this truncate operation would not modify the compressed size of the inode (i.e. the really used on-disk space, i.e. what you get from running "du" as that does not change when you add sparse blocks) whilst the fallocate described above would update the compressed size (if the file is sparse or compressed - there is no compressed size in the inode if the inode is not sparse/ compressed) because the file now occupies more blocks on disk even if they are actually not initialized.

Best regards,

Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Robert Hancock: "Re: CK804 SATA Errors (still got them)"
Previous message: Willy Tarreau: "Re: [ANNOUNCE] RSDL completely fair starvation free interactive cpu scheduler"
In reply to: Ulrich Drepper: "Re: [RFC] Heads up on sys_fallocate()"
Next in thread: Theodore Tso: "Re: [RFC] Heads up on sys_fallocate()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]