RFC: from FIBMAP to FIONDEV

Werner Almesberger (almesber@lrc.di.epfl.ch)
Fri, 11 Jun 1999 10:48:29 +0200


About two years ago, when there was much discussion about reiserfs, md,
and other advanced file system structures, it became apparent that good
old FIBMAP wasn't sufficient. So I proposed a successor for it called
FIPHYLOC. Fortunately, this one never made it into the mainstream
kernel ;-)

Now, following some discussions at Linux Expo, I have a much improved
design. It adds the following new features (with respect to FIBMAP):

- works with byte granularity, so it can handle files with data starting
at a non-zero offset in blocks (e.g. "block splitting" in reiserfs and
SGI's XFS)
- also returns the device number, so it can handle files spread over
multiple block devices (e.g. md, LVM)
- returns areas of consecutive bytes, not individual block addresses, so
fewer ioctls are required
- can report various anomalies, such as unstable or ambiguous storage
(see below; with FIBMAP, the caller has to know every single special
case)

I've attached a detailed description. I'll work on a kernel patch and the
corresponding extension of LILO when this has stabilized.

Opinions welcome.

- Werner

----- Items to be added to linux/fs.h -----------------------------------------

#define FIONDEV _IOW(0x00,3,struct fiondev_result)
/* get the location of the contiguous area on the underlying block
device, which contains data of the file or block device, starting at
the current file position. */

/*
* Flags returned by FIONDEV
*/

#define FIONDEV_ERROR 1 /* request failed, no data was returned */
#define FIONDEV_UNSTABLE 2 /* location is temporary (e.g. scratch file,
RAM disk) */
#define FIONDEV_COPIES 4 /* location is replicated */
#define FIONDEV_MISALIGNED 8 /* current file position it not at a boundary */
#define FIONDEV_FINAL 16 /* "device" is a physical device requiring no
further translation */

struct fiondev_result {
int flags; /* FIONDEV_* flags */
dev_t device; /* underlying block device */
loff_t start; /* byte offset in block device */
loff_t size; /* area size (in bytes); 0 at EOF; may exceed
file or block device size */
};

-------------------------------------------------------------------------------

FIONDEV can be applied to files and to block devices. It returns the
location of data at the current file position on the underlying block
device. The information returned describes a consecutive area on the
block device, determined by its start and its size. FIONDEV advances
the file pointer to the end of this data area.

FIONDEV can be applied recursively to obtain the actual physical
location of a file, e.g. file -> loop device -> (file) -> md (RAID) ->
SCSI disk. "(file)" is implicitly done by the loop device.

Note that the size returned may actually exceed the size of the file
or the block device being queried. It is the responsibility of the
caller to use only information that corresponds to actually available
data areas. Likewise, the size returned be may less than the actual
contiguous area. (E.g. in the case of consecutive extents.)

If the current file position is at the end of the file or device,
"size" is set to zero. The value of the fields "device" and "start"
is undefined in this case.

Unmapped areas ("holes") should be returned with "device" and "start"
set to zero.

Block devices with no underlying devices should set FIONDEV_FINAL,
return their own device number in the device field, the current file
position in "start", and a value corresponding at least to the size
of the device in "size".

FIONDEV may return the following flags:

FIONDEV_ERROR
The content of the other fields in the result structure is
undefined. Additional flags may be set to indicate the failure
reason. The current file position may be undefined. No further
FIONDEV ioctls should be issued before closing the file.

FIONDEV_UNSTABLE
Location is on unstable storage, e.g. in a designated temporary
file, a RAM disk, etc. Boot loaders must not map such areas.
Files or block devices that are completely unsuitable for mapping
should set FIONDEV_ERROR in addition to FIONDEV_UNSTABLE.

FIONDEV_COPIES
Data is contained in multiple areas, e.g. replicated on several
volumes in a RAID array. A boot loader may use this information,
but by doing so, it may not preserve all of the semantics of the
storage. (E.g. in the case of a RAID, the RAID storage itself may
survive the failure of some disks, but a boot loader may fail if
the disks it mapped are among the ones that failed.)

If there are multiple locations at which the physically same data
block can be found, FIONDEV_COPIES should only be set if none of
them can be relied to remain valid. Otherwise, the most long-lived
location should be returned.

FIONDEV_MISALIGNED
The current position is not at the beginning of a contiguous data
area on the underlying block device. Setting this flag is optional.
However, if no data is returned as a result of this condition (i.e.
if FIONDEV_ERROR is set), FIONDEV_MISALIGNED must be set to
indicate the reason.

FIONDEV_FINAL
No further resolution is necessary on the device returned. This
flag must be set by all physical devices when FIONDEV is invoked on
them. It may also be set on the last translation step leading to a
physical device, thereby eliminating a redundant FIONDEV ioctl on
the physical device.

If the physical location of a file or device changes during mapping,
subsequent FIONDEV ioctls may fail. Note however that extensively
probing for such a condition would be a wasted effort, because changes
could always occur right after mapping.

During migration from FIBMAP to FIONDEV (i.e. for the next few years),
any errno-style errors returned by FIONDEV should be interpreted as lack
of support for FIONDEV. In order to ease migration, FIONDEV should only
return a negative value (i.e. report an errno-style error) for
conditions that cannot be described in more detail using flags.

-- 
  _________________________________________________________________________
 / Werner Almesberger, DI-ICA,EPFL,CH   werner.almesberger@lrc.di.epfl.ch /
/_IN_R_131__Tel_+41_21_693_6621__Fax_+41_21_693_6610_____________________/

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.rutgers.edu Please read the FAQ at http://www.tux.org/lkml/