RE: Some thoughts on the loop device

Ray Van Tassle-CRV004 (Ray_Van_Tassle-CRV004@email.mot.com)
Wed, 6 Nov 1996 11:14:19 -0600


> To: linux-kernel@vger.rutgers.edu@INTERNET
>
> Some thoughts on the loop device :
>
> (BTW I'm a newbie to this list and to kernel stuff in general. So a
> lot of this stuff could already have been discussed to death, or
> unnecessary, or wrong, or just plain stupid :)
I've been going over the loop device, and spent lots of time finding & fixing
an obscure bug that locks up the whole system. While fixing this & trying to
improve performance, I've tried plenty of things...

>
> The loop device uses 2 buffers on the buffer cache for each disk
> block. The first buffer (allocated by filesystem code calling bread)
> will contain unencrypted data, and the second, allocated by
> do_lo_request, contains encrypted data read from the disk. The chances
> of this second buffer being used by someone again are pretty low, and
> double-buffering takes place.
Yes, no, and yes. The two buffers are like siamese twins--joined at the
hip--whatever happens to one also happens to the other.

> In case of unencrypted filesystems, both
> buffers will contain the same data.
Yes, but so what? 1) For encrypted, they contain different data. 2) For
different fs types (e.g., ext2fs (loop) on dosfs (real device)), they contain
different data. 3) For non-encrypted, same fs type, they are identical; but
why would anybody do this? This will rarely be the case; most instances will
be #1 and/or #2.

>
> Using getblk and ll_rw_block in do_lo_request looks much cleaner, but
> IMHO it would be better to have a permanently kmalloced BLOCK_SIZE
> bytes which are used for reading/writing to disk. What is needed is a
> function like ll_rw_block, but which takes a char * argument rather
> than struct buffer_head * (something like ll_rw_swap_file). This
> function can be used in do_lo_request to read/write a block from disk
> to the kmalloced scratch buffer. Thus there's only 1 (unencrypted) buffer
> on the cache corresponding to 1 physical disk block.
No, you need one (real-disk) block for every loop block to which I/O is
being done concurrently.
Consider a multiblock read. Pgm requests read for (say) 5 loop blocks.
Each of these becomes a read of the corresponding real block. This can be
handled by one (maybe two, depending on timing) physical I/O requests.
Worse is write. Pgm writes into loop blocks, these dirty buffers pile up on
the dirty-list. Eventually bdflush wakes up, and posts a write for all
these dirty buffers (up to 500). In an instant, the system has 1000 write
I/O operations, 500 loop and 500 real. We don't want to single-thread all
these through _one_ preallocated block buffer, do we?

>
> Here is an alternative idea for implementation of loop :
>
> 1. Trap requests to the loop device in ll_rw_block and call a loop_map
> function (like the md code).
> 2. If it is _not_ a crypted fs, loop_map will call bmap, get the blockno on
> actual device and change the real device (bh[i]->b_rdev) to point to the
> actual device. So the do_request is called on the actual device. (again,
> like md does it)
I did this. It works, but only if the block sizes (loop and real) are
identical and there is no encryption, but...
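
A rough user-space model of the remapping step, for illustration only. The
struct and field names (fake_bh, fake_loop, b_rblock, file_map) are made up
here and are not the kernel's; only b_rdev comes from the proposal above. It
assumes equal block sizes and treats real block 0 as a hole, the way bmap
reports one.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down stand-in for struct buffer_head: only what the remap touches. */
struct fake_bh {
    int  b_dev;      /* device the caller asked for (the loop device) */
    long b_blocknr;  /* block number within the loop device */
    int  b_rdev;     /* device the request is finally queued on */
    long b_rblock;   /* block number on that device (hypothetical field) */
};

/* Hypothetical per-loop-device state. */
struct fake_loop {
    int         real_dev;   /* device holding the backing file */
    int         encrypted;  /* nonzero if data must be transformed */
    const long *file_map;   /* file block -> real block, 0 = hole */
    size_t      nblocks;    /* number of entries in file_map */
};

/*
 * Returns 1 if the request was redirected to the real device,
 * 0 if it must still go through the loop driver (encrypted or a hole),
 * -1 if the block number is out of range.
 */
int loop_map(const struct fake_loop *lo, struct fake_bh *bh)
{
    assert(lo != NULL && bh != NULL);

    if (bh->b_blocknr < 0 || (size_t)bh->b_blocknr >= lo->nblocks)
        return -1;

    if (lo->encrypted || lo->file_map[bh->b_blocknr] == 0) {
        /* Leave the request on the loop device. */
        bh->b_rdev = bh->b_dev;
        bh->b_rblock = bh->b_blocknr;
        return 0;
    }

    /* Plain data, same block size: point the request at the real device. */
    bh->b_rdev = lo->real_dev;
    bh->b_rblock = lo->file_map[bh->b_blocknr];
    return 1;
}
```

Note that only one buffer exists in the redirected case, which is exactly why
it cannot cope with encryption or with differing block sizes.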

> 3. If it _is_ a crypted fs, loop_map calls bmap, etc. etc. BUT leaves the real
> device part as MAJOR_LOOP. So do_lo_request is called which reads
> the block using ll_rw_blk (actual device, etc.) or whatever, and does the
> crypto stuff.
Either you do the crypto while moving the data from one buffer to another,
or you do it in-place, in the same buffer. But in-place means that the buffer
has to change its identity (device & block number) according to which
data is in it. But what if other pgms are accessing that block? Say it's
holding part of an executable program. Depending on timing, nasty things
can happen! Consider two programs continually writing data to the same
block. Don't consider this while you are eating, though!

> 4. This ought to optimise access to uncrypted filesystems. Unfortunately,
> because of the hack in ll_rw_block, it can no longer be an
> independent module.
Not a big deal, I got around this trivially.
>
> Also, we should be able to add code to boot off a looped filesystem.
> Like this :
> You have loadlin, zImage and a large file containing an ext2 root fs in your
> DOS machine. You want to try out linux and don't want to mess with partitions.
> Linux boots with /dev/hda1 (MSDOS) as the root fs, finds the file
> C:\LINUX\ROOT loops it to /dev/loop1, then unmounts the root (DOS), mounts
> /dev/loop1 as the root and maybe mounts /dev/hda1 on /bootfs. Would be a
> neat trick :) (initrd does something like this). This could be an alternative
> to UMSDOS as a way of demonstrating Linux on DOS, and with a *real* filesystem.
This is *exactly* what I wanted to do when I started on the journey.
Another problem is performance. Mapping each block on the fly is a big
performance hit. I fixed this by building a mapping array for the entire
file, when the loop device is associated with the real file. Have sent this
patch to Linus, but haven't seen it show up yet (as of 2.1.6).
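
To make the idea concrete, here is a minimal user-space sketch of a mapping
array built once at setup time. It is not the patch itself: bmap_fn,
block_map_build, block_map_lookup and demo_bmap are hypothetical names, and
demo_bmap is a toy layout standing in for the real filesystem's bmap.

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for the filesystem's bmap(): file block -> device block, 0 = hole. */
typedef long (*bmap_fn)(void *inode, long file_block);

struct block_map {
    long *real_block;  /* real_block[i] = device block holding file block i */
    long  nblocks;
};

/* Walk the whole file once, when the loop device is bound to it. */
int block_map_build(struct block_map *m, void *inode, long nblocks, bmap_fn bmap)
{
    long i;

    assert(m != NULL && bmap != NULL && nblocks > 0);
    m->real_block = malloc((size_t)nblocks * sizeof *m->real_block);
    if (m->real_block == NULL)
        return -1;
    for (i = 0; i < nblocks; i++)
        m->real_block[i] = bmap(inode, i);
    m->nblocks = nblocks;
    return 0;
}

/* Per-request lookup is now a bounds check and an array index. */
long block_map_lookup(const struct block_map *m, long file_block)
{
    if (file_block < 0 || file_block >= m->nblocks)
        return 0;
    return m->real_block[file_block];
}

void block_map_free(struct block_map *m)
{
    free(m->real_block);
    m->real_block = NULL;
    m->nblocks = 0;
}

/* Toy layout: a file stored in two fragments on the device. */
long demo_bmap(void *inode, long file_block)
{
    (void)inode;
    return file_block < 4 ? 500 + file_block : 900 + file_block;
}
```

The trade-off is memory (one entry per file block) against a bmap call on
every request; the map also goes stale if the file is changed behind the loop
device's back.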

But again, you must be careful. It's easy to solve the
same-block-size/non-crypto variation, but that is useless, because nobody
would ever do that. N.B., dosfs has 512-byte blocks, ext2fs wants 1024- or
2048-byte blocks; I don't think ext2 even allows 512-byte blocks.
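
The arithmetic behind the block-size mismatch, as a small sketch. The names
(block_span, loop_to_file_blocks) are invented for illustration, and it assumes
the loop block size is a whole multiple of the real one.

```c
#include <assert.h>

/* A run of consecutive blocks *within the backing file*. */
struct block_span {
    long first;  /* first file block, in real-fs block units */
    long count;  /* how many real-fs blocks make up one loop block */
};

/* Which blocks of the backing file hold a given loop block? */
struct block_span loop_to_file_blocks(long loop_block, long loop_bsize,
                                      long real_bsize)
{
    struct block_span s;

    assert(loop_block >= 0 && real_bsize > 0);
    assert(loop_bsize >= real_bsize && loop_bsize % real_bsize == 0);

    s.count = loop_bsize / real_bsize;
    s.first = loop_block * s.count;
    return s;
}
```

With 1024-byte loop blocks on a 512-byte filesystem, loop block 3 is file
blocks 6 and 7. Each of those has to be mapped to the disk separately and the
two need not be adjacent there, so a single buffer cannot simply be pointed at
the real device.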

> Also you might want to add a 'contiguous' option if your root file on dos is
> defragmented and on contiguous disk blocks to simplify loop_map, giving
> almost the same performance as if it were on a separate partition.
Good idea, but very difficult to enforce upon dos. My mapping patch
automatically gets 99% of the expected gain, and still works well in the
non-optimal case.
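
One way to judge how close a given file is to the 'contiguous' case, sketched
in user-space C: count the contiguous runs in a file-block to disk-block
table. count_extents is a hypothetical helper, not something taken from the
patch, and the table is assumed to have been filled in beforehand.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count runs of consecutive disk blocks in a mapping table.
 * A result of 1 means the file is fully contiguous on disk.
 */
size_t count_extents(const long *real_block, size_t nblocks)
{
    size_t i, extents;

    assert(real_block != NULL || nblocks == 0);
    if (nblocks == 0)
        return 0;

    extents = 1;
    for (i = 1; i < nblocks; i++) {
        /* A new run starts wherever the next block is not the successor. */
        if (real_block[i] != real_block[i - 1] + 1)
            extents++;
    }
    return extents;
}
```

A defragmented file gives one run; a lightly fragmented one gives a handful,
and a table lookup costs the same either way.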

>
> Comments invited ...
>
> -- Ganesh
> (ganesh@cse.iitb.ernet.in)
>
> ----------------------------------------------------------------------------
> Any sufficiently advanced bug is indistinguishable from a feature.
> ----------------------------------------------------------------------------