Re: [PATCH] stat: fix inconsistency between struct stat and struct compat_stat

From: Linus Torvalds
Date: Tue Apr 12 2022 - 12:31:39 EST


On Mon, Apr 11, 2022 at 11:41 PM Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>
> Also, if the st_dev and st_rdev values are 32-bit, we don't have to use
> old_valid_dev to test if the value fits into them. This fixes -EOVERFLOW
> on filesystems that are on NVMe because NVMe uses the major number 259.

The problem with this part of the patch is that this:

> @@ -353,7 +352,7 @@ static int cp_new_stat(struct kstat *sta
> #endif
>
> INIT_STRUCT_STAT_PADDING(tmp);
> - tmp.st_dev = encode_dev(stat->dev);
> + tmp.st_dev = new_encode_dev(stat->dev);

completely changes the format of that st_dev field.

For completely insane historical reasons, we have had the rule that

- 32-bit architectures encode the device into a 16 bit value

- 64-bit architectures encode the device number into a 32 bit value

and that has been true *despite* the fact that the actual "st_dev"
field has been 32-bit and 64-bit respectively since 2003!

And it doesn't help that to confuse things even more, the _naming_ of
those "encode_dev" functions is "old and new", so that logically you'd
think that "cp_new_stat()" would use "new_encode_dev()". Nope.

So on 32-bit architectures, cp_new_stat() uses "old_encode_dev()",
which historically put the minor number in bits 0..7, and the major
number in bits 8..15.

End result: on a 32-bit system (or in the compat syscall mode),
changing to new_encode_dev() would confuse anybody (like just "ls -l
/dev") that uses that old stat call and tries to print out major/minor
numbers.

Now,. the good news is that

(a) nobody should use that old stat call, since the new world order
is called "stat64" and has been for a loooong time - also since at
least 2003)

(b) we could just hide the bits in upper bits instead.

So what I suggest we do is to make old_encode_dev() put the minor bits
in bits 0..7 _and_ 16..23, and the major bits in 8..15 _and_ 24..32.

And then the -EOVERFLOW should be something like

unsigned int st_dev = encode_dev(stat->dev);
tmp.st_dev = st_dev;
if (st_dev != tmp.st_dev)
return -EOVERFLOW;

for the lcase that tmp.st_dev is actually 16-bit (ie the compat case
for some architecture where the padding wasn't there?)

NOTE: That will still screw up 'ls -l' output, but only for the
devices that previously would have returned -EOVERFLOW.

And it will make anybopdy who does that "stat1->st_dev ==
stat2->st_dev && ino == ino2" thing for testing "same inode" work just
fine.

Linus