[PATCH] sector_t overflow in block layer

From: Andreas Dilger
Date: Thu May 18 2006 - 14:59:38 EST


On May 18, 2006 11:23 +0100, Stephen C. Tweedie wrote:
> On Wed, 2006-05-17 at 17:58 -0600, Andreas Dilger wrote:
> > > the next question then is should we fail mounting of any >2TB fs
> > > w/o CONFIG_LBD or it would be better to mount IFF the fs has
> > > no allocated blocks beyond 2TB?
> >
> > This could actually be the primary source of the > 2TB filesystem
> > corruption problem that I've seen reported occasionally. The question
> > is if CONFIG_LBD isn't enabled will the kernel silently truncate
> > block-layer offsets?
>
> It appears so, yes. bread() goes through submit_bh(), which converts
> blocknr to sector with:
>
> bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
>
> without checking for overflow.
>
> > This would seem to be a critical kernel bug that should be addressed in
> > the block subsystem to prevent use of block devices > 2TB if sector_t
> > is only 32 bits.
>
> I think the arguments are a little less strong about restricting block
> device access in such cases, but certainly filesystem access needs to be
> properly checked. We need to check the *filesystem* size, not the
> device size (they are usually the same but they don't strictly have to
> be), both on mount and on attempted resize.

At least the submit_bh() code should return an error in this case
regardless of who the caller is. If ext[23] and other filesystems
check this themselves at mount time that is of course also prudent.

Patch has not been more than compile tested but it is a pretty serious
problem dating back a long time so I'm posting it for review anyways.

There are several ways to check for the 64-bit overflow, the efficiency
of which depends on the architecture. We could alternately shift b_blocknr
by >> 41 or change the mask, for example. The primary concern is really
the 32-bit overflow and resulting silent and massive data corruption
when CONFIG_LBD isn't enabled.

It is coincidental that mke2fs will clobber the primary group's metadata
(bitmaps, inode table) when formatting such a filesystem, but since this
data is identical at format time (group 0 offsets and group 16384 offsets
modulo 2TB are the same) it is very hard to detect before corruption hits.

==================== buffer_overflow.diff ===============================
--- ./fs/buffer.c.orig 2006-05-18 12:15:45.000000000 -0600
+++ ./fs/buffer.c 2006-05-18 12:09:20.000000000 -0600
@@ -2758,12 +2784,32 @@ static int end_bio_bh_io_sync(struct bio
int submit_bh(int rw, struct buffer_head * bh)
{
struct bio *bio;
+ unsigned long long sector;
int ret = 0;

BUG_ON(!buffer_locked(bh));
BUG_ON(!buffer_mapped(bh));
BUG_ON(!bh->b_end_io);

+ /* Check if we overflow sector_t when computing the sector offset. */
+ sector = (unsigned long long)bh->b_blocknr * (bh->b_size >> 9);
+#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
+ if (unlikely(sector != (sector_t)sector))
+#else
+ if (unlikely(((bh->b_blocknr >> 32) * (bh->b_size >> 9)) >=
+ 0xffffffff00000000ULL))
+#endif
+ {
+ printk(KERN_ERR "IO past maximum addressable sector"
+#if !defined(CONFIG_LBD) && BITS_PER_LONG == 32
+ "- CONFIG_LBD not enabled"
+#endif
+ "\n");
+ buffer_io_error(bh);
+
+ return -EOVERFLOW;
+ }
+
if (buffer_ordered(bh) && (rw == WRITE))
rw = WRITE_BARRIER;

@@ -2780,7 +2828,7 @@ int submit_bh(int rw, struct buffer_head
*/
bio = bio_alloc(GFP_NOIO, 1);

- bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
+ bio->bi_sector = sector;
bio->bi_bdev = bh->b_bdev;
bio->bi_io_vec[0].bv_page = bh->b_page;
bio->bi_io_vec[0].bv_len = bh->b_size;


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/