Re: 2.4 and 2GB swap partition limit

From: Matti Aarnio (matti.aarnio@zmailer.org)
Date: Mon Mar 05 2001 - 09:52:41 EST


On Mon, Mar 05, 2001 at 09:58:00AM +0100, Christoph Rohland wrote:
> Hi Matt,
>
> On Sun, 4 Mar 2001, Matt Domsch wrote:
> > My concern is that if there continues to be a 2GB swap
> > partition/file size limitation, and you can have (as currently
> > #defined) 8 swap partitions, you're limited to 16GB swap, which then
> > follows a max of 8GB RAM. We'd like to sell servers with 32GB or
> > 64GB RAM to customers who request such for their applications. Such
> > customers generally have no problem purchasing additional disks to
> > be used for swap, likely on a hardware RAID controller.
>
> I did think about that too and I also think the 2GB limit is not
> appropriate for the big servers. But I do not believe that you need so
> much swap on these machines. If you drive a 32 GB machine so heavily
> into swap it is more busy finding the pages to swap than doing
> anything really interesting. (At least that's my experience)

        2.4 did grow the maximum file size to CACHEPAGESIZE*2G bytes
        by scaling the file page offsets held in the page cache by
        CACHEPAGESIZE, which may or may not be the same as the machine
        PAGE_SIZE. It can be larger, though probably not smaller.

        For swap there is a similar 32-bit quantity (actually it is
        an "unsigned long"), but some bits of the swp_entry_t are
        used for something useful in the cache logic besides the
        offset inside the file.

        Indeed, looking at the source, include/asm-i386/pgtable.h
        defines the following:

#define SWP_TYPE(x) (((x).val >> 1) & 0x3f)
#define SWP_OFFSET(x) ((x).val >> 8)
#define SWP_ENTRY(type, offset) ((swp_entry_t){ ((type) << 1)|((offset) << 8) })

        The i386 actually supports up to 4*16 = 64 swap files (or partitions)
        with this SWP_TYPE() definition, while include/linux/swap.h
        defines MAX_SWAPFILES to be 8 ... If that were a pointer array
        to kmalloc()ed blocks, the limit could be much higher. Indeed
        I think this is the only *static* limit anywhere in the current
        swap code.
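
        To make the bit layout concrete, here is a minimal sketch that
        round-trips a (type, offset) pair through the three macros quoted
        above. The swp_entry_t typedef is a stand-in that assumes a
        32-bit value as on i386, and swp_roundtrip_ok() is a hypothetical
        helper of mine, not anything from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the kernel type; assumes a 32-bit value as on i386. */
typedef struct { uint32_t val; } swp_entry_t;

/* The three macros as quoted from include/asm-i386/pgtable.h. */
#define SWP_TYPE(x) (((x).val >> 1) & 0x3f)
#define SWP_OFFSET(x) ((x).val >> 8)
#define SWP_ENTRY(type, offset) ((swp_entry_t){ ((type) << 1)|((offset) << 8) })

/* Hypothetical helper: pack the pair, unpack it, and report whether
 * both fields survived. Returns 1 on success, 0 if bits were lost. */
int swp_roundtrip_ok(uint32_t type, uint32_t offset)
{
    swp_entry_t e = SWP_ENTRY(type, offset);
    return SWP_TYPE(e) == type && SWP_OFFSET(e) == offset;
}
```

        With this layout, type 63 and offset 0xFFFFFF are the largest
        values that survive; type 64 or offset 2^24 no longer fit.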

        Similarly it supports 2^24 pages of swap per file/partition on i386.
        ( 16 million pages of 4k each = 64 GB -- should be enough ;) )
        ( That would require vmalloc() to allocate a 32 MB block, though.
          That might not always be possible -> swapon may fail. )
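
        The arithmetic behind those two figures, as a rough sketch. The
        constant names are mine, and the 2 bytes per page for the
        vmalloc()ed map is an assumption (one unsigned short per swap
        page) that happens to reproduce the 32 MB figure.

```c
#include <assert.h>
#include <stdint.h>

#define SWP_OFFSET_BITS      24  /* 32-bit value shifted right by 8 */
#define PAGE_SHIFT_I386      12  /* 4k pages */
#define SWAP_MAP_ENTRY_BYTES 2   /* assumption: one unsigned short per page */

/* Number of pages addressable within one swap file/partition. */
uint64_t max_swap_pages(void)
{
    return (uint64_t)1 << SWP_OFFSET_BITS;
}

/* The same limit expressed in bytes. */
uint64_t max_swap_bytes(void)
{
    return max_swap_pages() << PAGE_SHIFT_I386;
}

/* Size of the per-page map that swapon would have to vmalloc(). */
uint64_t swap_map_bytes(void)
{
    return max_swap_pages() * SWAP_MAP_ENTRY_BYTES;
}
```

        That is 16777216 pages, 64 GB of swap, and a 32 MB map for a
        single maximally sized swap area.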

        The more I read the documentation (= source and its comments),
        the more I am inclined to think that the beast *will* work with
        swap-partitions (and files!) larger than 2G.

        Stephen Tweedie did this 'SWAPSPACE2' work for the 2.4 series; what
        might he say? Is it really just a matter of fixing the
                mkswap
        utility? Was Stephen just being conservative in saying:
                "Don't go over 2G" (I haven't tested it)

        Reviewing the architecture-specific definitions of these SWP_***()
        macros, the shifts used for SWP_OFFSET seem to vary between
        7 and 12, and for Alpha and MIPS64 the shift is 40. Indeed things
        are not very easy to understand with 64-bit architectures. It looks
        like those architectures use the low 32 bits of swp_entry_t for
        something, while most use at most a couple of bits.

        Oh, even those 64-bit systems seem to give at least 24 bits to
        the 'offset' (counted in PAGE_SIZE units). The lowest bit count
        for 'offset' seems to be on s390, which gives "only" 2^20 pages
        of 4k each, or 4 GB per swap file/partition. (SWP_OFFSET() shifts
        by 12, which is the same as PAGE_SHIFT for the machine. Why
        SPARC64 uses PAGE_SHIFT in its own unique way, that I don't know.)
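
        A small sketch of how the per-architecture figures follow from
        the shift alone. It assumes SWP_OFFSET() is a plain right shift
        with no masking, so the offset gets every bit of the word above
        the shift. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the offset occupies all bits of the word above the shift. */
unsigned swp_offset_bits(unsigned word_bits, unsigned shift)
{
    assert(shift < word_bits);
    return word_bits - shift;
}

/* Largest swap area, in bytes, addressable with that many offset bits. */
uint64_t swap_area_limit(unsigned offset_bits, unsigned page_shift)
{
    assert(offset_bits + page_shift < 64);
    return (uint64_t)1 << (offset_bits + page_shift);
}
```

        For a 32-bit word, a shift of 8 (i386) leaves 24 offset bits and
        a shift of 12 (s390) leaves 20, i.e. 4 GB with 4k pages. A 64-bit
        word with a shift of 40 again leaves 24 bits.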

        Somehow I suspect that the makers of each architecture port have
        not quite understood what the swp_entry_t bits are used for, and
        have blindly presumed them to be related to PAGE_SIZE ...

> For 2.5 we could perhaps think about a new swapfile layout which
> allows bigger partitions.

        The format seems to be just fine.

> Greetings
> Christoph

/Matti Aarnio