4GB Processes in Solaris


Abstract: Although Solaris provides a 32-bit address space, programs encounter minor problems and restrictions beyond 2GB. This application note demonstrates the techniques needed to exploit the 4GB address space.



Introduction

Until the release of a 64-bit version of Solaris, applications running on Sun machines will have to work within the 32-bit address space. While this is large enough for the vast majority of programs, a few have already reached this boundary.

Unfortunately, it is not currently straightforward to reach the nearly 4GB limit that the system provides. Instead, many programs run into barriers at 2GB when they attempt to grow by using malloc() (and, indirectly, sbrk()).

The purpose of this application note is to demonstrate how to exploit the entire virtual address space.

A Few Words About Virtual Memory

In Solaris 2.X (unlike SunOS 4.X), the available virtual memory for a system is the sum of physical memory space and swap space. So the first step to exceeding the 2GB VM barrier is to provide more than 2GB of memory+swap.

Although there are undoubtedly more elegant ways of checking your available VM, I abuse the knowledge that "/tmp" is a memory-based filesystem to measure VM indirectly. Specifically, the amount of available VM is roughly equal to the available /tmp space, thus:

    % df -k /tmp
    Filesystem            kbytes    used   avail capacity  Mounted on
    swap                 2407144    3536 2403608     1%    /tmp
shows that this system has 2.4GB of available VM.

Different Strategies for Different Solaris releases

With respect to accessing the upper 2GB of address space, Solaris has been improved with each successive release. Because of these changes, there are 4 distinct sets of constraints placed on an application (with each newer release becoming easier to work with).

As a software vendor, you can trade off your development costs against the age of the OS you want to support and the requirements you want to place on your user.

The following table summarizes the changes and points to the specific actions needed to use VM beyond 2GB.



Solaris 2.6

If your application is only approaching 2GB of VM, the best approach may be to do nothing. With this latest OS release, an existing binary (without changes or recompilation) can use malloc() to grow the process to approximately 3.75GB. Furthermore, the user isn't required to do anything special except to issue a limit (or ulimit) command to increase their allowed data space:
    % limit datasize 4194303
    % limit datasize
    datasize      4194303 kbytes

[MPI: We tried to use this magic number of 4194303 and always got error messages like the following:]
    # limit datasize 4194303
    limit: datasize: Can't set limit
Trying other magic numbers showed us a different limit:
    # limit datasize 3932152
    limit: datasize: Can't set limit
    # limit datasize 3932151
    #
[Note that although the issue is the virtual memory limit, malloc grows a process via the data segment and the datasize limit is actually the gating factor. For logical consistency you might also increase your memorysize limit, but Solaris doesn't appear to consult this limit when growing a process via malloc().]

Beware that the standard shells have a peculiar definition of the term unlimited. In particular, if you check your resource limits with a command such as:

    % limit datasize
    datasize      unlimited

you might not realize that, for example, datasize is actually limited to 2GB (by default). Fortunately, these commands only lie if the limit is exactly 2GB and will report accurately if you set your limit to something higher. If you run the showlimits program from Appendix A, you can see the actual limits:


    % ./showlimits
    Current/maximum data limit is   2147479552 / 2147479552
    Current/maximum stack limit is  8388608 / 2147479552
    Current/maximum vmem limit is   2147483647 / 2147483647

If you're currently developing code which will eventually run on Solaris 2.6, it would be a convenience to your users if you include code which attempts to automatically increase the data limit. In order for it to be compatible with previous releases of the OS (which require root permissions to increase these limits), you might consider letting it silently ignore a failure. Example code might be:


    struct rlimit datalim;
    /* Select some large value as the upper data limit */
    datalim.rlim_cur = datalim.rlim_max = 3500000000UL;
    /* Ignore failure in order to be compatible with 2.5.1 */
    (void)setrlimit(RLIMIT_DATA, &datalim);

Solaris 2.5.1 with patch 103640 (version -08 or newer)

If you're faced with this issue under Solaris 2.5.1, the easiest solution is to patch the OS with patch 103640 (and its prerequisite patch, 103600). This patch incorporates the improvements to allow the data segment to grow beyond 2GB, with the caveat that this OS still requires root permissions to increase the data resource limit.

Increasing this limit requires either root intervention, with a command sequence like:

    % su root
    passwd:
    # limit datasize 3000000
    # su - username
    %
or, more likely, the use of a setuid program which accomplishes the same thing (see Appendix B).

Solaris 2.5.1 (generic)

If for some reason your customers are unable to patch their OS or are unwilling to tolerate root intervention to increase their resource limits, you would be forced to use mmap() to increase your memory usage. The idea behind this workaround is to allocate additional memory segments and use them in conjunction with the standard data segment. This could be hidden from the application with a modified version of malloc(), or it might be done by the application if there is a higher-level memory management layer on top of malloc(). See Appendix C for an example program which demonstrates this technique.

Unlike the previous examples, since this technique does NOT grow the process via the data segment, you will be required to increase the memorysize resource limit (which IS checked by the OS when you attempt to perform the mmap()).

Note that Solaris does supply an alternative version of malloc() which can take advantage of multiple memory mappings. See the mapmalloc(3x) man page for details, but note the caveat which warns:

Solaris 2.5

If you absolutely need to support older installations which can't be upgraded to 2.5.1, you need to use the workaround described in the previous section. In addition, your end user will need root permissions to increase the VM resource limit. As with the data resource limit mentioned previously, this could be done by root with shell level commands or by a setuid wrapper program. This wrapper would be identical to the example given in Appendix B except that the resource limit to be modified would be RLIMIT_VMEM instead of RLIMIT_DATA.



Patch 103640 footnote

Patch 103640 (revision -08 or newer) is externally available at http://sunsolve1.sun.com/sunsolve and internally at Sun at: http://sunsolve.corp/sunsolvei/patchpages/os-5.5.1.html

Also, this patch requires the installation of at least revision -03 of patch 103600.


Upper VM limit footnote

You may be wondering why the VM limit is nearly 4GB instead of exactly 4GB. The reason is that, historically, the user space and kernel space were mapped from the same 4GB address range. This meant that kernel addresses were partitioned above, for example, 0xF0000000 and user addresses were below. With the introduction of the UltraSPARC hardware, this sharing of addresses was no longer necessary, but the work necessary to remove this assumption from the kernel just didn't appear to be worth the 6% address space increase.


Appendix A: Examining current resource limits

To determine the current resource limits, compile and execute the following program, showlimits.c:

    #include <stdio.h>
    #include <stdlib.h>
    #include <stddef.h>
    #include <sys/time.h>
    #include <sys/resource.h>
     
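    /* Print the soft (current) and hard (maximum) value of one limit. */
    static void
    showlimit(int resource, const char *str)
    {
        struct rlimit lim;
        if (getrlimit(resource, &lim) != 0) {
            (void)printf("Couldn't retrieve %s limit\n", str);
            return;
        }
        /* rlim_t is cast explicitly so that it matches the %lu format */
        (void)printf("Current/maximum %s limit is \t%lu / %lu\n",
                     str, (unsigned long)lim.rlim_cur,
                     (unsigned long)lim.rlim_max);
    }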
     
    /*ARGSUSED*/
    int
    main(int argc, char*argv[])
    {
        showlimit(RLIMIT_DATA,  "data");
        showlimit(RLIMIT_STACK, "stack");
        showlimit(RLIMIT_VMEM,  "vmem");
     
        return 0;
    }


Appendix B: Wrapper to increase data limit in 2.5.1

The following program is an example of a wrapper program which could be used on Solaris 2.5.1 to increase the data segment resource limit.

It can be compiled and used as follows:

    $  cc -o wrapper wrapper.c
    $  su root
    #  chown root wrapper
    #  chmod 4755 wrapper
    #  ^D
    $  wrapper
    Initial limits are:
        rlim_cur = 2147483647
        rlim_max = 2147483647
    Updated limits are:
        rlim_cur = 3221225472
        rlim_max = 3221225472

Note that once the data limit is increased beyond 2GB, the shell built-in commands will report the correct size:

    % limit datasize
    datasize  3145728 kbytes

For simplicity, the wrapper sets the data limit to an arbitrary value of 3GB. Presumably something more elegant could be arranged for production code. The source for wrapper.c follows.


    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/types.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <stdlib.h>
    
    /*
     *  Just an arbitrary large number for example purposes.
     */
    #define THREE_GB (3221225472UL)
    
    /*ARGSUSED*/
    int
    main(int argc, char *argv[], char*env[])
    {
        struct rlimit lim;
        unsigned long target_vm_limit = THREE_GB;
     
        /*
         *  Check and print the current data segment limit
         */
        if (getrlimit(RLIMIT_DATA, &lim)  == -1)
            perror("getrlimit"), exit(-1);
        (void)printf("Initial limits are:\n    rlim_cur = %lu\n    rlim_max = %lu\n",
            lim.rlim_cur, lim.rlim_max);
    
        /*
         *  Set the hard and soft limits to the target value
         */
        lim.rlim_cur = lim.rlim_max = target_vm_limit;
        if (setrlimit(RLIMIT_DATA, &lim) == -1)
            perror("setrlimit"), exit(-1);
    
        /*
         *  Check that it all worked.
         */
        if (getrlimit(RLIMIT_DATA, &lim)  == -1)
            perror("getrlimit"), exit(-1);
        (void)printf("Updated limits are:\n    rlim_cur = %lu\n    rlim_max = %lu\n",
            lim.rlim_cur, lim.rlim_max);
    
        /*
         *  Do NOT run the application with "root" permissions.
         *  Revert to the normal user id.
         */
        if (setuid(getuid()) == -1)
            perror("setuid"), exit(-1);
    
        /*
         *  At this point, a "wrapper" program should exec() the appropriate
         *  application with normal user permissions.  For demonstration
         *  purposes, however, this example will instead invoke a shell with
         *  the new resource limits.
         *  
         *  Note that if it simply called "return", the invoking
         *  shell would still have its original limits.
         */
        (void)execve("/bin/ksh", argv, env);
        perror("execve");
        return -1;
    }


Appendix C: Demonstrate the mmap() workaround

To demonstrate this workaround, the following program will allocate as much memory as it can via malloc() and then allocate more in another memory segment via mmap(). To absolutely prove that the memory is really available, the demo program will then write into every allocated page.

While the program is running, you can inspect the layout of the memory segments as follows:

    $ cc -o demo demo.c

    $ df -k /tmp   # Verify that there is sufficient swap space
    Filesystem            kbytes    used   avail capacity  Mounted on
    swap                 3429560      80 3429480     1%    /tmp

    $ demo > demo.out 2>&1 &
    [2] 1268

    $ /usr/proc/bin/pmap 1268
    1268:   demo
    00010000    8K read/exec          demo2
    00020000    8K read/write/exec    demo2
    000220002050784K     [ heap ]
    000220002050784K read/write/exec
    ADC000001074224K read/write
    EF6F0000   16K read/exec          /usr/platform/SUNW,Ultra-1/lib/libc_psr.so.1
    EF700000  544K read/exec          /usr/lib/libc.so.1
    EF796000   40K read/write/exec    /usr/lib/libc.so.1
    EF7A0000    8K read/write/exec
    EF7C0000    8K read/exec/shared   /usr/lib/libdl.so.1
    EF7D0000   88K read/exec          /usr/lib/ld.so.1
    EF7F0000    8K read/write/exec    /usr/lib/ld.so.1
    EFFFC000   16K read/write/exec
    EFFFC000   16K     [ stack ]

If you can ignore the fact that the address and size columns ran together, you'll see that the 4th entry (labelled "[heap]") is 2050784KB and the 6th entry (which is unlabelled because it is anonymous memory) is 1074224KB for a total allocation of slightly more than 3GB.

The source code for demo.c is:


    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <fcntl.h>
    #include <unistd.h>
    
    #define ALLOCSIZE 3200000000UL
    #define DELTASIZE   50000000UL
    
    int
    main(int argc, char *argv[])
    {
        int fd;
        unsigned long i;    /* offsets go beyond 2GB, so an int is too small */
        char *malloc_space, *mmap_space;
        long pagesize = sysconf(_SC_PAGESIZE);
        unsigned long mmap_size, malloc_size = ALLOCSIZE;
        
    
        /*
         *  Allocate as much memory as possible via malloc
         */
        while ((malloc_space = (char*)malloc(malloc_size)) == NULL) {
            if (malloc_size < DELTASIZE)
                (void)fprintf(stderr, "malloc failed\n"), exit(-1);
            malloc_size -= DELTASIZE;
        }
        (void)fprintf(stderr, "malloc'd %lu bytes\n", malloc_size);
    
    
        /*
         *  Allocate enough extra via mmap to reach ALLOCSIZE bytes
         */
        mmap_size = ALLOCSIZE - malloc_size;
        if ((fd = open("/dev/zero", O_RDWR)) == -1)
            perror("open"), exit(-1);
        mmap_space = (void*)mmap((caddr_t) 0, 
                                 mmap_size, 
                                 (PROT_READ | PROT_WRITE),
                                 MAP_PRIVATE, 
                                 fd, 
                                 (off_t)0);
        if (mmap_space == MAP_FAILED)
                perror("mmap"), exit(-1);
        (void)close(fd);
        (void)fprintf(stderr, "mmap'd %lu bytes\n", mmap_size);
    
        /*
         *  Just to be thorough, test every page of both allocations to make 
         *  absolutely sure that the memory was really allocated.  This will
         *  take a while.
         */
        (void)fprintf(stderr, "Testing the %lu malloc'd bytes ...\n", malloc_size);
        for (i=0; i<malloc_size; i+=pagesize)
            malloc_space[i] = i;
        (void)fprintf(stderr, "Testing the %lu mmap'd bytes ...\n", mmap_size);
        for (i=0; i<mmap_size; i+=pagesize)
            mmap_space[i] = i;
        (void)fprintf(stderr, "done\n");
        return 0;
    }

Appendix D: Common Problems

One issue to look out for is that a large stack limit can inhibit growing your data segment. Note that even if your stack hasn't grown to be large, the virtual memory space for it is reserved according to the limit value. If you have difficulty growing the heap past 2GB, check to make sure your VM isn't consumed by pre-allocated stack:

    % limit stack
    stacksize    2097152 kbytes

    ... no chance of acquiring more than 2GB of data space ...

    % limit stack 16000
    % limit stack
    stacksize      16000 kbytes

    ... 3+ GB should be available for data space ...



Send comments or questions to Morgan Herrington at morgan@computer.org. (12/04/97)