[REVS] Defeating Voice Captchas



The following security advisory is sent to the securiteam mailing list, and can be found at the SecuriTeam web site: http://www.securiteam.com
- - promotion

The SecuriTeam alerts list - Free, Accurate, Independent.

Get your security news from a reliable source.
http://www.securiteam.com/mailinglist.html

- - - - - - - - -



Defeating Voice Captchas
------------------------------------------------------------------------


SUMMARY

Jochem van der Vorm has written a elegant paper and proof of concept to
how spammer could attack the voice based captchas being utilized in
various web sites.

DETAILS

Introduction:
For some years semi turing tests under the name of 'captchas' can be found
on the web, to prevent bots from filling in forms. When Jochem first saw
the visual variant he thought recognizing the characters with a computer
algorithm should be easy. A bit of surfing and searching on the Internet
learned him that he was right, most were broken already. Reinventing the
wheel is not very useful, so he left the topic alone.

Later Jochem found a post about voice captchas. Since there was not too
much information about this on the net and was bored (ill at home), he
decided to give it a shot. Jochem started easy, willing to enhance the
used algorithms to those used in speech recognition (like hmm, viterbi,
baum-welch, entropy coding, etc.) when needed. This proved not to be
necessary, the first feature complete (segmentation and matching) code
worked relatively well on Microsoft's captchas. Later he tweaked it a bit
to also work on Google captchas.

On this page you can find proof of concept code to break voice captchas.
Do not expect advanced software (pattern recognition science is so much
further) or code that can be used in other projects, Jochem quited the
project when it worked. Initially (February 2006) he kept the code on his
harddisk, but later (may 2006) he published it (see disclosure
motivation).

How does it work:
This is not a complete guide, but some pointers to the source (read it
luke). As a starting point, consider the configtype struct:

typedef struct {
int samplerate;
int byterate;
int winsize;
int band_cnt;
int word_length;
int word_overlap;
int threshold_energy;
int file_offset;
char trainfile[255];
} configtype;

The program starts with reading the audio file (in the header it could
read the samplerate and byterate). file_offset bytes are skipped in the
beginning of the file, because Google starts with a bell. The first step
is that all samples are treated with a hamming window (arbitrary choice,
most window types should do). The winsize is in samples (eg 512 samples on
8000 Hz provides a 64 ms window). Now the blocks are transformed into the
frequency domain with a DFT After that the frequencies are put in band_cnt
bins. These bins are not equally wide, the higher the frequency, the
larger the band (this has to do with human
hearing (mel/bark scale), but Jochem doubts this is actually useful at the
current incarnation of the program).

Now the program looks at the highest frequency bin. Every block that has
more energy in a window than threshold_energy is considered a peak, and
these peaks are used the segment the input file in the different spoken
words. The word_length tells the program how many windows long a word is
(so all words are considered the same length which is a current weakness
of devoicecaptcha). word_overlap helps in localizing the peaks. When the
locations of the words are know all frequency bins are written for
word_length windows around the peaks. This is called the profile of the
word.

The profiles for know words are put in trainfile. When a guess has to be
made, the profiles for the words in the file are subtracted from those in
the trainfile and the smallest deviation is chosen as the proper word.
That is all.

The algoritms in devoicecaptcha are at this moment really naive. There are
a lot of possible improvements. Perhaps in the future Jochem will enhance
the program a bit, for now he thinks the 33% (as on Google's captchas) is
good enough.

Proof of concept:
The code is rather messy. Download
<http://vorm.net/downloads/devoicecaptcha.c> devoicecaptcha.c (see at the
bottom) and compile it with it:
gcc -lfftw3 -std=c99 devoicecaptcha.c

As you can see you need fftw, an allround Fourier transform library, which
is packaged for most distributions, so you can be lazy (apt-get install
fftw3-dev or similar).

When started with ./a.out captcha.wav you also need a data set (a msn one
and a Google one are available. If you have downloaded the same captchas
(see links) as Jochem has, it will print a guess on stdout.

As said before, devoicecaptcha works with a comparison to a trained set.
To build up a training set and test the effectiveness of various
parameters you can start devoicecaptcha with a third bogus argument, eg
/a.out captcha.wav --print.

What he did was download a large set of captchas with lwp and transcribed
them with the proper words with something like:
for i in google/*.wav; do aplay $i &> /dev/null &; read j; mv $i
google/$j.wav; done

He ended up with a directory with filenames like "123456.wav" where 123456
are the digits spoken in the captcha. On this directory he unleashed a
small ruby script, which splits the files in a training and testing set,
builds a training set and tests the rest. This script can be found under
train.rb.

MSN
MSN (passport) audio captchas are really weak. Only digits are used, there
are always ten digits and the noise is weak and constant. The distance
between the words is relatively constant. Devoicecaptcha guesses all ten
digits correct on around the 75% of all cases, with a training set of
about 40 files.

A data set which can be used for the English MSN (aka passport, aka MSN
live) voice captchas (which Jochem got from passport.net) can be
downloaded under the name msn.txt. It is also possible to create your own
training data (see above).

Google
Google's voice captchas are more difficult to break than the captchas by
Microsoft. Google employs different speakers, uses better noise artifacts
and a random number of words. The dictionary is (as Microsoft's) limited
to digits only. The devoicecaptcha program scores around 33% on these
voice captchas with a training set of 60 files. This is high enough for
use in a bot.

A data set which can be used for the Google captchas (in English, Google
also provides captchas in multiple languages) can be found under
google.txt. The files were found at Google new account.

These captchas are also in use by blogger and blogspot for comments

Others
If you know other voice captcha systems, let me know. Perhaps Jochem will
have some time to look into them (and perhaps not). Jochem will at least
add them to the links section on this page, so together with the provided
source other people can try to beat them.

Disclosure motivation:
Jochem did not release the source code on this page without hesitation,
because it might help spammers in their goals. And he hates spam. However
there are some reasons he released the code anyway:
* Jochem does not believe captchas actually work. Almost all visual
captchas are broken already (read for example Microsoft's paper on visual
captchas or ez-gimpy which can defeat the 'human insolvable' captchas at
yahoo). Or look at the versatile pwntcha. Although he does believe humans
can fool computers for quite some time in the future, he suspects that
computers can always beat computer generated challenges (without a
'secret' as in PKI).

* Spam is a human problem and he believes human problems should be solved
by humans. Legislation and law enforcement are in his opinion the best
ways to deal with spammers. If captchas did no harm anyone this would of
course not be enough reason for releasing this code, but

* Captchas make the web a more difficult place for disabled people (see
http://www.w3.org/ and more annoying for everyone. Jochem hopes the
community will be motivated to find other solutions (and he is happy that
the w3 organization cares for people with disabilities and a usable web
for everyone, including the deaf-blind).

* Jochem is a proponent of full disclosure.

Some people might ask what kind of solutions Jochem suggests for solving
the spam problem. Spamassasin catches thousands of spam mails for me; it
is expensive in cpu cycles (so putting spammers in jail is preferred), but
the multi-tiered approach (neural network detection together with several
lists of wrong-doers) works relatively well and can be applied to other
forms of spam.

Playing the cat/mouse game with more difficult captchas, when the previous
challenge is broken will work, but is not satisfactory in the end. Jochem
encountered more human unsolvable captchas everyday. Jochem understands
that corporations play this game however; in the real world thresholds do
help.

Links
Information
* <http://en.wikipedia.org/wiki/Captcha> Wikipedias entry on captchas is
a very good starting point on this topic (as usual) and contains links to
pages about breaking visual captchas

* <http://www.w3.org/TR/turingtest/> Inaccessibility of CAPTCHA by
w3.org

* <http://www.ceas.cc/papers-2005/160.pdf> Microsofts paper on visual
captchas

* <http://sam.zoy.org/pwntcha/> PWNtcha - captcha decoder by sam hocevar

Broken voice captchas
* <https://www.google.com/accounts/NewAccount> Google signup

* Microsofts <https://accountservices.passport.net/reg.srf> passport
service

* <http://www.blogger.com/> blogger and <http://blogspot.com/> blogspot
also use google voice captchas, so these are broken to.

Not yet broken voice captchas
* <http://www.notonebit.com/projects/killbot/kbaudio.php> Killbot with
audio (should be easy with standard voice recognition software)

* <http://www.standards-schmandards.com/index.php?2005/01/01/11-captcha>
Standards schmmandards proposal is adding random text like ("You now hear
numbers ..") and using a speech engine with different voices. Jochem
intuition tells him it is easy to beat, but the C# source is unusable on
his Linux box, so he cannot prove. Generated examples are welcome.

Source code:
/* devoicecaptcha.c, jochem, http://vorm.net/captchas */

#define PI 3.141592653
#define EXP 2.718281828

#include<string.h>
#include<stdlib.h>
#include<stdio.h>
#include<math.h>
#include<fftw3.h>

typedef struct {
int samplerate;
int byterate;
int winsize;
int band_cnt;
int word_length;
int word_overlap;
int threshold_energy;
int file_offset;
char trainfile[255];
} configtype;

static configtype msn = { 8000, 1, 256, 6, 10, 8, 8000, 46, "msn.txt"};
static configtype google = { 8000, 2, 512, 12, 8, 14, 2000000, 25500,
"google.txt"};

/* Hamming window */
void hamming(double* data, int winsz)
{
for(int i=0; i < winsz; ++i)
*(data+i) = (0.54-0.46*cos(2.0*PI*i/winsz))*(*(data+i));
}

/* Precalc 'semi-mel' band boundaries in samples */
void setup_mel(int* bands, int cnt, int ws, int fs)
{
double max_mel = 1127.01048*log(1.0+(fs/700.0));
for(int i=1; i <= cnt; ++i)
*(bands+i-1) = (int)
(700*ws*(pow(EXP,((i*max_mel)/cnt)/1127.01)-1))/fs;
}

/* Sum over band */
void sum_over_bands(double* data, int* bands, double *sum, int cnt)
{
int j = 1;
for(int i=0; i < cnt; i++) {
*(sum+i) = 0;
while(j < *(bands+i)) {
*(sum+i) += (double) abs(*(data+j));
j++;
}
/* Normalize */
if (i == 0) *(sum+i) = *(sum+i)/(*(bands+i));
else *(sum+i) = *(sum+i)/(*(bands+i)/(*(bands+i-1)));
}
}

/* Simple compare for qsort */
int compare (const void * a, const void * b) { return ( *(int*)a -
*(int*)b ); }

/* Naive peak detection, by comparing energy with threshold, max first */
/* using only info from band_idx */
int detect_peaks(double* values, int values_cnt, int** peaks, int width,
int threshold, int band_idx, int band_cnt)
{
int detected = 0;
int max = 0;
int max_loc;
/* Copy only band_idx data to prevent distorting values */
double* band = (double*) malloc(values_cnt*sizeof(double));
if (band == NULL) {
perror("Out of mem\n");
exit(1);
}
for(int i=0; i < values_cnt; i++)
*(band+i) = *(values+i*band_cnt+band_idx);
while(1) {
for (int i=0; i < values_cnt; i++) {
if (*(band+i) > max) {
max = *(band+i);
max_loc =i;
}
}
if (max < threshold) break;
max = 0;
detected++;
*peaks = realloc(*peaks, sizeof(int)*detected);
if (*peaks == NULL) {
perror("Out of mem\n");
exit(1);
}
*(*peaks+detected-1) = max_loc;
for(int j=max_loc-width/2; j < max_loc+width/2; j++)
if (j >= 0 && j < values_cnt)
*(band+j) = 0;
}
qsort(*peaks, detected, sizeof(int), compare);
free(band);
return detected;
}

/* Basic usage information */
void show_usage(char *s)
{
printf("Start with %s name.wav\n"
"to guess the words in name.wav.\n\n"
"Start with %s name.wav --print\n"
"to print a profile for name.wav thatcan be used as "
"training data\n", s,s);
exit(1);
}

/* Lazy open wav: - no header parsing */
void sniff_wavfile(char *s, configtype **cfg)
{
FILE *fp = fopen(s, "rb");
if (fp == NULL) {
perror("Could not open input file\n");
exit(1);
}
unsigned char buf[50];
if (! fread(buf, sizeof(buf), 1, fp)) {
perror("File to small\n");
exit(1);
}
if (buf[34] == 8)
*cfg = &msn;
else if (buf[34] == 16)
*cfg = &google;
else {
perror("Unknown input file\n");
exit(1);
}
fclose(fp);
}

/* Read file, window, dft and calc params */
int read_blocks(char* in, double** params, configtype *cfg)
{
int samples_read=0;
int blocks_read=0;

FILE *fp = fopen(in, "rb");
if (fp == NULL) {
perror("Could not open input file\n");
exit(1);
}
fseek(fp, cfg->file_offset, SEEK_SET);
/* Setup fftw and mel */
double* audio_t = (double*) malloc(cfg->winsize*sizeof(double));
double* audio_f = (double*) malloc(cfg->winsize*sizeof(double));
int *bands = (int*) malloc(cfg->band_cnt*sizeof(int));
if (audio_t == NULL || audio_f == NULL || bands == NULL) {
perror("Out of mem\n");
exit(1);
}
fftw_plan p = fftw_plan_r2r_1d(cfg->winsize, audio_t, audio_f,
FFTW_R2HC,
FFTW_ESTIMATE);
setup_mel(bands,cfg->band_cnt,cfg->winsize,cfg->samplerate);

/* Loop over file */
short buf;
while(! feof(fp)) {
if (cfg->byterate == 1) {
buf = fgetc(fp);
*(audio_t+(samples_read++ % cfg->winsize)) = (double) buf;
} else if (cfg->byterate == 2) {
if (fread(&buf,1, sizeof(short), fp) < sizeof(short))
break;
*(audio_t+(samples_read++ % cfg->winsize)) = (double) buf;
}
if((samples_read % cfg->winsize) == 0) {
*params = (double*) realloc(*params, (blocks_read+1)*
cfg->band_cnt*sizeof(double));
if (*params == NULL) {
perror("Out of mem\n");
exit(1);
}
hamming(audio_t,cfg->winsize);
fftw_execute(p);
sum_over_bands(audio_f, bands, (*(params)+blocks_read*
cfg->band_cnt), cfg->band_cnt);
blocks_read++;
}
} /* todo (perhaps ;-)) zeropad and handle last block */

/* Free/close */
fclose(fp);
free(audio_t);
free(audio_f);
free(bands);
fftw_destroy_plan(p);
return blocks_read;
}

/* Load trained set (messy) */
int load_trained(char* filename, int** profiles, int** words, int n_bands)
{
int digit;
int n_digits = 0;
char buf[255];
int lines = 0;
FILE* trained = fopen(filename, "r");
if (trained == NULL) {
perror("Could not open trained set\n");
exit(1);
} else {
while(fgets(buf, 255, trained) != NULL) {
if (strstr(buf, ":") != NULL) {
if (sscanf(buf, "%d",&digit)) {
n_digits++;
*words = (int*) realloc(*words,n_digits*sizeof(int));
if (*words == NULL) {
perror("Out of mem\n");
exit(1);
}
*(*words+n_digits-1) = digit;
}
} else {
lines++;
*profiles = (int*)
realloc(*profiles,lines*n_bands*sizeof(int));
if (*profiles == NULL) {
perror("Out of mem\n");
exit(1);
}
char* t = strtok(buf, " ");
int cnt = 0;
while(t!= NULL) {
if (++cnt > n_bands) break;
*(*profiles+cnt+(lines-1)*n_bands-1) = atoi(t);
t = strtok(NULL, " ");
}
if (cnt < n_bands+1) {
perror("Not enough columns in input\n");
exit(1);
}
}
}
fclose(trained);
}
return n_digits;
}

/* Get params around peaks only */
void separate_word_data(int* p, double* all, int* peaks, int n, int
word_cnt,
configtype* cfg)
{
int *p_ptr = p;
for(int i=0; i < word_cnt; i++) {
int loc = peaks[i];
for(int i=loc-cfg->word_length/2; i < loc+cfg->word_length/2; i++)
{
if (i > 0 && i < n) {
for(int j=0; j < cfg->band_cnt; j++) {
*(p_ptr) = (int) *(all+i*cfg->band_cnt+j);
p_ptr++;
}
} else if(i < 0) {
/* fixme: this is hack for digits starting too early*/
for(int j=0; j < cfg->band_cnt; j++) {
*(p_ptr) = (int) *(all+j);
p_ptr++;
}
} else {
/* or late */
for(int j=0; j < cfg->band_cnt; j++) {
*(p_ptr) = (int) *(all+j+n-1);
p_ptr++;
}
}
}
}
}

/* Print param profiles for seperated digits */
void print_params(int* params, int maxx, int maxy, int maxz)
{
for(int i=0; i < maxx; i++) {
for(int j=0; j < maxy; j++) {
for(int k=0; k < maxz; k++) {
printf("%d ", *(params+i*maxy*maxz+j*maxz+k));
}
printf("\n");
}
printf("\n");
}
}

int closed_match(int* word_data, int word_idx, int* trained_data,
int trained_sz, configtype *cfg)
{
int sum;
int pos;
int it_lowest=999999;

int* t_band = malloc(sizeof(int)*cfg->band_cnt);
int* t_mean = malloc(sizeof(int)*cfg->band_cnt);
if (t_mean == NULL || t_band ==NULL) {
perror("Out of mem\n");
exit(1);
}
for (int i=0; i < trained_sz; i++) {
memset(t_band, 0, sizeof(int)*cfg->band_cnt);
memset(t_mean, 0, sizeof(int)*cfg->band_cnt);
for (int j=0; j < cfg->word_length; j++) {
for (int k=0; k < cfg->band_cnt; k++) {
int orig =
*(trained_data+i*cfg->word_length*cfg->band_cnt+
j*cfg->band_cnt+k);
int test = *(word_data+j*cfg->band_cnt+k+word_idx*
cfg->word_length*cfg->band_cnt);
*(t_band+k) += abs(orig-test);
*(t_mean+k) += orig;
}
}
sum = 0;
int borked = 0;
for (int l=0; l < cfg->band_cnt; l++) {
if (*(t_mean+l) != 0)
sum += (*(t_band+l)*cfg->word_length)/(*(t_mean+l));
else
borked = 1;
}
if (sum < it_lowest && ! borked) {
it_lowest = sum;
pos = i;
}
}
free(t_mean);
free(t_band);
return pos;
}

int main(int argc, char** argv)
{
if (argc != 2 && argc !=3)
show_usage(argv[0]);

/* Sniff filetype and set config for that type */
configtype *cfg;
sniff_wavfile(argv[1],&cfg);

/* Read file and give parames per winsize blocks */
double* params = malloc(0);
int blocks_read = read_blocks(argv[1], &params, cfg);

/* Calculate positions of separate words */
int* peak_locations = (int*) malloc(0);
int word_cnt = detect_peaks(params, blocks_read, &peak_locations,
cfg->word_length+
cfg->word_overlap,
cfg->threshold_energy,cfg->band_cnt-1,cfg->band_cnt);

/* Only interested in data near peaks, put in words */
int* words = (int*) malloc(word_cnt*cfg->word_length*cfg->band_cnt*
sizeof(int));
if (words == NULL) {
perror("Out of mem\n");
exit(1);
}
separate_word_data(words, params, peak_locations, blocks_read,
word_cnt, cfg);
free(params);

/* Print profile for training or lookup in trained set */
if (argc == 2) {
printf("Use training data from: %s\n", cfg->trainfile);
printf("Detected %d words\n", word_cnt);
printf("Guess is: ");

/* Load trained datafile */
int* trained_params = (int*) malloc(0);
int* trained_ids = (int*) malloc(0);
int t_size = load_trained(cfg->trainfile, &trained_params,
&trained_ids,
cfg->band_cnt);

/* Now compare params with trained data per word*/
for (int m=0; m < word_cnt; m++) {
int match_idx = closed_match(words, m, trained_params, t_size,
cfg);
printf("%d", *(trained_ids+match_idx));
}
printf("\n");

/* Freeing */
free(trained_params);
free(trained_ids);
} else if (argc == 3) {
print_params(words, word_cnt, cfg->word_length, cfg->band_cnt);
} else show_usage(argv[0]);

/* Free and close stuff */
free(words);
exit(0);
}


ADDITIONAL INFORMATION

The information has been provided by <mailto:jochem@xxxxxxxx> Jochem van
der Vorm.
The original article can be found at: <http://vorm.net/captchas/>
http://vorm.net/captchas/



========================================


This bulletin is sent to members of the SecuriTeam mailing list.
To unsubscribe from the list, send mail with an empty subject line and body to: list-unsubscribe@xxxxxxxxxxxxxx
In order to subscribe to the mailing list, simply forward this email to: list-subscribe@xxxxxxxxxxxxxx


====================
====================

DISCLAIMER:
The information in this bulletin is provided "AS IS" without warranty of any kind.
In no event shall we be liable for any damages whatsoever including direct, indirect, incidental, consequential, loss of business profits or special damages.



Relevant Pages