Re: All known english words
From: Unruh (unruh-spam_at_physics.ubc.ca)
Date: 09/11/05
- In reply to: Moe Trin: "Re: All known english words"
Date: 11 Sep 2005 13:39:36 GMT
ibuprofin@painkiller.example.tld (Moe Trin) writes:
>In the Usenet newsgroup sci.crypt, in article
><dfusvs$ocn$5@nntp.itservices.ubc.ca>, Unruh wrote:
>>>> The working vocabulary of most people is of the order of 2000 words,
>Can you cite?
Thanks for answering this question below.
>>>2000?
>>
>>>I believe that is quite low.
>>
>>>I have heard an educated person has about 20k word vocabulary.
>>That is not working vocabulary. That is recognition. Working is much less
>>than that. And what is your definition of "educated"? A person whose
>>recognition vocabulary is large?
>Yes, I know what you mean about the 'recognition' versus the 'working'
>numbers, but where would you put the posters to this newsgroup?
A very disparate bunch, and one that frequently misspells words.
>[compton ~]$ grep -vh '^>' /var/spool/news/news/sci/crypt/* | \
>    grep -v '^[A-Z][0-z-]*:' | tr ' ' '\n' | grep -v '^[0-9]' > /tmp/FOO ; \
>    wc -l /tmp/FOO ; sort -ub < /tmp/FOO | wc -l
> 26038 /tmp/FOO
> 5468
>[compton ~]$
Did you do a sanity check on the resulting list? Is a word with 's added a
different word? Is a word with _ prepended and appended a different word (see
your own post)? How many of the "words" (delimited by spaces) in the command
line above were counted as different words? Is wc a word in
English? Is -l? Is FOO, or tmp, or [A-Z]? I would regard none of those as
words in English. They are symbols in a computer language.
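To make the objection concrete, here is a rough sketch of a stricter count.
The normalization rules are my own assumptions, not part of the quoted
pipeline: case is folded, surrounding underscores and sentence punctuation
are stripped, a trailing 's is dropped, and only purely alphabetic tokens
are kept. The function name count_distinct_words is made up for
illustration. Note that it would still count "wc" and "foo" as words;
rejecting those needs a real word list, which this does not attempt.

```shell
#!/bin/sh
# Sketch only: count distinct alphabetic tokens read from stdin.
# Assumed rules (not from the quoted pipeline): fold case, strip
# surrounding _ " ( ) and sentence punctuation, drop a trailing 's,
# and discard any token that is not made up solely of a-z.
count_distinct_words() {
    tr -s '[:space:]' '\n' |                  # one token per line
    tr '[:upper:]' '[:lower:]' |              # fold case
    sed -e 's/^[_"(]*//' \
        -e 's/[_".,;:!?)]*$//' \
        -e "s/'s\$//" |                       # strip decoration, then 's
    grep -E '^[a-z]+$' |                      # keep purely alphabetic tokens
    sort -u |                                 # unique tokens
    awk 'END { print NR }'                    # count them
}
```

It would be run over the same input as the quoted command, for example
count_distinct_words < /tmp/FOO. With these rules "Word", "word's" and
"_word_" all collapse to one entry, while -l, /tmp/FOO and [A-Z] are
dropped.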
>If you can't follow that - the spool contains 222 articles (about six
>days, less what my killfile took out). The first grep removes lines with
>the normal '>' used for quoting, the second takes out lines that begin
>with a capital letter, and have only letters, numbers and some punctuation
>before a colon - the normal headers. The result is then piped through
>tr to make it one word per line, words that _begin_ with a digit are
>removed, and the result stuck into a temporary file. The file then is
>checked for a word count (26038 words), and then sorted into unique
>words - 5468. Crude, but an indication people here are using more than
>2000 word vocabularies.
>As a cross check, I wrote a similar script to check the number of words
>and the variation in the raw posts (text file before I hand them to the
>news tool) of mine over the past seven days. The result: 14186 words,
>4098 _different_ words.
>I don't know about you, but I've worked a number of years overseas, and
>I actually do try to _limit_ the words that I use in Usenet - another of
>these places where English is common, but is not the first language of
>the majority of readers. The variation is increased because I'm writing
>of technical subjects, but still isn't that far out of line.
>>A 2000 word working vocabulary makes you literate.
>Another google search "VOA Special+English". Very briefly, starting in
>1959, the Voice of America started broadcasting the news in what they called
>"Special English". The web pages I've found don't mention it, but I recall
>using just a thousand words in the early 1960s. One of the web pages you
>find searching for the above keywords says:
> VOA Special English is a simplified English language used by Voice of
> America in daily broadcast. The news is read slowly and using a
> limited wordlist of about 1500 words. Charles Kay Ogden recommended
> radio news be given in Basic English with the appropriate Basic
> special radio vocabulary add-on. VOA Special English, although
> intended for telling news around the world, has the additional
> benefit of allowing English to be learned and for pronunciation to be
> polished.
>I guess there are varying standards of literacy. ;-)
1500 is less than 2000.
>>And how many of Shakespeare's words do you even recognize?
>Seeing as how I haven't read any Shakespeare since... the 1950s (probably
>1957 or 58), probably not that much off the top of the head, but I very
>likely recognize more of his words than he would recognize of mine ;-)
I suspect not.