Re: A New/Old code Just For Fun

On Jul 23, 11:06 am, Paulo Marques <pmarq...@xxxxxxxxxxxx> wrote:

It should be something like: sum_for_all_words(frequency * code_letters)
/ sum_for_all_words(frequency). I.e. the total number of letters used to
encode the corpus divided by the total number of words. This should give
the average letters per word used to encode the complete corpus.
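
As a rough illustration of that formula, here is a minimal C sketch. The struct word_stat record and the function name are made up for the example and are not taken from anyone's actual program; it only assumes that each word has a corpus frequency and a code length in letters.

```c
#include <stddef.h>

/* Hypothetical record: one corpus word with its count and code length. */
struct word_stat {
    unsigned long frequency;    /* occurrences of the word in the corpus */
    unsigned long code_letters; /* letters in the code assigned to the word */
};

/* Returns (sum of frequency * code_letters) / (sum of frequency),
 * or 0.0 when the corpus is empty. */
double average_letters_per_word(const struct word_stat *words, size_t count)
{
    unsigned long long total_letters = 0;
    unsigned long long total_words = 0;

    for (size_t i = 0; i < count; i++) {
        total_letters += (unsigned long long)words[i].frequency
                         * words[i].code_letters;
        total_words += words[i].frequency;
    }
    return total_words ? (double)total_letters / (double)total_words : 0.0;
}
```

For example, three words with frequencies 6, 3, 1 and code lengths 1, 2, 3 use 15 letters for 10 words, so the average is 1.5 letters per word.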

If you need help debugging the code, you can send it to me privately.
I'm usually good at spotting other people's bugs. I just wish I could
use that superpower for my own programs :(

I found it. In my recursive display function I misplaced one line of
code, so I was computing "strlen( tag ) * lpScan->weight" before I
appended the final letter for this branch to the tag. As a result,
every tag was counted as one letter shorter than it really was. I
fixed that and got:

Total Words 1075617
Total Letters 2321741
Average letters per word 2.16

So overall, Huffman gets 2.16 vs. my hand-made 2.35, or about an 8% improvement.
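
For anyone curious about the ordering bug, here is a minimal C sketch of the corrected order: append the branch letter to the tag first, then measure the tag. The struct node layout, MAX_CHILDREN, MAX_TAG, and the function names are hypothetical and are not the actual program; only the "strlen( tag ) * weight" idea comes from the description above.

```c
#include <stddef.h>
#include <string.h>

#define MAX_CHILDREN 4   /* hypothetical branching factor */
#define MAX_TAG 32       /* hypothetical tag buffer size */

struct node {
    unsigned long weight;              /* word frequency, used at leaves */
    char letter;                       /* letter on the branch into this node */
    struct node *child[MAX_CHILDREN];  /* NULL means no child in that slot */
};

/* Sums strlen(tag) * weight over every leaf below n.
 * tag holds the code built so far and len is its current length. */
static unsigned long long total_letters(const struct node *n,
                                        char *tag, size_t len)
{
    unsigned long long sum = 0;
    int is_leaf = 1;

    for (int i = 0; i < MAX_CHILDREN; i++) {
        const struct node *c = n->child[i];
        if (c == NULL)
            continue;
        is_leaf = 0;
        if (len + 1 >= MAX_TAG)
            continue;              /* tag buffer full: skip deeper levels */
        tag[len] = c->letter;      /* append this branch's letter FIRST */
        tag[len + 1] = '\0';
        sum += total_letters(c, tag, len + 1);
        tag[len] = '\0';           /* restore the tag for the next branch */
    }
    if (is_leaf)                   /* the tag is complete here, so measure it */
        sum += (unsigned long long)strlen(tag) * n->weight;
    return sum;
}

/* Total letters needed to encode the corpus described by the tree. */
unsigned long long corpus_letters(const struct node *root)
{
    char tag[MAX_TAG] = "";
    return root ? total_letters(root, tag, 0) : 0;
}
```

Taking the product before the append would measure every tag one letter short, which is the symptom described above.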

But more important, I learned a lot by doing this exercise. :)

BTW: as an alternative for making pronounceable codes, I discovered
the best approach is to build codes purely out of consonants, and then
add any old vowels you please when you use the codes. The human ear is
better at picking harmonious vowels than any program could be. So PTM
could be pronounced "patuma", or "aputiamu", or whatever you like,
without disturbing the self-segregating property. Adding a few rules
like "X" = "sh", "C" = "ch", and "Q"="th" makes even oddballs like XQ,
and CCN easy: "shathu", "chachani". Move over Apache Code Talkers. You
have met your match. :)
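
A minimal C sketch of the substitution rules, for illustration only. The function name pronounce and the idea of cycling through a caller-supplied vowel string are assumptions made for the example; the sketch places exactly one vowel after each consonant, which is a simplification of "add any old vowels you please".

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Writes a pronounceable form of a consonant-only code into out.
 * Applies X = "sh", C = "ch", Q = "th", and follows each consonant with
 * one vowel taken cyclically from vowels.
 * Returns 0 on success, -1 if out is too small or vowels is empty. */
int pronounce(const char *code, const char *vowels, char *out, size_t cap)
{
    size_t nv = strlen(vowels);
    size_t used = 0;

    if (nv == 0 || cap == 0)
        return -1;
    for (size_t i = 0; code[i] != '\0'; i++) {
        char single[2] = { (char)tolower((unsigned char)code[i]), '\0' };
        const char *sound = single;   /* default: the letter itself */

        switch (toupper((unsigned char)code[i])) {
        case 'X': sound = "sh"; break;
        case 'C': sound = "ch"; break;
        case 'Q': sound = "th"; break;
        }
        size_t need = strlen(sound) + 1;   /* the sound plus one vowel */
        if (used + need >= cap)            /* keep room for the final NUL */
            return -1;
        memcpy(out + used, sound, need - 1);
        used += need - 1;
        out[used++] = vowels[i % nv];
    }
    out[used] = '\0';
    return 0;
}
```

With this sketch, pronounce("XQ", "au", buf, sizeof buf) fills buf with "shathu", and pronounce("CCN", "aai", buf, sizeof buf) gives "chachani".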