Re: Generate a one-time pad from say a 256bit key?



ggr@xxxxxxxxxxxx (Gregory G Rose) writes:

In article <1154937437.825861.65700@xxxxxxxxxxxxxxxxxxxxxxxxxxxx>,
TC <gg.20.keen4some@xxxxxxxxxxxxxxx> wrote:
<
<Gregory G Rose wrote:
<.
<>Two-byte Unicode characters can be considered as if the first byte
<> specifies the page of a book of characters, and the second byte
<> specifies the character within the page. Hebrew and Arabic are on
<> different pages, and (IIRC) not all of the character spaces are filled in.
<> So, for every pair of bytes of the ciphertext, there is a slightly higher
<> than random chance that they were actually enciphered by XORing
<> with two consecutive zero bytes... that is, they are actually plaintext.
<
<> So, for each pair of ciphertext bytes, classify them into one of three
<> categories:
<> 1. looks like valid Arabic character
<> 2. looks like valid Hebrew character
<> 3. everything else.
<>
<> Category 3 will swamp the other two categories, but with enough ciphertext,
<> and assuming the plaintext was indeed one of the two character sets, either
<> category 1 or 2 will start to dominate more than can be explained by random
<> chance. As soon as the probability of their deviance becomes statistically
<> significant to, say, 99.9%, you have your answer.
<
<I don't get it.
<
<Even if there were squintillions of 1's, and no 2's at all, how could
<you conclude from this, that the plaintext is in Arabic?
<
<AFAICS, you could only conclude that *if* the plaintext *is* in either
<Arabic or Hebrew, *then*, it is more likely to be in Arabic than in
<Hebrew.
<
<But that's a long way from saying that you can exploit the bias to tell
<what language the plaintext is in. What if the plaintext was in German,
<and German happens to have lots of 1's and not many 2's?

All I claimed originally was that you'd be able to
distinguish between Hebrew and Arabic... reading
back I guess I could have phrased it better, since
you think I was claiming that one could determine
the language a priori. That wasn't what I meant,
although it's still at least feasible. The bias
toward zero bytes from RC4 would allow you to build up a
frequency histogram that would eventually begin to
resemble the classical English (or German)
frequency tables.

I still have difficulty. The bias toward zero is small (and is it a bias
toward zero, or a bias toward "if one byte is zero then the next will also be
zero"?), so I find it hard to believe that any such frequency
histogram would show anything: the bias would be buried in the natural
variability of the languages.

And the difference between languages lies not so much in the relative
frequencies of the different letters as in the distribution of digrams
and trigrams in the language. And those distributions are highly dependent
on the text: they will be different in Shakespeare than in a treatise on
botany. I.e., again I believe that the natural variability of the language
will swamp the biases. In the case of Unicode
Arabic vs. Hebrew, the first (code page) byte of each character has zero
entropy, which would of course let you distinguish them. If you know some
feature has zero entropy, then even a small bias can
eventually be detected in it.
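
To make the zero-entropy point concrete, here is a minimal Python sketch. It assumes the text is encoded as UTF-16BE and consists only of letters from a single block (spaces and punctuation live on page 0x00 and would add some entropy to the page byte). The sample strings and the helper names byte_entropy and page_and_offset_entropy are made up for this illustration.

```python
import math
from collections import Counter

# Hypothetical sample words, written as escapes to avoid bidi rendering issues.
HEBREW_SAMPLE = "\u05e9\u05dc\u05d5\u05dd"        # letters from the Hebrew block
ARABIC_SAMPLE = "\u0645\u0631\u062d\u0628\u0627"  # letters from the Arabic block

def byte_entropy(values):
    """Empirical Shannon entropy, in bits, of a sequence of byte values."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def page_and_offset_entropy(text):
    """Entropy of the first (page) byte and second (offset) byte of each character."""
    raw = text.encode("utf-16-be")
    return byte_entropy(raw[0::2]), byte_entropy(raw[1::2])

if __name__ == "__main__":
    for name, sample in (("Hebrew", HEBREW_SAMPLE), ("Arabic", ARABIC_SAMPLE)):
        page, offset = page_and_offset_entropy(sample)
        print(name, "page-byte entropy:", page, "offset-byte entropy:", offset)
```

For letters-only text the page byte is constant (0x05 for Hebrew, 0x06 for Arabic), so its empirical entropy is zero while the offset byte carries all the variability.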



But it's always easier, statistically, to test
between two hypotheses (eg. German vs. not German,
or Hebrew vs. Arabic) than to try to classify the
input.
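
A toy simulation of the two-hypothesis test, as a sketch only. It does not implement RC4: the keystream is uniform random bytes, except that with a made-up probability eps a pair of keystream bytes is forced to (0, 0). That eps is deliberately exaggerated so the effect shows up quickly, and the classification looks only at the page byte, ignoring unassigned code points. All function names are hypothetical.

```python
import math
import random

ARABIC_PAGE = 0x06  # first byte of U+0600..U+06FF in UTF-16BE
HEBREW_PAGE = 0x05  # first byte of U+0590..U+05FF in UTF-16BE

def keystream_pair(rng, eps):
    """Two keystream bytes; with probability eps both are zero (toy bias, not RC4)."""
    if rng.random() < eps:
        return 0, 0
    return rng.randrange(256), rng.randrange(256)

def classify_counts(n_pairs, eps, seed=1):
    """Encrypt Arabic-page characters and count ciphertext pairs landing on each page.

    Only the page byte matters for this classification, so the second
    plaintext byte is not simulated.
    """
    rng = random.Random(seed)
    arabic = hebrew = 0
    for _ in range(n_pairs):
        k_hi, _ = keystream_pair(rng, eps)
        cipher_hi = ARABIC_PAGE ^ k_hi
        if cipher_hi == ARABIC_PAGE:
            arabic += 1
        elif cipher_hi == HEBREW_PAGE:
            hebrew += 1
        # everything else is "category 3" and is ignored
    return arabic, hebrew

def z_score(arabic, hebrew):
    """Normal-approximation z for 'both pages equally likely' (no usable bias)."""
    total = arabic + hebrew
    return 0.0 if total == 0 else (arabic - hebrew) / math.sqrt(total)

def samples_needed(eps, p=1 / 256, z=3.29):
    """Back-of-envelope pair count for the excess eps to reach z standard deviations.

    Assumes a background rate p per page and solves
    z = eps * sqrt(n) / sqrt(2 * p + eps) for n.
    """
    return math.ceil(z * z * (2 * p + eps) / (eps * eps))

if __name__ == "__main__":
    a, h = classify_counts(20000, eps=0.05)
    print("Arabic-looking:", a, "Hebrew-looking:", h, "z =", z_score(a, h))
    print("rough pairs needed at eps=0.05:", samples_needed(0.05))
```

Under these assumptions the required number of ciphertext pairs grows roughly as 1/eps^2 once eps is small compared with the background rate, which is the crux of the disagreement above: the test works in principle, but a tiny bias needs a great deal of ciphertext.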
