Re: Encrypting Unicode – Using ASCII as a Surrogate Alphabet.



On Mar 9, 4:58 am, WTShaw <lure...@xxxxxxxxx> wrote:
On Mar 8, 11:10 am, austin.oby...@xxxxxxxxxxx wrote:





Programming languages are coming fully equipped these days with the
necessary library of Unicode characters pre-installed as an
enumeration-type data base.  It is quite easy to change any character
into its binary representation or its hexadecimal representation.  It
is possible also to render a character into its ‘ideograph’ and
display it on screen.  What is not possible however is to key in
characters of an exotic eastern language using an ASCII keyboard.

It is true that a keyboard for any language can be simulated ad hoc
as an on-screen tool in, say, Microsoft Word, but to this writer that
is not a viable means of mass encryption in a busy office that has to
communicate in large volume with China or Japan using, for example,
CJK from Unicode.  Alice, who lives in New York, needs to check her
work for ordinary transcription errors.  She also needs to check the
semantics of her text to some extent, and without at least a
colloquial understanding of the plaintext language of her message
that is impossible.  There is an obvious culture gap that needs to be
bridged.

To get to the point, there has to be a clearly defined interface,
satisfactory to all concerned, between editing by an interpreter who
knows the language and encryption by Alice who doesn’t.  An
apportioning of responsibilities, in other words, is needed.  That
interface is achieved by transforming the plaintext of a message, in
whatever language it may be, into the corresponding string of
hexadecimal characters that represents the same message in Unicode.
It is therefore the duty of the sending entity (who is not Alice) to
present the message text to Alice as a string of hexadecimal numbers.
She then deals with it as an objective file of hexadecimal numbers
only, however incoherent they may look, which the computer reads in
as an external file and enciphers with a stream cipher into cipher
text.

The crypto scheme being proposed here works as follows.

A message for encryption to any country is first transformed into a
long string of hexadecimal numbers from Unicode, separated by
ordinary blank spaces ‘ ’ or maybe ‘#’ signs.  For the purpose of
encryption, this long string of numbers is now abstracted as a string
of alphanumeric ‘words’, so that the elements, instead of being the
active digits of a hexadecimal number, become inert characters of
alphanumeric words from the familiar alphabet of ASCII.  Although
they started life as hexadecimal numbers, the numeric value of these
strings is now ignored.  A ‘word’ includes the separating space
between it and the next word.  The iconic value in ASCII is the only
meaningful value of each ‘pseudo’ digit of any word; the digits are
not powers of 16 any more.  They have been transformed from active
integers into passive, self-contained alphanumeric characters ready
for encryption.  Encryption, which comes next, is yet another
transformation, this time into cipher text.
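
As a rough illustration of the step just described, here is a minimal
Python sketch that turns a message into space-separated hexadecimal
'words' and back again.  The function names to_hex_words and
from_hex_words are made up for this example, and the choice of at
least four hex digits per word is an assumption, not something
specified above.

```python
def to_hex_words(text: str, sep: str = " ") -> str:
    """Render each character as its Unicode code point in uppercase hex."""
    # At least four hex digits per word (an assumption for readability).
    return sep.join(f"{ord(ch):04X}" for ch in text)


def from_hex_words(hex_words: str, sep: str = " ") -> str:
    """Invert to_hex_words: parse each hex 'word' back into a character."""
    return "".join(chr(int(word, 16)) for word in hex_words.split(sep) if word)


message = "日本語"
hex_words = to_hex_words(message)
print(hex_words)                       # 65E5 672C 8A9E
print(from_hex_words(hex_words) == message)
```

From here on the string "65E5 672C 8A9E" is handled purely as ASCII
characters, spaces included.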

The next step is to use ASCII_Pad Cipher to encipher every
alphanumeric character of the entire string, including spaces, i.e.
reading in everything character by character and enciphering it all
into random cipher text in the manner described earlier for that
cipher.

 This cipher has been already described in earlier postings but just
as a reminder,

(Key + Plaintext) MOD 127 (say) = Cipher text.

See also “ASCII_Pad Cipher” and “ASCII_Pad Cipher Source code” on
http://www.adacrypt.com
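
For readers who want to see the reminder formula in action, the
following Python sketch applies (Key + Plaintext) MOD 127 character
by character.  It is only an illustration of that one line of
arithmetic: the key handling (one fresh random key value per
character, drawn with Python's secrets module) and the names
make_key, encipher and decipher are assumptions of this sketch and
are not taken from the ASCII_Pad source code.

```python
import secrets

MODULUS = 127  # the "(say)" value from the formula; covers ASCII codes 0..126


def make_key(length: int) -> list[int]:
    """Assumed key format: one random value in 0..126 per plaintext character."""
    return [secrets.randbelow(MODULUS) for _ in range(length)]


def encipher(plaintext: str, key: list[int]) -> list[int]:
    """Cipher text = (Key + Plaintext) MOD 127, one character at a time."""
    if len(key) < len(plaintext):
        raise ValueError("key must be at least as long as the plaintext")
    if any(ord(ch) >= MODULUS for ch in plaintext):
        raise ValueError("plaintext must use ASCII codes 0..126 only")
    return [(k + ord(ch)) % MODULUS for k, ch in zip(key, plaintext)]


def decipher(ciphertext: list[int], key: list[int]) -> str:
    """Plaintext = (Cipher text - Key) MOD 127."""
    return "".join(chr((c - k) % MODULUS) for c, k in zip(ciphertext, key))


pseudo_plaintext = "65E5 672C 8A9E"   # hex words, spaces included
key = make_key(len(pseudo_plaintext))
ciphertext = encipher(pseudo_plaintext, key)
print(decipher(ciphertext, key) == pseudo_plaintext)
```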

Private comments.

An obvious criticism might be “there is cipher text expansion here”,
i.e. the hexadecimal string is possibly several times as large as the
plaintext string it represents.  That is not due to this encryption
method, however, but is fallout from Unicode that will occur anyway.
It is jumping the gun to say that this is going to be true for all
languages in Unicode.  It may not be true for a lot of ideographs in
CJK, for instance, that are composed of many glyphs, and it might
even result in cipher text compression instead.

One reader asks why stop at hexadecimal: why not use base 95 instead
of base 16 and reap the benefit of reduced cipher text expansion, or
even negative expansion (compression), when using CJK and the many
glyphs that comprise an ideograph?  In the writer’s view this is
clearly feasible and well worth investigating as the subject of an
experimental crypto-lab workshop.  Note: base 95 is very salient in
this connection.  Such a base could use the ‘Pos’ and ‘Val’
attributes of ASCII to evaluate any ‘digit’ in the cipher text for
both encryption and decryption.  These two operations are attributes
of an enumeration data type such as ASCII in modern programming
languages, and it would be possible to use them only with a base 95
taken from the subset of 95 characters of ASCII that comprise the
standard keyboard.

Using an extended number base, e.g. base 95 or whatever base works
best, can reduce the cipher text, in favourable cases to less than
one half of its previous size.  That benefit is attributable entirely
to the ploy of using such an elevated number base.  From a practical
point of view it simply means an extra primary transformation of the
hexadecimal string from Unicode into base 95 or whatever, before then
abstracting the result into alphanumeric data prior to encryption.

Such a scheme would therefore initially use a customized alphabet
comprised of the sixteen digits of the hexadecimal number base as its
primary encryption alphabet, to change natural language characters
from all over the world into hexadecimal numbers; this is to be done
by the interpreter.  Alice would next transform these hex numbers
into base 95 (say) and then abstract the base-95 digits as
alphanumeric ‘words’ in ASCII preparatory to encryption, which she
also does.  The result should almost certainly be a reduced amount of
cipher text expansion and a proportionate saving in cipher text
transmission.
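
To make the extra transformation concrete, here is a small Python
sketch of one way it might be done.  It maps each code point to a
fixed-width group of base-95 'digits' drawn from the 95 printable
ASCII characters (codes 32 to 126), with ord and chr standing in for
the 'Pos' and 'Val' attributes.  The fixed width of four digits per
character, which removes the need for a separator, and the names
encode_base95 and decode_base95 are assumptions of this sketch, not
part of the scheme as posted.

```python
BASE = 95    # the 95 printable ASCII characters
FIRST = 32   # ord(' '), the first printable ASCII character
WIDTH = 4    # 95**4 = 81,450,625 exceeds hex 10FFFF, so 4 digits always suffice


def encode_base95(text: str) -> str:
    """Write every code point as WIDTH base-95 digits, most significant first."""
    groups = []
    for ch in text:
        value = ord(ch)                      # 'Pos' of the character
        digits = []
        for _ in range(WIDTH):
            value, remainder = divmod(value, BASE)
            digits.append(chr(FIRST + remainder))   # 'Val' of the digit
        groups.append("".join(reversed(digits)))
    return "".join(groups)


def decode_base95(encoded: str) -> str:
    """Invert encode_base95 by reading the string WIDTH digits at a time."""
    if len(encoded) % WIDTH:
        raise ValueError("length must be a multiple of WIDTH")
    chars = []
    for start in range(0, len(encoded), WIDTH):
        value = 0
        for digit in encoded[start:start + WIDTH]:
            value = value * BASE + (ord(digit) - FIRST)
        chars.append(chr(value))
    return "".join(chars)


encoded = encode_base95("日本語")
print(len(encoded))                     # 12 ASCII characters for 3 ideographs
print(decode_base95(encoded) == "日本語")
```

With a space-separated hex string the same three ideographs take 14
ASCII characters ("65E5 672C 8A9E"), so the saving in this small case
is modest.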

This cryptography is theoretically unbreakable on the back of being an
OTP derivative.

It is naïve to think that the change to all-round encrypting of
Unicode in commerce is going to be a simple one.  It is simplistic
also to think that working in Unicode will be facilitated just by an
easy transfer to block ciphers using binary digits.

The writer is intent on going for the jugular in terms of unbreakable
cryptography.  AES and RSA are not in this class of theoretically
unbreakable crypto strength, whereas ASCII_Pad and Vector_Cipher both
are!

Unicode as a global standard is intended to be around for a long time
to come; AES may not be.  Suggesting that there is permanency in any
cipher that is only ‘practically unbreakable’ is building a
figurative ‘house of cards’, and is not a reliable design platform
for long-term projections for Unicode or indeed anything else in
future cryptography.

At best, the scope of block ciphers is reduced considerably by going
over to Unicode.

Suggesting AES as a quick-fix transition to Unicode is bad judgment.

This ploy, i.e. using ASCII as a surrogate alphabet for Unicode, is
not a dilution of the global character of Unicode, just a piece of
convenient methodology in the west.

- Adacrypt.

Since the most common characters of the keyboard are lower case, I've
used a 49-character set to good effect: beyond the 47 keys, the first
extra is a space representation and the second is a shift to UC for
the following LC key.  So, with a slight increase in size, the whole
47 keys in both cases can be adequately represented, 49 being the
square of 7 and having immediate possibilities to be translated
efficiently to several other bases.

For efficiency's sake for the originators of computers, Unicode is
the tail that tries to wag the dog.  For universal syntax and
convenience, learn a more organized language that was consciously
developed with simplicity of representation in mind, or at least one
that does not become practically unlearnable, even for its native
users, because of needless complexity.  There are other reasons, but
I'd rather be satisfied with only a modest quota of ambiguity, and
there are many that meet or improve on that standard.


Further thinking - assuming base 95 is the chosen one - 95**3 =
857,375 falls short of the largest Unicode code point (hex 10FFFF =
1,114,111), which means four digit columns are required at least, and
there would still be a problem in marking the space needed to
segregate characters in the hexadecimal pseudo-plaintext being
transformed into base 95.

Comparing this with base 47: 47**3 = 103,823 does not reach the
supplementary planes that are already in use (CJK Extension B starts
at hex 20000 = 131,072), so base 47 needs four columns as well.  Four
columns give 47**4 = 4,879,681, which covers the full potential of
1,000,000 (+) code points without any need for a fifth column.  With
both bases needing four columns, there seems no point in going for
base 95, which would give no great advantage.

Comparing base 47 with base 16.

What deters the writer is the fact that hexadecimal needs only two
more columns (six in all) to satisfy all code points both now and in
the future, and ASCII_Pad has no problem reading everything in,
character by character, including spaces.

Given that Alice's time is still the most expensive element of any
crypto scheme, the question arises: is it going to be cost effective
to change from hexadecimal to a tertiary base that will reduce the
ciphertext expansion by only 1/3?  (That tertiary work would have to
be done by Alice.)
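
The column counts are easy to check mechanically.  The short Python
sketch below counts how many digit columns each base needs to reach
the largest Unicode code point, hex 10FFFF; the helper name
digits_needed is made up for this example.

```python
MAX_CODE_POINT = 0x10FFFF  # 1,114,111, the largest Unicode code point


def digits_needed(base: int, largest: int = MAX_CODE_POINT) -> int:
    """Smallest number of base-`base` digits that can represent `largest`."""
    count = 1
    capacity = base          # values 0 .. capacity-1 fit in `count` digits
    while capacity <= largest:
        capacity *= base
        count += 1
    return count


for base in (16, 47, 95):
    print(base, digits_needed(base))
# 16 6
# 47 4
# 95 4
```

Going from six hex columns to four columns in base 47 or base 95 is
the 1/3 reduction mentioned above, before any separators are counted.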

I hesitate to say 'no' myself, but as I see it this proposal would
have to be tested properly in a crypto lab exercise to confirm the
idea.

The other side of the coin is that ASCII_Pad working in ASCII as a
surrogate medium is very smooth and trouble free.  The question has
to be asked: is it wise to change it?  My own belief is that this is
the most computer-friendly crypto scheme of the three possibilities
currently on the table.  I say this in spite of years of very hard
work on vector cryptography that I am also promoting. - Cheers -
Austin