Re: Conventional DES byte order?



According to Eric Young <eay_nospam@xxxxxxxxx>:
> Hmm... I think our definitions of endian are different.

Probably. I am talking about the mapping between input data and the
formal specification of DES, whereas you are talking about how your
implementation technically maps input data into CPU registers. I _think_
that the original poster was talking about the former, not the latter.

If you prefer, there is an "input order" (the order of arrival of the
input data bits from whatever transport medium is used), a "formal
order" (the numbering in the DES specification) and a "working order"
(how the implementation will internally store its intermediate values
within CPU registers). I was talking about the mapping between the input
order and the formal order. You are talking about the mapping between
the input order and the working order. The input-to-formal mapping is
about interoperability (the DES specification does not describe that
mapping, but all protocols and implementations that I know of choose the
same, big-endian mapping for that). The input-to-working mapping is
about performance, and it should be entirely internal to the
implementation (i.e. how the implementation performs its work must not
alter the final output, and users of the implementation need not be
aware of how things are done internally).
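
To make the input-to-formal mapping concrete, here is a minimal C sketch of the big-endian convention described above: formal bit 1 is the most significant bit of the first input byte, and formal bit 64 is the least significant bit of the eighth. The function names (load_be32, des_formal_bit, des_load_block) are made up for this illustration and are not taken from any particular DES implementation.

```c
#include <stdint.h>

/* Hypothetical helper: decode 4 bytes as one big-endian 32-bit word. */
static uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Return formal bit n (1..64, numbered as in the DES specification)
 * of an 8-byte input block, assuming the usual big-endian mapping:
 * bit 1 is the most significant bit of block[0].
 */
static int des_formal_bit(const unsigned char block[8], int n)
{
    int byte_index = (n - 1) / 8;
    int shift = 7 - ((n - 1) % 8);
    return (block[byte_index] >> shift) & 1;
}

/*
 * Read a block into two 32-bit words, before any permutation:
 * *hi receives formal bits 1..32 (bit 1 in its top position),
 * *lo receives formal bits 33..64.
 */
static void des_load_block(const unsigned char block[8],
                           uint32_t *hi, uint32_t *lo)
{
    *hi = load_be32(block);
    *lo = load_be32(block + 4);
}
```

How an implementation then shuffles these two words around internally (the "working order") is its own business, as long as the output bytes are produced with the same convention.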


> The numbering used inside the standard is sort of irrelevant. This is
> a standard that uses the least significant bit of the key bytes for
> parity, but call it bit 7.

Bit 8, actually. They number from 1, not from 0. I found it a bit
strange when I first implemented DES myself.
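
For illustration, a small C sketch of what that means in practice: the parity bit of each key byte is the least significant bit of the byte (bit 8 in the specification's 1-based, most-significant-first numbering), and DES key bytes use odd parity. The function name des_set_odd_parity is hypothetical.

```c
/*
 * Hypothetical helper: force the parity bit of one DES key byte.
 * The seven key bits are the upper bits of the byte (formal bits 1..7);
 * the parity bit is the least significant bit (formal bit 8). It is set
 * so that the whole byte contains an odd number of 1 bits.
 */
static unsigned char des_set_odd_parity(unsigned char b)
{
    unsigned v = (unsigned)b >> 1;  /* the seven key bits */
    unsigned ones = 0;

    while (v != 0) {
        ones += v & 1u;
        v >>= 1;
    }
    /* If the key bits hold an even number of ones, the parity bit is 1. */
    return (unsigned char)((b & 0xFEu) | ((ones & 1u) ^ 1u));
}
```

Applied to each of the 8 key bytes, this leaves the 56 effective key bits untouched and only rewrites the parity bits.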


Anyway, DES (and thrice more so 3DES) is quite CPU-hungry; a single DES
block encryption will use about 300 clock cycles. I don't think that how
the block is read into two 32-bit registers really impacts performance;
we are talking about less than a dozen clock cycles here. However,
performance of decoding of sequences of 4 (resp. 8) bytes into 32-bit
(resp. 64-bit) registers may become an important issue with algorithms
which munch heavy loads of data, namely hash functions. In my own
implementations of hash functions, in sphlib(*), I use a set of decoding
inline functions which are mapped, when possible, to direct accesses or
possibly some inline assembly. E.g., for i386, I use a direct 32-bit
memory access, followed by the assembly opcode "bswap" if I need a
big-endian 32-bit decode (for instance when implementing SHA-1). Aligned and
unaligned accesses must also be minded, especially for those
architectures where unaligned accesses trigger exceptions. Using these
"smart" accesses may boost performance up to about 20% in some rare
situations.
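
As a rough idea of what such decoding functions can look like, here is a portable C sketch. It is not the actual sphlib code: the names decode32be, decode32le, swap32 and host_is_little_endian are invented for this example, the endianness check is done at run time rather than at compile time, and a plain shift-based swap stands in for the "bswap" opcode. memcpy() is used for the memory access because it is safe for unaligned addresses; compilers commonly turn it into a single load on architectures that tolerate unaligned accesses.

```c
#include <stdint.h>
#include <string.h>

/* Portable byte swap; stands in for the x86 "bswap" opcode. */
static uint32_t swap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
}

/* Run-time probe: is the first byte in memory the least significant? */
static int host_is_little_endian(void)
{
    const uint16_t probe = 1;
    unsigned char first;

    memcpy(&first, &probe, 1);
    return first == 1;
}

/* Decode 4 bytes as a big-endian 32-bit word (e.g. for SHA-1). */
static uint32_t decode32be(const void *src)
{
    uint32_t w;

    memcpy(&w, src, sizeof w);  /* alignment-safe native load */
    return host_is_little_endian() ? swap32(w) : w;
}

/* Decode 4 bytes as a little-endian 32-bit word (e.g. for MD5). */
static uint32_t decode32le(const void *src)
{
    uint32_t w;

    memcpy(&w, src, sizeof w);
    return host_is_little_endian() ? w : swap32(w);
}
```

A tuned version would select the code path at compile time and could call these with pointers into the input buffer directly, whatever their alignment.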


> This problem does not exist on big-endian platforms :-).

It seems to me that the little-endian convention is currently winning
the endianness war. The x86 derivatives are little-endian only, and
most other platforms are endian-neutral: ARM, MIPS, Sparc, Alpha...
can use either big-endian or little-endian (this is a matter of
convention between the motherboard, the operating system and the
compilers), and most of the embedded systems I come upon have chosen
little-endian, because it allows convenient simulation on Windows
systems.


--Thomas Pornin

(*) http://www.crypto-hash.fr/modules/wfdownloads/singlefile.php?cid=9&lid=5