Re: More LTC timings...

From: Mok-Kong Shen (mok-kong.shen_at_t-online.de)
Date: 06/19/03


Date: Thu, 19 Jun 2003 10:48:47 +0200


Tom St Denis wrote:
>
> Eric Young wrote:
> > Loading tables also applies to AES using big tables.
> > If you repeatedly encrypt
> > the same value, you will only load 160(?) values from your large
> > tables. How many times does a program repeatedly encrypt the same
> > value? I also tend to always measure cbc mode (perhaps I should
> > move to counter mode) since the calling convention and the 'byte
> > to work/endian conversion' overhead is also relevant.
>
> True but not always. My timing encrypts [decrypts] the ciphertext from
> the previous iteration so the input isn't the same even in a given
> timing loop. Brian's code is similar.
>
> In the case of the Athlon the cache lines are 64 bytes long so it
> doesn't take long for the entire table [of most ciphers] to be loaded
> into the cache. For example, in a single 128-bit key AES encryption
> there will be at least 160 table lookups or 40 per 8x32 table. An 8x32
> table will span 16 cache lines [on the P4/XP] which means there is a
> good chance the majority of the cache lines will be hit.
>
> > I also believe I've seen quite a few algorithms where heavy
> > loop unrolling is bad for the Pentium 4. It has an 8k(?)
> > trace cache, which is the number of decoded instructions.
> > To do the initial decode, instructions are only processed
> > 1 at a time. If your code does not fit in the trace cache,
> > it will always execute at one instruction per cycle.
>
> Again it's a play on diminishing returns [it's a 12k uop cache btw]. With
> the Athlons, provided the unrolled loop fits in a cache line, it's all
> good. I believe GCC actually will check if the code will fit in a cache
> line before unrolling.
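
As a rough check of the cache-line arithmetic quoted above, here is a minimal sketch in C. The constants are taken from the quoted figures (a 256-entry table of 32-bit words, 64-byte cache lines, 40 lookups per table). The function names are hypothetical, and the estimate assumes the lookups are uniform and independent, which real AES inputs only approximate.

```c
/* Assumed parameters, taken from the figures quoted above. */
#define TABLE_ENTRIES 256  /* an 8x32 table: 8-bit index, 32-bit entries */
#define ENTRY_BYTES   4    /* bytes per 32-bit entry */
#define CACHE_LINE    64   /* bytes per cache line */

/* Number of cache lines that one 8x32 table spans. */
static int table_lines(void)
{
    return (TABLE_ENTRIES * ENTRY_BYTES) / CACHE_LINE;
}

/*
 * Expected number of distinct cache lines touched by n lookups,
 * assuming each lookup hits any of the `lines` lines with equal
 * probability and independently of the others:
 *     lines * (1 - (1 - 1/lines)^n)
 */
static double expected_lines_touched(int lines, int n)
{
    double untouched = 1.0;  /* probability a given line is never hit */
    int i;

    for (i = 0; i < n; i++)
        untouched *= 1.0 - 1.0 / lines;
    return lines * (1.0 - untouched);
}
```

Under these assumptions, table_lines() gives 16, and expected_lines_touched(16, 40) comes out a little under 15, which is consistent with the claim that most of a table's cache lines are hit within a single encryption.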

In a previous post, you said that you were still trying
to get better efficiency out of your code. Is that work
finished now, and can one download the code to test on
one's own machine? As discussed in the other thread, the
performance depends quite a bit on the machine type and
not only on the MHz. I assume that installing your code
is straightforward.
Thanks in advance.

M. K. Shen


