Re: More LTC timings...

From: Eric Young (eay_nospam_at_pobox.com)
Date: 06/19/03


Date: Thu, 19 Jun 2003 08:59:45 +1000

Brian Gladman wrote:
> "Shill" <nobody@example.com> wrote in message
> news:bcppb1$2sjb$1@biggoron.nerim.net...
>
>>>As you suggest, in both cases this is almost certainly the stress
>>>on the processor caches - which is especially in evidence on the
>>>P4 with only 8k at L1.
>>
>>Northwood's 512 KB L2 cache is only 7 cycles away...
>>
>>www.aceshardware.com/Spades/read.php?article_id=20000192
>
>
> Interesting analysis. It seems that AES, which typically has large tables,
> hits the P4 quite hard though.
>
> But my main point is to back up Eric's comment that we should treat the
> often quoted naive TSC based algorithm timings with considerable scepticism.
>
> Brian Gladman

The classic case I remember was way back in the 'lets make fast DES
implementation days', when the variants needed to do the Unix
password were being implemented at high speed.
I had a version that used 8 6bit look ups, for 2k worth of tables.
UltraFast Crypt (I think that was the name), used 4 12bit lookups
for 64k worth of tables.
The 'measure speed' program used by UFC, was repeated encrypts of
the same key value. By this benchmark, it was about 2 times faster
than the 2k table version. BUT, if you changed the password each
time, the 2k table version was 2 times faster. These numbers were
being generated on a machine with a 32k cache. For a single pass,
the relevant values from the 64k table were being loaded into the L1
cache, and all subsequent uses came from the cache. The table
was only being partly loaded. For a password cracking, the
UFC test was crud.

Loading tables also applies to AES using big tables.
If you repeatedly encrypt
the same value, you will only load 160(?) values from your large
tables. How many times does a program repeatedly encrypt the same
value? I also tend to always measure cbc mode (perhaps I should
move to counter mode) since the calling convention and the 'byte
to work/endian conversion' overhead is also relevant.

I also believe I've seen quite a few algorithms where heavy
loop unrolling is bad for the pentium4. It has an 8k(?)
trace cache, which is the number of decoded instructions.
To do the initial decode, instructions are only processed
1 at a time. If your code does not fit in the trace cache,
it will always execute at one instruction per cycle.

eric



Relevant Pages

  • Re: More LTC timings...
    ... >> Loading tables also applies to AES using big tables. ... How many times does a program repeatedly encrypt the same ... > into the cache. ... which is the number of decoded instructions. ...
    (sci.crypt)
  • Re: Throttling Process CPU Utilization
    ... >> I run a bonch of BOINC processes on my machine (SETI@home, ... My processor chips cost $750 each, and these days, a 1 Megabyte ... L3 cache is pretty pathetic. ... instructions, and some of them were completely worthless. ...
    (comp.os.linux.misc)
  • Re: Throttling Process CPU Utilization
    ... >> I run a bonch of BOINC processes on my machine (SETI@home, ... My processor chips cost $750 each, and these days, a 1 Megabyte ... L3 cache is pretty pathetic. ... instructions, and some of them were completely worthless. ...
    (comp.os.linux.development.system)
  • Re: [PATCH] xen: core dom0 support
    ... 1f4f931501e9270c156d05ee76b7b872de486304) to improve pvops ... Well it's the L2 cache references which are being measured here, ... instructions, but we know that there's a lot more going on), ... It's measuring kernel stats too - and i very much saw the ...
    (Linux-Kernel)
  • Re: Hyperthreading vs. SMP
    ... >> How is memory contention maintained ... sharing the same cache. ... > the superscaler processor has multiple instructions in flight already ... > processor may also have speculative execution when conditional ...
    (linux.redhat)

Quantcast