Re: LibTomCrypt ASN.1...



Phil Carmody wrote:
Shame - in competent hands G5s are noticeably faster than any
x86 variant per clock tick. (30% faster, I'd say, and that's
both for memory-bound and compute-bound tasks.)

I don't recall that being the general case. Overall the G5 has fewer
execution units, less cache with fewer ways, and IIRC it used an FSB.
Apple's "G4 vs G5" comparison puts best-case memory access at 135ns.
Opterons at a similar clock have lower latency.

The only significant advantage of the G5 is the ability to run two
double precision floating point ops per cycle. Even with SSE2, the
Opteron issues a packed op as two macro-ops, which take at least two
cycles since they target the same port.

As for instruction flow, it looks like the G5 groups things in lines of
up to five ops (each going to one of five ports) whereas the AMD core
does so in groups of three. As a result the ICU window is larger on the
G5 (200 units, or 40 lines of 5 ops) whereas the AMD has 24 lines of 3
macro-ops (72 in total).
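
The window sizes above are just lines times ops per line. A quick check
of that arithmetic, using only the figures quoted in this post (not
vendor documentation); `window_entries` is a made-up helper name:

```python
def window_entries(lines, ops_per_line):
    """Total in-flight ops a window of `lines` groups can hold."""
    return lines * ops_per_line

# Figures as quoted in the post above.
g5_window = window_entries(40, 5)   # G5: 40 lines of 5 ops
k8_window = window_entries(24, 3)   # AMD: 24 lines of 3 macro-ops

print("G5 window:", g5_window)
print("AMD window:", k8_window)
```

Note that a G5 "op" and an AMD "macro-op" are not the same unit of
work, so the raw counts are only a rough comparison.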

For raw ALU performance AMD still wins. It can issue up to 6 micro-ops
per cycle to the AGU and ALU units, provided their dependencies have
been satisfied. Most register/register instructions are single cycle,
and the load/store unit is accessible from all three ALU pipes.
According to a review at Ars Technica, on the G5 two dependent ALU ops
cannot issue back to back without a 1 cycle penalty. This is not true
in the AMD world.
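
To get a feel for what a 1 cycle penalty between dependent ALU ops
costs, here is a toy model of a pure dependency chain. It is only a
sketch: it assumes single cycle ops, ignores issue width, caches and
everything else, and `chain_cycles` is a hypothetical helper, not a
measurement of either chip.

```python
def chain_cycles(n_ops, latency=1, bubble=0):
    """Cycles to finish n_ops where each op needs the previous result.

    latency: cycles per op (assumed 1 for simple reg/reg ALU ops)
    bubble:  extra stall cycles between two dependent ops
    """
    if n_ops <= 0:
        return 0
    # The first op pays only its latency; every later op also waits
    # out the bubble before it can start.
    return latency + (n_ops - 1) * (latency + bubble)

no_penalty = chain_cycles(100, bubble=0)    # back-to-back issue
with_penalty = chain_cycles(100, bubble=1)  # 1 cycle stall per dependent op

print(no_penalty, with_penalty)
```

In this model a long, fully dependent chain takes almost twice as long
with the penalty. Real code mixes in independent ops that can fill the
gap, so the actual hit is smaller.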

For raw FPU performance the G5 has more issue ports per cycle. Assuming
they are symmetric and have the same latencies [or better] as AMD's, it
would perform better.

The L1 cache could be faster in G5 world [I haven't seen any latency
claims] as it is direct mapped, but it is also more likely to be
thrashed by large unrolled code (of the sort that G5 and AMD like).

The L2 cache is both larger and has eight times the ways in AMD. This
means you're way more likely to have an L2 hit than in the G5 world.
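
The effect of associativity on hit rate can be shown with a tiny cache
simulator. This is a minimal sketch with LRU replacement and made-up
sizes (64 lines of 64 bytes), not a model of either chip's real caches.
`count_misses` is a hypothetical helper.

```python
from collections import OrderedDict

def count_misses(addresses, num_lines, ways, line_size=64):
    """Count misses in an LRU set-associative cache (ways=1: direct mapped)."""
    num_sets = num_lines // ways
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // line_size
        s = sets[block % num_sets]
        if block in s:
            s.move_to_end(block)        # hit: mark most recently used
        else:
            misses += 1
            if len(s) >= ways:
                s.popitem(last=False)   # evict least recently used
            s[block] = True
    return misses

# Two addresses exactly one cache size apart, touched alternately.
cache_bytes = 64 * 64
trace = [0, cache_bytes] * 100

direct = count_misses(trace, num_lines=64, ways=1)
two_way = count_misses(trace, num_lines=64, ways=2)
print(direct, two_way)
```

With one way the two addresses keep evicting each other and every
access misses. With two ways both fit in the same set and only the
first touch of each misses.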

The memory bus is lower latency in Opteron world but both have
roughly the same bandwidth.

Overall I don't doubt there are specific algorithms that work better on
G5 than Opteron. I seriously doubt the "general case" of being 30%
more efficient, especially since the G5 was normally only compared
against the P4, which is vastly less efficient than the Opteron.

Now Intel is coming out with better cores. The "Core" series [I dunno
if it's what they have now or part of the MCW series] looks to copy a
lot from K8, except they widen the macro-op line to 4 and appear to
have two full 128-bit SSE ports and what looks like [IIRC] three
decently full ALU paths. I don't know if that means all three ALUs will
do the int/shift/rotate opcodes. The P4 had "two" ALUs and that only
meant simple integer ops, not shift/rotates.

Either way it looks to compare well against Opteron and would easily
beat G5 both on execution resources and memory bandwidth (something
Intel is sadly king at).

Tom
