Re: loop unrolling tricks



On Mar 16, 1:59 pm, "Wei Dai" <use...@xxxxxxxxxx> wrote:
I'd like to report an assembly language optimization technique that I
haven't seen used before. The idea is to unroll a loop while
not increasing the code size as much as the usual way, by putting the loop
body into a separate naked function (i.e., without any explicit parameter
passing, prologue, or epilogue since it has knowledge of the caller's stack
and register allocations), and then calling it a fixed number of times
corresponding to the loop iteration count.

On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:

// original code
for (int i=1; i<=3; i++)
{
for (int j=0; j<i; j++)
{
inner_loop_body;
}
outer_loop_body;

}

// unrolled assembly
call label3
outer_loop_body
call label2
outer_loop_body
call label1
outer_loop_body
...
label1:
inner_loop_body
label2:
inner_loop_body
label3:
inner_loop_body
ret

Using these ideas I was able to do a 2048-bit RSA decryption in 9.7e6 cycles
on an Intel Core 2 (single-threaded, in 32-bit mode using SSE2
instructions).
(This code will be in the next version of Crypto++.) Naive loop unrolling
was
about 30% slower, apparently due to thrashing of the L1 instruction cache.

Would you mind giving an example of "Naive loop unrolling", just to be
sure that we understand it in the same way.


.



Relevant Pages

  • Re: Fastest method to copy/process a range of bytes?
    ... My problem with absolute addressing is that it is, well, absolute. ... The problem with my mult and div code is that I have another loop that needs to run very fast inside the x/y loop. ... > Another idea is to unroll the first inner loop. ... > avoided, saving 10 cycles. ...
    (comp.sys.apple2.programmer)
  • Re: Memset me up Scotty.
    ... >> I'm trying to pull out a fast memset routine out of my magic hat ... use a jnc as you do inside your loop. ... > MMX intrinsics. ... The only other way to unroll like that is ...
    (comp.lang.asm.x86)
  • Re: Memset me up Scotty.
    ... Hey Matt or anyone else that can help me out for that matter! ... use a jnc as you do inside your loop. ... > You can use MMX intrinsics with Duff's Device to get output approximately ... The only other way to unroll like that is to ...
    (comp.lang.asm.x86)
  • Re: The future of CPU architecture
    ... the only thing that worked properly was to unroll all 8 ... Each iteration took so long that the target PentiumPro cpu would have plenty of time to do the loop overhead instructions while waiting for the loop body to finish. ... Rijndael employs more but simpler iterations than DFC afair, this makes it much more likely that unrolling will help. ... All AES candidates had to run well in sw on the target P6 cpu, DFC had an achilles heel in that it required a 64x64->128 MUL in the core code, something which is hard to persuade a 32-bit compiler to generate fast code for. ...
    (comp.arch)
  • Re: piplining principles (and confusion!)
    ... there is no way to unroll the inner loop by ... It is loop invarient code that a compiler will hoist. ... > you won't flush the pipeline because the code is far off, ... Normally, this sort of ...
    (comp.lang.asm.x86)