Re: loop unrolling tricks



On Mar 16, 1:59 pm, "Wei Dai" <use...@xxxxxxxxxx> wrote:
I'd like to report an assembly language optimization technique that I
haven't seen used before. The idea is to unroll a loop while
not increasing the code size as much as the usual way, by putting the loop
body into a separate naked function (i.e., without any explicit parameter
passing, prologue, or epilogue since it has knowledge of the caller's stack
and register allocations), and then calling it a fixed number of times
corresponding to the loop iteration count.

On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:

// original code
for (int i=1; i<=3; i++)
{
for (int j=0; j<i; j++)
{
inner_loop_body;
}
outer_loop_body;

}

// unrolled assembly
call label3
outer_loop_body
call label2
outer_loop_body
call label1
outer_loop_body
...
label1:
inner_loop_body
label2:
inner_loop_body
label3:
inner_loop_body
ret

Using these ideas I was able to do a 2048-bit RSA decryption in 9.7e6 cycles
on an Intel Core 2 (single-threaded, in 32-bit mode using SSE2
instructions).
(This code will be in the next version of Crypto++.) Naive loop unrolling
was
about 30% slower, apparently due to thrashing of the L1 instruction cache.

Would you mind giving an example of "Naive loop unrolling", just to be
sure that we understand it in the same way.


.



Relevant Pages

  • Re: Fastest method to copy/process a range of bytes?
    ... My problem with absolute addressing is that it is, well, absolute. ... The problem with my mult and div code is that I have another loop that needs to run very fast inside the x/y loop. ... > Another idea is to unroll the first inner loop. ... > avoided, saving 10 cycles. ...
    (comp.sys.apple2.programmer)
  • Re: Memset me up Scotty.
    ... Hey Matt or anyone else that can help me out for that matter! ... use a jnc as you do inside your loop. ... > You can use MMX intrinsics with Duff's Device to get output approximately ... The only other way to unroll like that is to ...
    (comp.lang.asm.x86)
  • Re: Memset me up Scotty.
    ... >> I'm trying to pull out a fast memset routine out of my magic hat ... use a jnc as you do inside your loop. ... > MMX intrinsics. ... The only other way to unroll like that is ...
    (comp.lang.asm.x86)
  • Re: piplining principles (and confusion!)
    ... there is no way to unroll the inner loop by ... It is loop invarient code that a compiler will hoist. ... > you won't flush the pipeline because the code is far off, ... Normally, this sort of ...
    (comp.lang.asm.x86)
  • Re: loop unrolling tricks
    ... The idea is to unroll a loop while ... corresponding to the loop iteration count. ... mispredictions, and 2) using a register to hold ...
    (sci.crypt)