Re: loop unrolling tricks



On Mar 16, 1:59 pm, "Wei Dai" <use...@xxxxxxxxxx> wrote:
I'd like to report an assembly language optimization technique that I
haven't seen used before. The idea is to unroll a loop while
not increasing the code size as much as the usual way, by putting the loop
body into a separate naked function (i.e., without any explicit parameter
passing, prologue, or epilogue since it has knowledge of the caller's stack
and register allocations), and then calling it a fixed number of times
corresponding to the loop iteration count.

On a modern CPU where predictable branches are almost costless, there are
still a couple of reasons to unroll a loop: 1) it's a nested loop and the
iteration counts of the inner loop are not constant, thereby causing branch
mispredictions (e.g., Comba multiplication), and 2) using a register to hold
the iteration count would cause a register spill (e.g., AES encryption on
x86).
In the first case this technique can be combined with another trick, which
could be called overlapping functions, to make the branches fully
predictable. Here's an illustration:

// original code
for (int i=1; i<=3; i++)
{
for (int j=0; j<i; j++)
{
inner_loop_body;
}
outer_loop_body;

}

// unrolled assembly
call label3
outer_loop_body
call label2
outer_loop_body
call label1
outer_loop_body
...
label1:
inner_loop_body
label2:
inner_loop_body
label3:
inner_loop_body
ret

Using these ideas I was able to do a 2048-bit RSA decryption in 9.7e6 cycles
on an Intel Core 2 (single-threaded, in 32-bit mode using SSE2
instructions).
(This code will be in the next version of Crypto++.) Naive loop unrolling
was
about 30% slower, apparently due to thrashing of the L1 instruction cache.

Would you mind giving an example of "Naive loop unrolling", just to be
sure that we understand it in the same way.


.