ECC Point Multipliers

I'm doing more research into point multipliers. Trying to answer DJBs
challenge of fast random point multiplication. Specially now that I
have decently fast fixed point code :-)

I did a profile of the LTC code and it seems 45% of the time [at least
on my 885] is spent in Montgomery reduction. If I unroll the reduction
code from TFM I get that down to about 41% or so.

Unfortunately the moduli aren't 64-bit friendly (because NIST can't
choose a prime to safe their lives...) so the reduction techniques they
have aren't efficient. A diminished radix approach is faster "on
paper" but in practice not really (unless you inline the whole thing).
Even if I doubled the reduction performance I'm still well above DJBs
performance for Athlon processors.

The real crux of it is that I have to reduce the # of point operations.

Which is why I'm trying to find out as many point multiplier as I can.
So far I've found

1. double-and-add, basic but slow
2. k-ary sliding double/add, what I use now
3. w-NAF and w-FAN, nice on paper, not faster in practice, at least
not on desktops
4. Fixed Point, not useful for this task
5. Montgomery ladders, slow but secure against DPA

In particular, are there ways to compute 2^j * P on prime field curves
with Jacobian co-ordinates faster than doing j doubles (and no, the
trick where you trade 2 squarings for a bunch of additions is not worth
it)? Ideally if I can trade j doubles for j-1 doubles that would
reduce the number of PT doubles I do by a quarter.

Are there efficient ways of computing 2^j * P + Q simultaneously?

I'm toying with the idea of a simultaneous inversion to convert my
k-ary precomp points to affine. But given I only have upto w/4 point
adds the savings will only be 5w/4 mults+reduction [240 for ecc-192,
320 for ECC-256, etc...]. A single inversion cost ~70-100 mults so the
savings will be trivial at best...

Where does DJB get his performance [outside of massively
inlining/unrolling code?]? I thought he was using a k-ary double/add