SHA-256 logic circuit / bitcoin mining



hi

Some time ago I've posted here about my SHA-256 speedup attempt via boolean functions (logic circuits). Here's a more rigorous viability analysis (cross-post from comp.theory):

----
SHA-256 has
7*64 + 48*3 + 8 = 600 add operations
6*64 + 4*48 = 576 bitwise rotations
2*48 = 96 bitwise shifts
12*64 + 4*48 = 960 bitwise operations
(number might be slightly different for different implementation)

Together that would be 2232 operations, only 600 of which are additions. But additions make the meat of cryptographic diffusion and require many logic gates to implement.

On a CPU which requires one cycle per each operation it takes 2232*32 = 71424 cycles to compute 32 SHA-256 outputs.

Logic gate implementation doesn't need to implement rotations and shifts as they just reorder inputs.
Full 32-bit adder requires 160 bit ops,
So what's left is 600 * 160 + 960 * 32 = 126720.
(It can be further reduced a bit through constant propagation of zeroes from shifts.)

So we have 71424 against 126720. That's worse, but not five times worse, not even twice worse.

Furthermore, some CPU architectures do not implement rightrotate natively and have to implement it though three operations (two shifts and OR). They would require 3384*32 = 108288 operations, which is pretty close.
----

(This would be NVIDIA GPU. One with native ROTR is AMD GPU (which also implements MUX instruction which can shave cycle count even further).

So number of operations is close, but logic circuit implementation allows additional optimizations. Even 20% speedup would make it competitive on NVIDIA GPU and ~50% speedup -- on AMD GPU.

What optimizations are possible?

If we're doing pre-image attack or bitcoin mining, we can fix values of some variables and vary only few of them, say, 32 (size of nonce to pick in bitcoin mining). Fixing 736 variables out of 768 enables one to do 'constant propagation' kind of optimization.

This applies both to logic circuit implementation and to a traditional one, but logic circuit one makes more use of it, as it can optimize parts of XOR or ADD with a constant, while traditional implementation would just do full operation.

Then, it's possible to use some symmetries to reduce number of computations required, say, via binary decision diagrams. But that works only for the first few rounds, after that it's fully diffused and BDDs will be basically equivalent to full truth tables. You can't compress a random signal, you know. So savings aren't huge. (Although I don't yet know the point at which symmetry is totally lost.)

I played with these and other ideas for a while, and it looks like there's simply no way to compute SHA-256 significantly faster, i.e. optimizations are minuscule.

There is however one thing which bugged me... Let's say we're looking for input such that output is all ones (it is easy to convert any pre-image attack to this form), so we need to verify that

O1 and O2 and O3 and .. and O32 = TRUE

Now imagine that outputs O1, O2 .. O32 are represented in form of BDD or a disjunctive normal form (DNF). As SHA-256 is supposed to be indistinguishable from random, DNF has about 2^31 member, and BDD is huge. Or if you look at truth table, it is 50% ones and 50% zeros randomly mixed.

Now if you consider a conjucnction (O1 and O2) it has only 25% of ones, and thus smaller DNF (and BDD). Likewise, (O1 and O2 and O3) has even less ones, and so on, until we are left just with handful of ones -- or none at all -- which are actually a solution to our problem.

So there is a great simplification at the last step, can we drive it down so whole computation gets simpler?

If we are going to compute it precisely -- I don't think so.

But we do not need to be exact here. Just like in number theory there are probabilistic primality tests, we only need to find such inputs where output is POTENTIALLY true with probability higher than 50% (higher = better), and then do full check on them.

Alternatively, we can rule out inputs which are definitely false and do a full check on the rest. If first part can be done cheaply, we'll have a speedup compared to baseline.

One way to do this is to find some approximate truth table where zero is actually zero, and one is "maybe one, maybe zero", or "unknown", in other words.

When logic circuit is converted into a form where it has only AND and OR gates it is monotonic, i.e. flipping one input to 1 might only flip ouput zeros to ones, but never ones to zeros.

This means we can inject extra ones into truth function at any step (say, to make it more symmetric) and that would only inject some fake ones input output, but we can still trust zeros.

My first attempt was to inject extra one directly into input variable, i.e. assume both X1 and NOT(X1) are true at once. But it doesn't work very well -- output is all ones. This happens because this lossy representation injects 1/8 more ones for each AND operation, and so it quickly turns to all ones.

Next try was to replace one of logic gates with 1 without actually computing it. Then output of a whole function is correct on input combinations where that gate's output was indeed one, i.e. in 1/2 or 1/4 of all combinations. Otherwise it might be true or false depending on luck.

So that's already some speedup, as there is no need to compute output of that gate (although a very small one), and if that fudge doesn't affect output of all gates we still get lots of zeros.

When we look at it from truth-table point of view, essentially we replace truth table of one gate with all ones leaving all other intact.
This is probably not the best way to do it. What if we inject few ones at different nodes as needed, i.e. to make BDD more symmetric? I.e. if we have a fork node in BDD

0101 0100

We can replace it with eq node 0101 = 0101 OR 0100. And potentially we need to do less operations when we have more eq nodes.

But this shit is fairly complex and I doubt it will work, so I'm considering other options, say, working with DNFs.

In DNF, conjunction with fewer variables produces more ones, so maybe it makes sense to start with those to get at least some ones cheaply?

.