[zcash/zcash] Understand how to optimize a circuit implementation of Groth16 verification (#3425)

Next is the final exponentiation, for which we need arithmetic in Fp12 and its cyclotomic subgroup GΦ6(Fp2).

For multiplication in Fp12, we can use Karatsuba over Fp6 which gives 3M = 45C. For squaring we can use Complex squaring over Fp6 which gives 2M = 30C.

(Those costs are for the tower Fp2 → Fp6 → Fp12. If we instead used Fp2 → Fp4 → Fp12 with Toom-3 over Karatsuba over Karatsuba, multiplication would require 15m~ = 45C and squaring 10m~ = 30C, so there would be no difference.)

For squaring in GΦ6, we represent elements in the compressed representation (g2, g3, g4, g5) of [Karabina](https://eprint.iacr.org/2010/542.pdf). We can use the main algorithm in Karabina’s paper which requires 4m~ = 12C, or the «SQR2345» one from that paper (with 2 squarings in Fp4) which is also 12C, or the algorithm in section 5.2 of https://www.iacr.org/archive/eurocrypt2011/66320047/66320047.pdf which requires 6s~ which again is 12C. So we can see that this saves significantly over the 30C needed to do squarings directly in Fp12.

For multiplication in GΦ6, we cannot use the compressed representation. In the case needed for the final exponentiation where the exponent is sparse, it is most efficient to compute all of the squarings in compressed format and then decompress the ones we need. A decompression uses the formulae:

    g0 = (2.g12 + g2.g5 — 3.g3.g4).ξ + 1
    g1 = { (g52.ξ + 3.g42 — 2.g3) / 4.g2, if g2 ≠ 0
            { (2.g4.g5) / g3, if g2 = 0

The subexpressions common to both cases are g12 and g3.g4. We require 3C for a «boolean scaling» which computes a boolean b which is 0 when g2 = 0, otherwise 1:

    (1 — b) × (b) = 0
    (g2) × (λ) = (b)
    (μ) × (b) = (g2)

and then 4C to select the numerator and denominator of the division (since each element of Fp2 is 2 field elements). Note that computing two divisions and then paying 2C for a single selection is less efficient; trying to combine the boolean scaling with the selection (which would work if we only had a single selection) is also less efficient.

We then need 2s~ + m~ to compute g52, g42, and g4.g5, and i~ for the division. Overall this costs s~ + 2m~ + 3C + 4C + 2s~ + m~ + i~ = (2 + 6 + 3 + 4 + 4 + 3 + 3)C = 25C for each decompression.

So when u is sparse with k bits set and is ℓ bits in length, exponentiation by u (which is a component of the «hard part» of the final exponentiation) costs ℓ \* 12C + (k-1) \* (25 + 45)C = ℓ \* 12C + (k-1) \* 70C. The Karabina squaring and decompression pays off whenever (30 — 12)\*ℓ > 25\*(k-1), i.e. whenever k < 0.72\*ℓ + 1, which is true for typical pairing-friendly curves. We also need p-power and p2-power Frobenius operations, i.e. g ↦ gp and g ↦ gp2, for which we can use the formulae in section 3.2 of https://eprint.iacr.org/2010/354.pdf . These formulae only involve constant multiplications and conjugations so they are free in the circuit.

For the curve used in [Aranha, Karabina, Longa, Gebotys and López](https://www.iacr.org/archive/eurocrypt2011/66320047/66320047.pdf):

> The final exponentiation executes in total 1 inversion, 4 conjugations, 15 multiplications, 3 u-th powers, 4 cyclotomic squarings, 5 p-power Frobenius, 3 p2-power Frobenius.

So that is:

* 45C for the inversion in Fp12;
* the conjugations are free;
* 15 \* 45C for the multiplications in Fp12;
* 3 \* (62 \* 12C + 2 \* 70C) = 3 \* 884C for the 3 u-th powers;
* 4 \* 12C for the cyclotomic squarings;
* the Frobenius computations are free

giving a total of 6120C for the final exponentiation.