## [zcash/zcash] Understand how to optimize a circuit implementation of Groth16 verification (#3425)

Next is the final exponentiation, for which we need arithmetic in F_{p12} and its cyclotomic subgroup G_{Φ6}(F_{p2}).

For multiplication in F_{p12}, we can use Karatsuba over F_{p6} which gives 3M = 45C. For squaring we can use Complex squaring over F_{p6} which gives 2M = 30C.

(Those costs are for the tower F_{p2} → F_{p6} → F_{p12}. If we instead used F_{p2} → F_{p4} → F_{p12} with Toom-3 over Karatsuba over Karatsuba, multiplication would require 15m~ = 45C and squaring 10m~ = 30C, so there would be no difference.)

For squaring in G_{Φ6}, we represent elements in the compressed representation (g_{2}, g_{3}, g_{4}, g_{5}) of [Karabina](https://eprint.iacr.org/2010/542.pdf). We can use the main algorithm in Karabina’s paper which requires 4m~ = 12C, or the «SQR_{2345}» one from that paper (with 2 squarings in F_{p4}) which is also 12C, or the algorithm in section 5.2 of https://www.iacr.org/archive/eurocrypt2011/66320047/66320047.pdf which requires 6s~ which again is 12C. So we can see that this saves significantly over the 30C needed to do squarings directly in F_{p12}.

For multiplication in G_{Φ6}, we cannot use the compressed representation. In the case needed for the final exponentiation where the exponent is sparse, it is most efficient to compute all of the squarings in compressed format and then decompress the ones we need. A decompression uses the formulae:

g_{0} = (2.g_{1}^{2} + g_{2}.g_{5} — 3.g_{3}.g_{4}).ξ + 1

g_{1} = { (g_{5}^{2}.ξ + 3.g_{4}^{2} — 2.g_{3}) / 4.g_{2}, if g_{2} ≠ 0

{ (2.g_{4}.g_{5}) / g_{3}, if g_{2} = 0

The subexpressions common to both cases are g_{1}^{2} and g_{3}.g_{4}. We require 3C for a «boolean scaling» which computes a boolean b which is 0 when g_{2} = 0, otherwise 1:

(1 — b) × (b) = 0

(g_{2}) × (λ) = (b)

(μ) × (b) = (g_{2})

and then 4C to select the numerator and denominator of the division (since each element of F_{p2} is 2 field elements). Note that computing two divisions and then paying 2C for a single selection is less efficient; trying to combine the boolean scaling with the selection (which would work if we only had a single selection) is also less efficient.

We then need 2s~ + m~ to compute g_{5}^{2}, g_{4}^{2}, and g_{4}.g_{5}, and i~ for the division. Overall this costs s~ + 2m~ + 3C + 4C + 2s~ + m~ + i~ = (2 + 6 + 3 + 4 + 4 + 3 + 3)C = 25C for each decompression.

So when u is sparse with k bits set and is ℓ bits in length, exponentiation by u (which is a component of the «hard part» of the final exponentiation) costs ℓ \* 12C + (k-1) \* (25 + 45)C = ℓ \* 12C + (k-1) \* 70C. The Karabina squaring and decompression pays off whenever (30 — 12)\*ℓ > 25\*(k-1), i.e. whenever k < 0.72\*ℓ + 1, which is true for typical pairing-friendly curves.
We also need p-power and p^{2}-power Frobenius operations, i.e. g ↦ g^{p} and g ↦ g^{p2}, for which we can use the formulae in section 3.2 of https://eprint.iacr.org/2010/354.pdf . These formulae only involve constant multiplications and conjugations so they are free in the circuit.

For the curve used in [Aranha, Karabina, Longa, Gebotys and López](https://www.iacr.org/archive/eurocrypt2011/66320047/66320047.pdf):

> The final exponentiation executes in total 1 inversion, 4 conjugations, 15 multiplications, 3 u-th powers, 4 cyclotomic squarings, 5 p-power Frobenius, 3 p^{2}-power Frobenius.

So that is:

* 45C for the inversion in F_{p12};

* the conjugations are free;

* 15 \* 45C for the multiplications in F_{p12};

* 3 \* (62 \* 12C + 2 \* 70C) = 3 \* 884C for the 3 u-th powers;

* 4 \* 12C for the cyclotomic squarings;

* the Frobenius computations are free

giving a total of 6120C for the final exponentiation.