D. J. Bernstein
Fast arithmetic

# The transposition principle

## Transposition of a matrix

Fix positive integers r and c. Fix a matrix M with r rows and c columns. The standard linear algorithm to compute Mx, given a vector x = (x_1, x_2, ..., x_c), takes rc multiplications by the entries of M and r(c-1) additions.

Some standard optimizations: If an entry of M is known to be 1, its multiplication can be skipped. If an entry of M is known to be 2, its multiplication can be replaced by an addition (a doubling). If an entry of M is known to be 0, its multiplication and corresponding addition can be skipped. (Exception: the addition cannot be skipped if there is no addition, i.e., if the entire row of M is known to be 0.)

Write M* for the transpose of M. The standard linear algorithm to compute M*y, given a vector y = (y_1, y_2, ..., y_r), takes rc multiplications by the entries of M and (r-1)c additions.

The transposition principle states that the multiplication by M* has almost exactly the same complexity as the multiplication by M. More precisely, it has the same number of multiplications and exactly r-c more additions. Furthermore, the standard optimizations save the same number of multiplications for M* as they do for M, and the same number of additions for M* as they do for M. (Exception: a row of M known to be 0 loses an addition for M; a column of M known to be 0 loses an addition for M*. In other words, transposition preserves the number of additions plus the number of nontrivial outputs.)

## Transposition of a circuit

A linear circuit is, by definition, a sequence of matrices M_s, ..., M_2, M_1 with sizes suitable for a product M_s...M_2M_1. The linear circuit specifies a computation of M_s...M_2M_1x, given a vector x: multiply x by M_1, multiply the result by M_2, and so on.

The transposition principle states that the computation for the transposed linear circuit M_1*, M_2*, ..., M_s* has almost exactly the same complexity as the computation for the linear circuit M_s, ..., M_2, M_1.

More precisely, the transposed computation has the same number of multiplications as the original computation, and the same number of additions plus the length of M_s...M_2M_1x minus the length of x. As above, the standard optimizations produce the same savings for the transposed computation as for the original computation. (Exception: a row of M_i known to be 0 loses an addition for M_s,...,M_1; a column of M_i known to be 0 loses an addition for M_1*,...,M_s*.)

Consider, for example, the linear circuit

```     2  ,  2 1  ,  2
1
```
with product (10). This circuit specifies the following computation of (10x), given (x): multiply (2,1) by (x), obtaining (2x,x) with one doubling; multiply (2 1) by the result, obtaining (5x) with one doubling and one addition; multiply (2) by the result, obtaining (10x) with one doubling. The total is three doublings and one addition. The transposed circuit
```     2 1  ,  2  ,  2
1
```
specifies a different computation of (10x), given (x): multiply (2) by x, obtaining (2x) with one doubling; multiply (2,1) by the result, obtaining (4x,2x) with one doubling; multiply (2 1) by the result, obtaining (10x) with one doubling and one addition. The total is again three doublings and one addition.

## Why the transposition principle is useful

The literature is littered with examples of people separately discovering linear circuits for M and linear circuits for M* without realizing that the circuits are transposes of each other. The transposition principle allows the circuits to be treated together, with speedups in either circuit translated immediately to speedups in the other. Understanding, analyzing, and improving a linear circuit also means understanding, analyzing, and improving its transpose.

Consider, for example, the following two problems:

• Product: Given a complex polynomial u of degree below n, and a complex polynomial v of degree at most n, compute the product uv. An efficient way to do this is to multiply u by v modulo x^2n-1, using three size-2n FFTs.
• Middle product: Given a complex polynomial u of degree below n, and a complex polynomial w of degree below 2n, compute floor((uw mod x^2n)/x^(n-1)); i.e., the coefficients of x^(n-1),x^n,x^(n+1),...,x^(2n-1) in the product uw. An efficient way to do this is to multiply u by w modulo x^2n-1, using three size-2n FFTs, and extract the coefficients of x^(n-1),x^n,x^(n+1),...,x^(2n-1).
With an obvious choice of bases, the matrix for the map from v to uv is the transpose of the matrix for the map from w to floor((uw mod x^2n)/x^(n-1)), and the above linear circuits are transposes of each other, so it is unsurprising that the circuits involve almost exactly the same amount of work. A better linear circuit for either map implies a better linear circuit for the other. (For example, tracing the initial 0s in the size-2n transform of v corresponds to tracing the unused coefficients of x^0,x^1,... in uw.)

On the other hand, my experience is that it is a mistake to use transposition in the statement of an algorithm. I don't agree with Bostan's assertion that ``the transposition principle can, and should, be strictly complied with in practice.'' For example, if I had described the middle-product circuit above as merely a transpose of the product circuit, without also describing it as multiplication mod x^2n-1, then I would have made the algorithm unnecessarily difficult to implement and to optimize, and I would have failed to make clear that one can compute middle products of integers as quickly as products of integers. The 2004 middle-product paper by Hanrot, Quercia, and Zimmermann is a good example of this obfuscation.

Consider, as another example, the following two linear circuits for the same problem:

• Brauer's addition chain for 31415. Given x, compute 2x = x + x, 3x = 2x + x, 4x = 3x + x, 5x = 4x + x, 6x = 5x + x, 7x = 6x + x, 14x = 7x + 7x, 28x = 14x + 14x, 61x = 28x + 28x + 5x, 122x = 61x + 61x, 244x = 122x + 122x, 490x = 244x + 244x + 2x, 980x = 490x + 490x, 1960x = 980x + 980x, 3926x = 1960x + 1960x + 6x, 7852x = 3926x + 3926x, 15704x = 7852x + 7852x, 31415x = 15704x + 15704x + 7x with 22 additions.
• Yao's addition chain for 31415, with Knuth's improvements. Given x, compute 2x = x + x, 4x = 2x + 2x, 8x = 4x + 4x, 16x = 8x + 8x, 32x = 16x + 16x, 64x = 32x + 32x, 128x = 64x + 64x, 256x = 128x + 128x, 512x = 256x + 256x, 1024x = 512x + 512x, 2048x = 1024x + 1024x, 4097x = 2048x + 2048x + x, 4105x = 4097x + 8x, 4617x = 4105x + 512x, 4681x = 4617x + 64x, 31415x = 4097x + 4105x + 4617x + 4617x + 4617x + 4681x + 4681x with 22 additions.
These circuits are transposes of each other, so it is unsurprising that they involve the same number of additions. Improvements to Brauer's chain (such as eliminating 4x and computing 5x as 2x + 3x) correspond to improvements in Yao's chain (such as ending with 4617x = 4105x + 512x, 9298x = 4617x + 4617x + 64x, 31415x = 4097x + 4105x + 4617x + 9298x + 9298x).

## Inventions and reinventions of the transposition principle

1956: One can build electrical circuits to perform linear computations. Bordewijk pointed out that transposing an electrical circuit produced a comparable-size electrical circuit for the transposed matrix. Bordewijk gave many examples of how a circuit could be understood together with its transpose.

Meanwhile, according to a subsequent paper by Kaminski, Kirkpatrick, and Bshouty, Lupanov pointed out the transposition principle for Boolean matrices. (Boolean matrices are nonnegative integer matrices modulo equivalence of all positive integers.) There's quite a bit of shoddy citation work in this area, as discussed below, so I'd really like to see an English translation of Lupanov's article.

1972: Fiduccia (Theorem 4) proved that a bilinear circuit for a matrix having m non-constant multiplications implied a bilinear circuit for the transposed matrix having m non-constant multiplications. A bilinear circuit is a generalization of a linear circuit; linear circuits are bilinear circuits where all multiplications are by constants. On the other hand, Fiduccia's theorem didn't say anything useful about linear circuits: the statement threw away all information about constant multiplications and additions, and the proof didn't make any effort to preserve the number of constant multiplications and additions.

1973: Fiduccia (pages 102-114) explained transposition for bilinear circuits in detail. Fiduccia proved (page 112, Theorems 4 and 5) that transposition preserved the number of non-constant multiplications, the number of constant multiplications (``dilations''), and ``the number of additions/subtractions plus the number of [nontrivial] outputs.'' This is a quite thorough statement of the transposition principle.

1976: Addition chains are linear circuits where all multiplications are by nonnegative integers and where multiplication by k is always implemented as k-1 additions; equivalently, linear circuits with all entries either 0, 1, or 2. Pippenger (page 259), discussing and improving upon Lupanov's results, commented that transposing an addition chain preserved the sum of the entries.

1981: Olivos, without the help of matrices, gave a laborious several-page proof that an addition chain for a p-by-1 matrix M of positive integers could be converted into an addition chain for M* having exactly p-1 more additions, and vice versa. Olivos cited Pippenger, but either failed to read, or failed to understand, Pippenger's comment about transposition. Anyway, Olivos's result is contained entirely within the result Fiduccia had published years earlier (never mind Bordewijk), so Olivos doesn't deserve any credit.

Knuth and Papadimitriou, following up on Olivos's paper, proved that an addition chain for a p-by-q matrix M of nonnegative integers, with all rows nonzero and all columns nonzero, could be converted into an addition chain for M* having exactly p-q more additions, and vice versa. Knuth and Papadimitriou said that they were using ``a graph-theoretic formulation of the problem---first employed by Pippenger''; but they either failed to read, or failed to understand, Pippenger's comment about transposition. Anyway, the Knuth-Papadimitriou result is contained entirely within the result Fiduccia had published years earlier (never mind Bordewijk), so Knuth and Papadimitriou don't deserve any credit.

1983: The transpose of a matrix M can be viewed as the Jacobian matrix of the map from x to Mx. Baur and Strassen proved, for more general maps f, that Jac(f) v can be computed in time comparable to f. The Baur-Strassen result allows a constant-factor slowdown, so it's worse than the previous results for M*. The possibility of using this algorithm to multiply by M* was mentioned briefly in a 1989 paper by Canny, Kaltofen, and Yagati, and used in a 1994 paper by Shoup. Shoup's paper introduced the name ``transposition principle.''

1988: Kaminski, Kirkpatrick, and Bshouty published transposition for linear circuits, calling it a ``modest generalization'' of the results by Lupanov and Pippenger. The Kaminski-Kirkpatrick-Bshouty result is identical to the constant-coefficient case of the result Fiduccia had published years earlier (never mind Bordewijk), so Kaminski, Kirkpatrick, and Bshouty don't deserve any credit.

I don't claim that this list is comprehensive: for example, a paper of Fischer and Warsitz lists several Jac(f) references going back to 1970.

## Misattributions of the transposition principle

A few people credit the transposition principle to Tellegen:
• A 1996 paper by Kaltofen (``Blocked iterative sparse linear system solvers for finite fields'') stated: ``By the transposition principle (Tellegen's theorem)  a black box for A can be converted to a black box for A* at no loss in time efficiency.'' Reference  is a 1970 book by Penfield, Spencer, and Duinker titled ``Tellegen's theorem and electrical networks.''
• A 1997 book by Buergisser, Clausen, and Shokrollahi states the transposition principle as ``(13.20) Tellegen's theorem'' on page 316. On page 349, the book says ``Tellegen's theorem (13.20), a basic result in network theory, see, e.g., [142, 416, 9], says that transposing the K-DAG for a matrix A gives a K-DAG for its transpose.'' Reference  is a 1969 book by Desoer and Kuh titled ``Basic circuit theory.'' Reference  is the 1970 book by Penfield et al. Reference  is a 1979 book by Antoniou titled ``Digital filters: analysis and design.''
• A 1997 paper by Villard (``Study of Coppersmith's algorithm''), after Theorem 9.1, gives similar credit to Tellegen.
• A 1997 paper by Eberly and Kaltofen states that the transposition principle is ``known in the control theory literature as Tellegen's theorem.''
• A 1998 paper by Kaltofen and Shoup says that the book by Buergisser et al. ``traced the transposition principle ... to Tellegen's theorem of control theory.''
• A 2003 paper by Bostan, Lecerf, and Schost, making heavy use of the transposition principle, is titled ``Tellegen's principle into practice.''

I see no justification for these credits to Tellegen. Bordewijk was a student of Tellegen, but Bordewijk's paper was not coauthored with Tellegen. ``Tellegen's theorem'' in control theory is not a principle about sizes of transposed circuits; it is the theorem that the total power of an electrical circuit is 0.

Here's what Antoniou's book actually says (second edition, Section 4.10.1). Consider a flow through a circuit; write x_{ij} for the flow into node j of the circuit via path ij, and write y_j for the flow out of node j. Similarly define x_{ij}' and y_j' for another flow. ``Tellegen's theorem'' then states that sum_i sum_j (y_jx_{ij}'-y_i'x_{ji}) = 0.

Antoniou states transposition in Section 4.10.4, but certainly doesn't call it ``Tellegen's theorem.''

Similarly, here's what the Desoer-Kuh book actually says. Page 392: Assign numbers v(e) and j(e) to each edge e in a circuit, in such a way that v satisfies Kirchhoff's voltage laws and j satisfies Kirchhoff's current laws. Page 393: ``Tellegen's theorem'' states that sum_e v(e)j(e) = 0. Page 396: Tellegen's theorem, when applied to voltages v and currents j, states that the circuit conserves energy. Page 398: Tellegen's theorem states that the circuit conserves complex power.

The Desoer-Kuh book discusses symmetric matrices in a long section starting on page 681, but I don't see any discussion of transposition for non-symmetric matrices. Again, I see no evidence that transposition is known as ``Tellegen's theorem'' in control theory.