# Scott Aaronson’s mini-course at UFF – 3rd lecture

Here are my notes on Scott’s third lecture, delivered this Wednesday, 18/12/2013.

Linear Optics: review

A set of n bosons can be described by what mathematicians call a symmetric product. An easier way of putting it is to think of a finite number n of photons, which can be found in m different modes. These modes can describe directions of propagation, position, or some other suitable property that can be used to discriminate them. If we have just a single photon, its state will be an superposition of all m modes with arbitrary complex amplitudes: $|\psi\rangle = \sum_{i=1}^m \alpha_i |i\rangle$.

Transformations on this single-photon state are described by a $m \times m$ unitary U. The role of circuit gates is played here by beam splitters and phase shifters, which implement an arbitrary unitary transformation on a dual-rail qubit (consisting of a single photon in a superposition of two modes). Here are some differences of this linear-optical set-up with respect to ordinary quantum computation:

1- In general quantum computing, a good approximation of an arbitrary n-qubit unitary requires an exponential number of gates. In linear optics, we need only $O(m^2)$ elementary optical elements to describe an arbitrary unitary, and the decomposition can be efficiently found using something similar to Gaussian elimination (Reck et al., 1994).

2- Given a quantum computation circuit, in general it’s hard to find elements of the unitary it specifies. In linear optics this becomes an easy task, given the efficient decomposition of an arbitrary linear-optical circuit into beam splitters and phase shifters. So we can talk interchangeably about the linear optical network and the unitary U that specifies it – this is justified as we can move easily between these two descriptions.

Observation: we could encode any computation on n qubits in a single photon and $2^n$ modes, but this is not an efficient encoding – the energy or other resource (such as required accuracy in dynamics or final measurements) would blow up exponentially. We need to restrict ourselves to doing things in such a way that all resources remain bounded by a polynomial in the number of photons n. This means that to get some advantage using linear optics, we’ll need to use multiple photons.

Multiple photons

Our multi-photon states are described by a list of mode occupation numbers $s = (s_1, s_2,...,s_m)$. Let n be the total number of photons, which is preserved in a linear-optical circuit. $\phi_{m,n}$ is defined as the set of all lists of m mode occupation numbers $(s_1,s_2, ..., s_m)$ such that they add up to n. Let M denote the number of different configurations of mode occupations, i.e. $M= |\phi_{m,n}|$. A simple exercise in combinatorics shows M is given by the binomial coefficient (m+n-1 choose n). This means M grows exponentially with n , $M \sim m^n$, i.e. this gives us an exponentially large Hilbert space. This is necessary for any advantage, but not sufficient, as can be clearly seen from known counter-examples (such as stabilizer states of n qubits operated on by Clifford circuits).

Our state of n photons in a linear-optical quantum computer is described by a superposition over all possible states in configuration space: $|\psi\rangle = \sum_{s \in \phi_{m,n}} \alpha_s |s\rangle$. We’ll be working in the “dilute limit” m >> n; in the future we’ll consider the situation $m \sim n^2$.

We’ll start with a standard initial state $\left| I \right\rangle = \left| 111....00000 \right\rangle$, with the first n modes having a single photon, and the remaining m-n modes with no photons. What do unitary transformation look like in this larger Hilbert space? Beautiful result: the unitary that determines single-photon dynamics determines also the multiple-photon dynamics. The mxm matrix U that describees the interferometer completely determines what the n photons do, which is specified by the MxM unitary matrix $\phi(U)$. This is a homomorphism, a representation of the unitary group.

Permanents and the permanent formula

Now we can give the formula for the transition amplitudes corresponding to n photons in the mxm interferometer U. Let S, T denote lists of occupation numbers $\in \phi_{n,m}$, for example associated with the input and output states. Then the transition amplitudes are given by:

$\left\langle S |\phi(U)|T\right\rangle= \frac{per(U_{S,T})}{\sqrt{s_1!...s_m!t_1!...t_m!}}$.

Matrix $U_{S,T}$ is obtained from U by taking $s_i$ copies of the $i$th row of U and $t_j$ copies of the $j$th column of U, for all $s_i,t_j$.

The appearance of the permanent function justifies a digression about it. Before doing that, let’s note there’s a similar formula for the amplitudes associated with linear interferometers for fermions, only there it is the determinant function that shows up.

The permanent is a well-known function to computer scientists, but maybe not so much to physicists. As a curious anecdote, in the new quantum mechanics textbook by Weinberg, he states that the dynamics of fermions requires the computation of the so-called Slater determinants; and the same holds for bosons, except that the determinant now doesn’t have minus signs. This is exactly the definition of the permanent.

The permanent is similar to the determinant in many respects, and very different in other respects, especially when we consider their computational complexities.

$per(A)= \sum_{\sigma \in S} \prod_{i=1..n} a_{i, \sigma_i}$. The sum is over all permutations; note the absence of minus signs for odd permutations. Scott then went through a couple of examples, calculating the permanent of a 2×2 and a 3×3 matrix.

We can give a recursive (although not efficient) formula for the permanent. Let $A_a... A_n$ be A’s bottom minors. Then $per(A)=a_{11}*per(A_1) + a_{12} per(A_2)+...+a_{1n}*per(A_n)$.

For the calculation of the determinant, Gaussian elimination can be used to reduce the matrix to diagonal form, and the operations used do not change the determinant. The complexity of a simple algorithm for the calculation of determinants is $O(n^3)$, with the best known (although not very practical) algorithm running in time $O(n^{2.373})$. The complexity of multiplying integers can be done in something close to $nlogn$ time, using the fast Fourier transform. This can’t be extended to matrices because they don’t commute.

Regarding the calculation of the permanent, the naive algorithm that relies directly on its definition has runtime $O(n!)$. The lack of minus signs in the definition changes everything with respect to the determinant. The best known algorithm for the permanent of a $n\times n$ matrix has runtime $O(n2^n)$, and is known as Ryser’s algorithm (actually this is a similar result, known as Glynn’s formula):

$per(A) = 1/2^{n} \sum x_1 x_2 ... x_n \prod (a_{i1} x_1 + ... + a_{in} x_n)$, where the sum is over all assignments of variables $x_j=\pm 1$, and the product is over $i=1, ..., n$ [sorry, some wordpress/latex limitations here].

Why does this formula work? All of the terms of the Glynn formula in which some x_i gets raised to a higher power than 2 can be observed to cancel each other out, because of the presence of the x_1…x_n term.  And the remaining terms (those involving x_1^2 … x_n^2) are precisely the permanent.

The formula requires $n^2 2^n$ operations, and this can come down to $n 2^n$ by being smarter about iterating over all assignments x_i, using Gray code order (flipping one bit at a time to loop more efficiently through all variable settings).

Let us consider some of the symmetries of the permanent and the determinant. We know how linear row and column operations affect the determinant, and this is what makes Gaussian elimination possible. For the permanent, we know that row or column permutations leave the permanent invariant. Multiplying any row or column by a constant scales the permanent by the same constant. In the case of the permanent, however, we can’t do Gaussian elmination, as it requires more general linear operations on rows/columns, which will change the value of the permanent.

Why are permanents so hard to compute? There’s a better explanation of this than just saying “we haven’t found a better algorithm so far”. In 1979, Leslie Valiant showed that calculating the permanent is #P-complete by reducing the problem of counting the solutions of an NP-complete problem to the calculation of a permanent. He proposed an encoding of the 3-SAT problem into a matrix, so that finding the number of satisfying variable assignments corresponded to calculating the matrix permanent.

There’s another well-known interpretation for the permanent. Given the adjacency matrix for a bipartite graph, the permanent corresponds to the number of different bipartite matchings there exist between the two sets of vertices.

Easy permanents

Are there classes of matrices for which the permanent is easier to compute? Here are two important results:

1- Jerrum-Sinclair-Vigoda found a poly-time randomized algorithm to approximate per(A) for nonnegative entries. While calculating these permanents exactly is still #P complete, they showed that a multiplicative approximation is possible, i.e. we can get an approximation $(1\pm \epsilon) per(A)$. The  proven runtime is $n^8$, but it’s believed this can be improved on.

2- In  2002 Gurvits showed that one can approximate per(A) in poly time, with a randomized algorithm, even if A is a matrix with complex entries. The approximation here, however, is done within additive error only, i.e. the error is at most $\pm \epsilon ||A||^n$ , where $||A||$ is the norm of A, denoting its largest singular value. The algorithm runs in $O(n^2/\epsilon^2)$ time. The basic idea behind Gurvits algorithm is relatively simple: take the Ryser algorithm, but instead of summing over all x’s, estimate the whole sum by picking $n^2$ samples randomly and uniformly.

So Gurvits’ algorithm sounds like a bad omen for the hope that linear optical circuits can be hard to simulate. There is, however, a catch: for almost any unitary U, the permanent will be exponentially small, so the additive approximation of Gurvits is no good – typically one might as well always estimate its as being zero!

Here’s an open problem: are there examples of unitaries which don’t look like the identity (i.e. don’t have a diagonal with large product), but which have large permanents?

Condensed matter parenthetical remark: finding ground states of fermionic systems is hard due to the sign problem (this task is easy for bosons). The situation is the opposite for dynamics. Moral: the negative sign can be both your friend or your enemy!

Going back to the permanent formula for boson sampling amplitudes, we can ask a few questions about it:

1- Why is it true?

2- Is $\phi(U)$ really a unitary?

3- Is the map from U to $\phi(U)$ a homomorphism? Does it make mathematical sense, i.e. is it true that $\phi(UV)=\phi(U)*\phi(V)$?

An easier way to check the truth of these 3 points is to turn to creation/anihilation operators, and obtain this formula with this approach. The connection works for a reason very much related to the reason that Ryser’s formula works; please check the discussion that accompanied Ryser’s formula above.

We call a state $\left|S \right\rangle$ “collision-free” if all $s_i$‘s are 0 or 1, i.e. there’s no photon bunching. From Scott’s point of view, the Pauli Exclusion Principle for fermions can be stated as follows: repeated rows results in zero determinant. For the permanent, repeated rows enlarge the permanent, hence boson bunching!

Bosonic birthday and permanents for classical particles

Speaking of bunching, let’s discuss the so-called bosonic birthday paradox, discussed by Scott and Arkhipov and Arkhipov and Kuperberg. The (classical) birthday paradox is this: if more than 23 people gather in a room, there’s a better-than-even chance that at least two will share a birthday. It is only called a paradox because most people would guess that this would require something like the number of different days of the year (365), when it only needs $O(\sqrt{N})$, where N=number of days. Bosonic bunching makes this effect stronger for bosons than for classical particles. Bosons can be spread randomly among m modes by a Haar-random interferometer.  If we have more than $2*n^2$ modes, the probability of bunching is at most a constant. If the number of modes is larger, we can almost guarantee events will be collision-free, so in a sense the bosonic birthday paradox is that bosons don’t bunch as much as we might have expected.

Even classical particles’ distributions can be described with permanents. Consider n balls that can fall randomly in m bins. What’s the probability of collision free outcomes? It is per(P), where $P_{ij}$ is the transition probability that a ball entering mode j will exit in mode i (this can be shown with a simple combinatorial argument). Note that in this case P has non-negative entries, and so the Jerrum-Sinclair-Vigoda (JSV) algorithm gives us a poly-time, multiplicative error estimate.

Question:  does the JSV algorithm give us a better estimate than we could get from running the experiment a poly(n) number of times? The answer is  Yes: note that if some probability was exponentially small, the JSV algorithm gives a  better estimate than the experiment, whose estimate would likely be zero due to the limited, polynomial number of runs.

Trying to solve hard problems

Can this “boson-permanent” connection help us solve hard problems?

This question was examined in 1996 by Troyansky and Tishby, who were the first to put together the two facts:

1- that the evolution of bosons is given by permanents, and

2- that permanents are #P complete.

They also stated that it was far from obvious that this would be enough to help us solve hard math problems. It’s not sufficient that the system’s evolution involves hard computations, we need to find a way to use the system to provide us solutions to well-specified, provably hard mathematical problems.

The first possible way that we could get bosons to give us a computational advantage would be to use them to directly compute permanents. Let us consider the question of how many experiment repeats we would need to obtain good estimates for permanents. Let’s take a matrix A with entries 0 and 1 only,  so unless A is a permutation, it won’t be unitary. Theorem: for any 0/1 nxn matrix A, if we take A/||A||, that re-scaled matrix can be embedded as a nxn submatrix of a 2nx2n unitary matrix. So it is possible to pick unitary U with a submatrix given by epsilon*A submatrix, for any A. Suppose I apply $\phi(U)$ to the standard initial state and find the transition probability $||^2$; by what we’ve discussed, this probability is given by $\epsilon^{2n} |perA|^2$. So we can always buid an experiment to learn per(A). The problem with this approach is that the re-scaling results in an exponentially small $\epsilon^{2n} ||A||$, and we can’t afford to repeat the experiment an exponential number of times. If permanents are large this is OK, but for these cases we can instead apply Gurvits’ algorithm. All of this was pointed out by Troyansky and Tishby.

Some more observations we can make:

1- How does the power of a boson sampling computer compares with a standard quantum computer? Boson sampling can be simulated in BQP, using standard boson simulation algorithms. To simulate n bosons, write the input state as the occupation number list S – this takes n*logn qubits. For the dynamics, note that each beam splitter acts on at most 2 s’s. Each beam splitter can then be decomposed in the chosen universal set using the Solovay-Kitaev algorithm, and we’re done simulating the boson sampling experiment.

2- It seems very unlikely we can do BQP with boson sampling. There’s no very strong complexity argument for that, but there’s an intuitive argument. Look at the Hilbert space size: in linear optics it is exponentially large, but the corresponding group of unitaries has a much smaller dimension than general quantum computation does. This argument can be formalized to show that there’s no gate-by-gate efficient translation from a circuit to boson sampling, but the possible loophole is that there could be some alternative translation scheme. As a matter of fact, it’s not even clear we can do universal classical computation with a boson sampling computer.

How can we use photonics to do anything hard for a classical computer? As pointed out by Troyansky and Tishby, direct estimation of permanents doesn’t work – if it did, $P^{\#P} = BQP$  – this would be an enormous collapse of complexity classes, implying the capability of solving NP complete problems with quantum computers, an unlikely prospect.

KLM: quantum computation with linear optics + measurements

In order to go further with a photonic computer, we could appeal to  nonlinear optics, so that we can have interactions between photons. Kerr media allows for two-qubit gates, and then arbitrary quantum circuits. Unfortunately, such non-linearities are weak and hard to implement.

In 2001 there was a breakthrough in photonic quantum computing, due to Knill, Laflamme and Milburn (which became known as the KLM scheme). In short, they’ve shown that

linear optics + adaptive measurements = BQP.

To see how this works, we start by encoding qubits in the so-called dual rail representation: we group our modes into pairs, and each pair is guaranteed to have only a single photon, thus encoding a qubit. Applying a beam splitter to each dual-rail pair results in arbitrary single-qubit gates.

How do we make these dual rail qubits interact, in order to obtain two-qubit gates? Assume we can make intermediate measurements, and do something different in the immediate future, something that may depend on the measurement outcomes (this is the adaptativity, also called feed-forward of measurement results). With that, KLM showed we can implement a two-qubit CSIGN gate. CSIGN gates together with arbitrary single-qubit gates results in universal quantum computation.

Doing adaptive measurement on photons is pretty hard because photons tend to fly away as fast as the speed of light… What if we don’t have adaptive measurements, what could we do then?

Boson sampling – exact version

Let’s step back, and forget about calculating permanents, and about implementing a universal quantum computer. Suppose that we have a linear-optical network with m input/output modes. Let us define our task to be: simulate what the network does. More precisely:

BosonSampling (exact version): given as input a mxn matrix A whose columns are orthonormal vectors. $D_A$ = distribution over all S in $\phi_{m,n}$ where prob(S) = $\frac{|per(A_S)|^2}{(s_1!*...*s_m!)}$. Task: output a sample from $D_A$.

Advantage: the linear-optical network naturally solves this problem. Doing something like this is similar to prove that dolphins are smart by observing them in their natural habitat, rather than by laboriously teaching them to do simple arithmetics.

Disadvantage: this, as far as we know today, does not solve useful problems…

The first argument showing that BosonSampling is hard follows from two facts:

Fact 1- Given an nxn matrix A of complex numbers, it’s #P-hard even to approximate its permanent, or to approximate $|per(A)|^2$ to within $1 \pm \epsilon$. The proof uses standard known tools in complexity theory. Let $A_x$ be A with the top-left entry replaced by $a_11 + x$.  Then use binary search over x, and the oracle for multiplicatively approximating the permanent, to find an x such that $|Per(A_x)|^2=0$.  By doing so (and solving a linear equation), one can reduce the calculation of an nxn permanent to the calculation of an (n-1)x(n-1) permanent — namely, the permanent of the bottom-right (n-1)x(n-1) submatrix of A.  Then one can recursively reduce that to calculating the permanent of an (n-2)x(n-2) matrix, and so on, until one reaches a 1×1 matrix, whose permanent is trivial to compute.

Fact 2- if there were a fast classical BosonSampling algorithm, then $|perA|^2$ would be approximable in $BPP^{NP}$. The proof uses standard techniques from the 1980’s, in particular a technique called “approximate counting”.

Putting these two facts together we get $BPP^{NP} = P^{\#P}$. Why is this bad? We saw yesterday that $BPP \subseteq NP^{NP}$. The proof of this implies that $BPP^{NP} \subseteq NP^{NP^{NP}}$. Toda’s theorem says that $PH \subseteq P^{\#P}$.

If $BPP^{NP} = P^{\#P}$ then $P^{\#P}=NP^{NP^{NP}}=PH$, and the polynomial hierarchy collapses to its third level.

Coming in the next talk: linear optics can be used to give an alternative proof that calculating permanents is #P-complete; even approximating boson sampling is hard (if we accept some reasonable conjectures).