Apply Hoeffding to figure out the probability a poll of 1000 people is more than 5% off in its estimate. Then specify the rule of thumb for picking a sample size to get within a certain error.

Here \(a_i=0,b_i=1\) since the variables are binary. Now we just plug everything in, where \(\mu\) is the true expected value.
\(\mathbb{P}[|X-\mu| \ge 50] \le 2e^{-\frac{2 \cdot 50^2}{1000}} = 2e^{-5} \approx 0.013\) 
so we are within 5% additive error with probability around 99%. 

A good rule of thumb is that if you have \(n\) independent binary random variables, you should expect an error on the estimate of the expected value of around \(\frac{1}{\sqrt{n}}\) percent. So, to get an error of 5%, compute \(\frac{1}{\sqrt{n}} = 0.05\), or \(\sqrt{n}=20, n=400\), then make \(n\) a bit bigger, which is why \(n=500\) sufficed. In general it's good to triple the value you get, but for large deviation (5% is large), it's OK to just make it a little bigger.



------
(Note that using Chernoff bounds can help give tighter results for when the \(\mathbb{E}[X_i]\) is closer to 0 or 1, but this is the best you can do when the expected value is around 0.5.)
Average Case Analysis

In average case analysis, we define a distribution over instances and then design algorithms that work with high probability for a random instance sampled from this distribution.

For example, we showed that Max Cut in the \(G(n,\frac{1}{2})\) model has an average case\(1-o(1)\) approximation. In particular, we showed that there is an algorithm which gets a guarantee of \(1-o(1)\) with probability close to 1 on a graph sampled from \(G(n,\frac{1}{2})\). 
Carathéodory's Theorem

If \(x \in \mathbb{R}^n\) is a point in a polytope \(P\), then we can write \(x\) as a convex combination of at most \(n+1\) vertices \(v_1,\dots,v_{n+1}\) of \(P\). 

In other words, we have \(x = \sum_{i=1}^{n+1} \lambda_i v_i\) where \(\sum_{i=1}^{n+1} \lambda_i = 1\) and \(\lambda_i \ge 0\) for all \(i\). 
Chebyshev's Inequality

Given a random variable \(X\), \(\mathbb{P}[|X - \mathbb{E}[X]| \ge k] \le \frac{\text{Var}(X)}{k^2}\). Unlike Markov, \(X\) does not have to be non-negative. 

 A slightly easier version to remember: \(\mathbb{P}[|X-\mathbb{E}[X]| \ge k\cdot \sigma] \le \frac{1}{k^2}\), where \(\sigma\) is the standard deviation of \(k\). In words: the probability \(X\) is more than \(k\) standard deviations away from its mean is at most \(\frac{1}{k^2}\). 
Christofides' Algorithm for Metric TSP

As discussed in class, it's enough to find a multi-set of edges that connects the graph and has even degree at every vertex, as this edge set can always be "shortcut" to a Hamiltonian cycle of no greater cost.

Find a minimum spanning tree \(T\) of the graph: this gives connectivity. Then, compute the minimum cost perfect matching on the nodes with odd degree in \(T\): this makes all the vertices even degree.

This is a \(\frac{3}{2}\) approximation.
Convex function

A function is convex if its epigraph is a convex set. The epigraph of a function is the set \(\{(a,b) \in X \times \mathbb{R} \mid r \ge f(x)\}\).

Another useful characterization is: a function is convex if its Hessian is positive semi-definite. Remember the Hessian \(\nabla^2\) is the matrix of partial derivatives with \(\nabla^2_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\). 
Convex set

A set \(S \subseteq \mathbb{R}^d\) is convex if for any two points \(a,b \in S\) and any \(\alpha \in [0,1]\), the point \(\alpha a + (1-\alpha)b\) also lies in \(S\). 

In this class, the primary example of a convex set will be a polyhedron: the intersection of finitely many half-spaces.
Covariance matrix

For a random vector \(x \in \mathbb{R}^n\), the covariance matrix of \(x\) is
\[\mathbb{E}[(x-\mathbb{E}[x])(x-\mathbb{E}[x])^T]\]
If the expected value of \(x\) is the 0 vector, this simplifies to
\[\mathbb{E}[xx^T]\]
Definition of Approximation Algorithm

An \(\alpha\)-approximation for an optimization problem is a polynomial time algorithm which produces a solution of cost within a factor of \(\alpha\) of the optimal solution for every instance.

A randomized \(\alpha\)-approximation instead produces a solution of expected cost within a factor of \(\alpha\) for every input.
Eigenvalue and eigenvector

An eigenvalue of a matrix \(A\) is a value \(\lambda \in \mathbb{R}\) so that \(Ax = \lambda x\) for some \(x \in \mathbb{R}^n\).

An eigenvector is a vector \(x \in \mathbb{R}^n\) so that there exists some \(\lambda \in \mathbb{R}\) so that \(Ax = \lambda x\). 

Therefore we naturally say \(\lambda,x\) form an eigenvalue, eigenvector pair if \(Ax=\lambda x\).
Ellipsoid method and separation oracle

Let \(P\) be a convex set. A separation oracle for \(P\) is a polynomial time procedure that given a point \(x \in \mathbb{R}^n\), either certifies that \(x \in P\) or returns a hyperplane that separates \(x\) from \(P\). In other words, it returns a vector \(a \in \mathbb{R}^n\) such that \(a^Tx > a^Ty\) for all \(y \in P\). 

The ellipsoid method is a polynomial time algorithm that can optimize over any convex set with a separation oracle. (In other words, it can solve any LP whose feasible region is convex and has a separation oracle.)

In this course, \(P\) will typically be a convex polyhedron defined by a (possibly exponential size) collection of hyperplanes, and our separation oracle will return one of these hyperplanes.
Expected Value and Linearity of Expectation

For a discrete random variable \(X\),
\[\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot \mathbb{P}(\omega)\]Alternatively, \(\mathbb{E}[X] = \sum_{x \in R_X} x \cdot \mathbb{P}[X=x]\), where \(R_X\) is the range of \(X\). This is often more useful. 

For a continuous random variable with probability density function \(f\),
\[\mathbb{E}[X] = \int_{-\infty}^\infty x \cdot f(x) dx\]Linearity of expectation means that for any variables \(X_1,\dots,X_n\), we have
\[\mathbb{E}[\sum_{i=1}^n X_i] = \sum_{i=1}^n \mathbb{E}[X]\]
Hoeffding's Inequality

Let \(X_1,\dots,X_n\) be independent random variables so that for all \(i\), we have \(a_i \le X_i \le b_i\) with probability 1. Let \(X=\sum_{i=1}^n X_i\). Then, \[\mathbb{P}[X - \mathbb{E}[X] \ge t] \le e^{-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}}\]So, we have
\[\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le 2e^{-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}}\]
Integer Linear Program and Linear Program Relaxation

An Integer Linear Program (ILP) minimizes a linear objective function over a convex polyhedron \(P\). The variables are constrained to be integers. \[\begin{align*} \min c^Tx \\ x \in P \\ x \in \mathbb{Z}_{\ge 0}^n \end{align*}\]If we instead only constrain the variables to be real numbers, we get a Linear Program (LP):\[\begin{align*} \min c^Tx \\ x \in P \\ x \in \mathbb{R}_{\ge 0}^n \end{align*}\]Note we can flip min to max by negating the entries of \(c\), so these programs are fully general. 

A natural extension is convex programs. Here we only need the feasible region to be a convex set and the objective function to be convex.
Integrality Gap

The integrality gap of an ILP and its associated LP is the worst case ratio between the ILP OPT and the LP OPT over all instances.
Law of Large Numbers

Let \(X_1,X_2,\dots, X_n\) be a sequence of independent and identically distributed random variables with \(\mathbb{E}[X_i] = \mu\). Then if \(\overline{X_n} = \frac{1}{n}\sum_{i=1}^n X_i\) is the sample mean, \(\overline{X_n} \to \mu\) as \(n \to \infty\). 

There are two ways to state this convergence. One is the weak law, which is the only one you need to know. This says that for all \(\epsilon > 0\), \(\lim_{n \to \infty} \mathbb{P}[|\overline{X_n}-\mu| > \epsilon] = 0\). 



------
In case you're curious, the other is the strong law, which is in fact slightly different. It can be stated \(\mathbb{P}[\lim_{n \to\infty} \overline{X_n} = \mu] = 1\). This means that the probability that the sequence of random variables obtained converges to \(\mu\) is 1. There can be some sequences that do not converge, for example the sequence \(0,0,0,\dots\) for independent Bernoullis with success probabablity great than 1, but the strong law says the probability of obtaining such a sequence is 0. 

The weak law does not preclude the possibility that every realization of the sequence \(X_1,X_2,\dots\) you obtain does not have \(\mu\) as a limit: it could jump up to values much larger than \(\mu\) infinitely many times. The strong law says that with probability 1 this does not occur. 
Linear algebraic view of a vertex of a polyhedron

If \(v\) is a vertex of a polyhedron in \(n\) dimensions defined by half-spaces \(Ax \ge b\), then there is a matrix \(\tilde{A}\) consisting of \(n\) linearly independent rows of \(A\) so that \(v\) is the unique solution to the equation system \(\tilde{A}x = \tilde{b}\).

It is useful to notice that the subset of rows picked in \(\tilde{A}\) must be tight constraints.
Markov's Inequality

Given a non-negative random variable \(X\), we have \(\mathbb{P}[X \ge k] \le \frac{\mathbb{E}[X]}{k}\). 

Another slightly more memorable way to write this is \(\mathbb{P}[X \ge k \cdot \mathbb{E}[X]] \le \frac{1}{k}\), or in words, the probability a non-negative random variable is at least \(k\) times its expectation is at most \(\frac{1}{k}\).
Metric space

A metric space \(M,d\) is defined by a set of elements \(M\) with a metric \(d: M \times M \to \mathbb{R}_{\ge 0}\) with the following four properties:

1. \(d(x,x) = 0\) for all \(x \in M\): all points have distance 0 to themselves
2. If \(x \not= y\) then \(d(x,y) > 0\) (although this can be made arbitrarily small so often we allow a distance of 0)
3. \(d(x,y) = d(y,x)\) for all \(x,y \in M\)
4. \(d\) obeys the triangle inequality: \(d(x,z) \le d(x,y) + d(y,z)\) for all \(x,y,z \in M\). 

A standard example of a metric space is Euclidean space, where points are tuples \((x_1,x_2) \in \mathbb{R}^2\) and \(d((x_1,x_2),(y_1,y_2)) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}\). 
Minimum spanning tree

Given a graph \(G=(V,E)\) with weights \(w_e\) on the edges, a minimum spanning tree \(T\) is a spanning tree the the smallest possible weight \(\sum_{e \in T} w_e\). 

A minimum spanning tree (MST) can be found using a greedy procedure called Kruskal's algorithm: keep adding the edge with the smallest weight to your MST as long as it does not create a cycle. Stop when every edge creates a cycle, at which point you must have a spanning tree if the original graph was connected.
PSD Matrix

A symmetric matrix \(A \in \mathbb{R}^{n \times n}\) is positive semidefinite, or PSD, if one of three equivalent conditions hold:
(1) \(A\) has no negative eigenvalues
(2) \(x^TAx \ge 0\) for all \(x \in \mathbb{R}^n\)
(3) \(A\) has a square root \(A^{1/2}\) so that \(A^{1/2}(A^{1/2})^T = A\).
PTAS, FPTAS, and APX

A polynomial time approximation scheme (PTAS) is a (\(1+\epsilon)\)-approximation for a problem that runs in polynomial time for every fixed \(\epsilon>0\). For example, a PTAS could run in time \(n^{1/\epsilon}\). 

A fully polynomial time approximation schme (FPTAS) is a \((1+\epsilon)\)-approximation for a problem that runs in time polynomial in the input size and \(\frac{1}{\epsilon}\). For example, \(\frac{n^3}{\epsilon}\). 

For maximization problems the \(1+\epsilon\) would be replaced with \(1-\epsilon\).

A problem is in APX ("approximable") if it has a constant-factor approximation algorithm. A problem is APX-Hard if it has no PTAS unless P=NP. 
Polyhedron and polytope and their matrix form

A polyhedron is the intersection of a finite number of halfspaces. A polytope is a bounded polyhedron.

We will typically write a polyhedron as the set of points \(x \in \mathbb{R}^n\) obeying the equations \(Ax \ge b\) for some matrix \(A \in \mathbb{R}^{m \times n}\) and vector \(b \in \mathbb{R}^m\). Each row of \(A\) represents one of the halfspaces of the polyhedron.
Proof that every PSD matrix \(Y\) has a distribution with covariance matrix \(Y\)

Let \(r_1,\dots,r_n\) be independent random variables where \(r_i\) mean 0 and variance 1. Then, let \(x=Y^{1/2}r\) be your random vector. Then:
\[\mathbb{E}[xx^T] = \mathbb{E}[Y^{1/2}rr^TY^{1/2}]=Y^{1/2}\mathbb{E}[rr^T]Y^{1/2}=Y\]where we used that \(\mathbb{E}[rr^T] = I\). 
Psuedopolynomial time

An algorithm runs in pseudopolynomial time if it runs in polynomial time in the input size when the numbers in the input are encoded in unary.
Rank-Nullity Theorem

Given a matrix \(A \in \mathbb{R}^{m \times n}\), 
\[\text{rank}(A) + \text{nullity}(A) = n\]where the rank of a matrix is the maximum number of linearly independent rows (or columns) and the nullity is the maximum number of linearly independent vectors \(x\) for which \(Ax=0\) (in other words, the rank of the null space of \(A\)). 
Relax and round framework

Start with an ILP whose feasible solutions are the feasible solutions for your optimization problem. Make sure the linear constraints of the ILP have a poly-time separation oracle. Then, relax the ILP to an LP, which is polynomial time solvable by the ellipsoid method. After solving this LP, round the resulting solution to a feasible integer solution, ideally without losing too much in cost (or whatever measure is of interest).

This can also be applied for more general convex programs.
Spectral Norm of a Symmetric Matrix

Given a symmetric matrix \(A \in \mathbb{R}^{n \times n}\), the spectral norm \(||A||_2\) is the value of the largest eigenvalue in absolute value. In other words \(||A||_2 = \max\{|\lambda_1|,|\lambda_n|\}\).

We also can show:
\[||A||_2 = \max_{x \in \mathbb{R}^n, x \not= 0} \frac{|x^TAx|}{x^Tx}\]so, for all \(x \in \mathbb{R}^n\), 
\[|x^TAx| \le ||A||_2 \cdot ||x||_2^2\]
Spectral Theorem

Given a symmetric matrix \(A\), there are eigenvalues \(\lambda_1,\dots,\lambda_n\) and corresponding orthonormal eigenvectors \(v_1,\dots,v_n\) such that 
\[A = \sum_{i=1}^n \lambda_i v_iv_i^T\]Where recall \(v_1,\dots,v_n\) is orthonormal if \(\langle v_i, v_j\rangle = 0\) for \(i \not= j\) and \(\langle v_i, v_i \rangle = 1\).
Variance and Variance Properties

The variance of a random variable \(X\) is \(\mathbb{E}[(X-\mathbb{E}[X])^2]\). This is also equal to the often more convenient expression \(\mathbb{E}[X^2] - \mathbb{E}[X]^2\). 

We have \(\text{Var}(\sum_{i=1}^n X_i) = \sum_{i=1}^n \text{Var}(X_i)\) if \(X_1,\dots,X_n\) are pairwise independent. Also, \(\text{Var}(aX+b) = a^2\text{Var}(X)\) for \(a,b \in \mathbb{R}\).