Algorithms with predictions

In the algorithms with predictions framework, there is a machine learning model that predicts some unknown property, for example the future inputs of an online problem or the approximate location of an item in a binary search instance.

An algorithm is \(\beta\) robust if the competitive ratio or approximation ratio algorithm performs no worse than \(\beta\) no matter the quality of the predictions. It is \(\alpha\) consistent if as the prediction error tends to 0, the algorithm achieves a ratio of \(\alpha\). 
An example of a 2-universal hash family

Given a universe \(U\) and a range \([k]\), pick a prime \(p \ge |U|\). To generate our random hash function \(h:U \to [k]\), pick \(a \in \{1,\dots,p-1\}\) and \(b \in [p]\) independetly and uniformly at random. Let \(f_{a,b}(x) = ax+b \pmod{p}\). Then our random hash function is \(h_{a,b}(x) = f_{a,b}(x) \pmod{k}\). 
Apply Hoeffding to figure out the probability a poll of 1000 people is more than 5% off in its estimate. Then specify the rule of thumb for picking a sample size to get within a certain error.

Here \(a_i=0,b_i=1\) since the variables are binary. Now we just plug everything in, where \(\mu\) is the true expected value.
\(\mathbb{P}[|X-\mu| \ge 50] \le 2e^{-\frac{2 \cdot 50^2}{1000}} = 2e^{-5} \approx 0.013\) 
so we are within 5% additive error with probability around 99%. 

A good rule of thumb is that if you have \(n\) independent binary random variables, you should expect an error on the estimate of the expected value of around \(\frac{1}{\sqrt{n}}\) percent. So, to get an error of 5%, compute \(\frac{1}{\sqrt{n}} = 0.05\), or \(\sqrt{n}=20, n=400\), then make \(n\) a bit bigger, which is why \(n=500\) sufficed. In general it's good to triple the value you get, but for large deviation (5% is large), it's OK to just make it a little bigger.



------
(Note that using Chernoff bounds can help give tighter results for when the \(\mathbb{E}[X_i]\) is closer to 0 or 1, but this is the best you can do when the expected value is around 0.5.)
Average Case Analysis

In average case analysis, we define a distribution over instances and then design algorithms that work with high probability for a random instance sampled from this distribution.

For example, we showed that Max Cut in the \(G(n,\frac{1}{2})\) model has an average case\(1-o(1)\) approximation. In particular, we showed that there is an algorithm which gets a guarantee of \(1-o(1)\) with probability close to 1 on a graph sampled from \(G(n,\frac{1}{2})\). 
Carathéodory's Theorem

If \(x \in \mathbb{R}^n\) is a point in a polytope \(P\), then we can write \(x\) as a convex combination of at most \(n+1\) vertices \(v_1,\dots,v_{n+1}\) of \(P\). 

In other words, we have \(x = \sum_{i=1}^{n+1} \lambda_i v_i\) where \(\sum_{i=1}^{n+1} \lambda_i = 1\) and \(\lambda_i \ge 0\) for all \(i\). 
Chebyshev's Inequality

Given a random variable \(X\), \(\mathbb{P}[|X - \mathbb{E}[X]| \ge k] \le \frac{\text{Var}(X)}{k^2}\). Unlike Markov, \(X\) does not have to be non-negative. 

 A slightly easier version to remember: \(\mathbb{P}[|X-\mathbb{E}[X]| \ge k\cdot \sigma] \le \frac{1}{k^2}\), where \(\sigma\) is the standard deviation of \(k\). In words: the probability \(X\) is more than \(k\) standard deviations away from its mean is at most \(\frac{1}{k^2}\). 
Christofides' Algorithm for Metric TSP

As discussed in class, it's enough to find a multi-set of edges that connects the graph and has even degree at every vertex, as this edge set can always be "shortcut" to a Hamiltonian cycle of no greater cost.

Find a minimum spanning tree \(T\) of the graph: this gives connectivity. Then, compute the minimum cost perfect matching on the nodes with odd degree in \(T\): this makes all the vertices even degree.

This is a \(\frac{3}{2}\) approximation.
Competitive Ratio of an Online Algorithm

The competitive ratio of a (possibly randomized) online algorithm is the worst case ratio over all instances of the expected value of the algorithm's performance divided by the value of the offline optimal solution.
Convex function

A function is convex if its epigraph is a convex set. The epigraph of a function is the set \(\{(a,b) \in X \times \mathbb{R} \mid r \ge f(x)\}\).

Another useful characterization is: a function is convex if its Hessian is positive semi-definite. Remember the Hessian \(\nabla^2\) is the matrix of partial derivatives with \(\nabla^2_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}\). 
Convex set

A set \(S \subseteq \mathbb{R}^d\) is convex if for any two points \(a,b \in S\) and any \(\alpha \in [0,1]\), the point \(\alpha a + (1-\alpha)b\) also lies in \(S\). 

In this class, the primary example of a convex set will be a polyhedron: the intersection of finitely many half-spaces.
Covariance matrix

For a random vector \(x \in \mathbb{R}^n\), the covariance matrix of \(x\) is
\[\mathbb{E}[(x-\mathbb{E}[x])(x-\mathbb{E}[x])^T]\]
If the expected value of \(x\) is the 0 vector, this simplifies to
\[\mathbb{E}[xx^T]\]
Definition of Approximation Algorithm

An \(\alpha\)-approximation for an optimization problem is a polynomial time algorithm which produces a solution of cost within a factor of \(\alpha\) of the optimal solution for every instance.

A randomized \(\alpha\)-approximation instead produces a solution of expected cost within a factor of \(\alpha\) for every input.
Dual of a Linear Program

Given a linear program:
\[\begin{align*} \max c^Tx \\ Ax \le b \\ x \in \mathbb{R}_{\ge 0}^n \end{align*}\]
The dual of that program is the result of adding up copies of every constraint in \(A\). For each constraint \(A_ix \le b\), where \(A_i\) is the \(i\)th row of \(A\), we create a variable \(y_i\) indicating that we add \(y_i\) copies of this constraint. This will be one big equation with a righthand side of \(b^Ty\), so to get the best upper bound on the primal, we want to minimize this bound. For it to be a valid bound, we need to have a coefficient of at least \(c\) for every variable in the lefthand side of this equation. 

So, the dual of that linear program is:
\[\begin{align*} \min b^Ty \\ A^Ty \ge c \\ y \in \mathbb{R}_{\ge 0}^n \end{align*}\]
Eigenvalue and eigenvector

An eigenvalue of a matrix \(A\) is a value \(\lambda \in \mathbb{R}\) so that \(Ax = \lambda x\) for some \(x \in \mathbb{R}^n\).

An eigenvector is a vector \(x \in \mathbb{R}^n\) so that there exists some \(\lambda \in \mathbb{R}\) so that \(Ax = \lambda x\). 

Therefore we naturally say \(\lambda,x\) form an eigenvalue, eigenvector pair if \(Ax=\lambda x\).
Ellipsoid method and separation oracle

Let \(P\) be a convex set. A separation oracle for \(P\) is a polynomial time procedure that given a point \(x \in \mathbb{R}^n\), either certifies that \(x \in P\) or returns a hyperplane that separates \(x\) from \(P\). In other words, it returns a vector \(a \in \mathbb{R}^n\) such that \(a^Tx > a^Ty\) for all \(y \in P\). 

The ellipsoid method is a polynomial time algorithm that can optimize over any convex set with a separation oracle. (In other words, it can solve any LP whose feasible region is convex and has a separation oracle.)

In this course, \(P\) will typically be a convex polyhedron defined by a (possibly exponential size) collection of hyperplanes, and our separation oracle will return one of these hyperplanes.
Expected Value and Linearity of Expectation

For a discrete random variable \(X\),
\[\mathbb{E}[X] = \sum_{\omega \in \Omega} X(\omega) \cdot \mathbb{P}(\omega)\]Alternatively, \(\mathbb{E}[X] = \sum_{x \in R_X} x \cdot \mathbb{P}[X=x]\), where \(R_X\) is the range of \(X\). This is often more useful. 

For a continuous random variable with probability density function \(f\),
\[\mathbb{E}[X] = \int_{-\infty}^\infty x \cdot f(x) dx\]Linearity of expectation means that for any variables \(X_1,\dots,X_n\), we have
\[\mathbb{E}[\sum_{i=1}^n X_i] = \sum_{i=1}^n \mathbb{E}[X]\]
Hoeffding's Inequality

Let \(X_1,\dots,X_n\) be independent random variables so that for all \(i\), we have \(a_i \le X_i \le b_i\) with probability 1. Let \(X=\sum_{i=1}^n X_i\). Then, \[\mathbb{P}[X - \mathbb{E}[X] \ge t] \le e^{-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}}\]So, we have
\[\mathbb{P}[|X - \mathbb{E}[X]| \ge t] \le 2e^{-\frac{2t^2}{\sum_{i=1}^n(b_i-a_i)^2}}\]
Integer Linear Program and Linear Program Relaxation

An Integer Linear Program (ILP) minimizes a linear objective function over a convex polyhedron \(P\). The variables are constrained to be integers. \[\begin{align*} \min c^Tx \\ x \in P \\ x \in \mathbb{Z}_{\ge 0}^n \end{align*}\]If we instead only constrain the variables to be real numbers, we get a Linear Program (LP):\[\begin{align*} \min c^Tx \\ x \in P \\ x \in \mathbb{R}_{\ge 0}^n \end{align*}\]Note we can flip min to max by negating the entries of \(c\), so these programs are fully general. 

A natural extension is convex programs. Here we only need the feasible region to be a convex set and the objective function to be convex.
Integrality Gap

The integrality gap of an ILP and its associated LP is the worst case ratio between the ILP OPT and the LP OPT over all instances.
Johnson Lindenstrauss Lemma

The JL lemma shows given \(n\) points in \(\mathbb{R}^d\), if we only care about their \(\ell_2\) distances (and up to a small approximation factor \(\epsilon\)), we can project them down to points in \(\mathbb{R}^k\) for \(k \in O(\log(n)/\epsilon^2)\).

Formally, given \(x_1,\dots,x_n \in \mathbb{R}^d\) and \(\epsilon > 0\), we can choose \(k \in O(\log(n)/\epsilon^2)\) and a matrix \(P \in \mathbb{R}^{k \times d}\) so that if \(y_i=Px_i\) for all \(1 \le i \le n\), the points \(y_1,\dots,y_n\) approximately preserve all the \(\ell_2\) distances of \(x_1,\dots,x_n\) so that for all \(i,j\) we have:
\[(1-\epsilon)\|x_i-x_j\|_2 \le \|y_i-y_j\|_2\le (1+\epsilon)\|x_i-x_j\|_2\]
Law of Large Numbers

Let \(X_1,X_2,\dots, X_n\) be a sequence of independent and identically distributed random variables with \(\mathbb{E}[X_i] = \mu\). Then if \(\overline{X_n} = \frac{1}{n}\sum_{i=1}^n X_i\) is the sample mean, \(\overline{X_n} \to \mu\) as \(n \to \infty\). 

There are two ways to state this convergence. One is the weak law, which is the only one you need to know. This says that for all \(\epsilon > 0\), \(\lim_{n \to \infty} \mathbb{P}[|\overline{X_n}-\mu| > \epsilon] = 0\). 



------
In case you're curious, the other is the strong law, which is in fact slightly different. It can be stated \(\mathbb{P}[\lim_{n \to\infty} \overline{X_n} = \mu] = 1\). This means that the probability that the sequence of random variables obtained converges to \(\mu\) is 1. There can be some sequences that do not converge, for example the sequence \(0,0,0,\dots\) for independent Bernoullis with success probabablity great than 1, but the strong law says the probability of obtaining such a sequence is 0. 

The weak law does not preclude the possibility that every realization of the sequence \(X_1,X_2,\dots\) you obtain does not have \(\mu\) as a limit: it could jump up to values much larger than \(\mu\) infinitely many times. The strong law says that with probability 1 this does not occur. 
Linear algebraic view of a vertex of a polyhedron

If \(v\) is a vertex of a polyhedron in \(n\) dimensions defined by half-spaces \(Ax \ge b\), then there is a matrix \(\tilde{A}\) consisting of \(n\) linearly independent rows of \(A\) so that \(v\) is the unique solution to the equation system \(\tilde{A}x = \tilde{b}\).

It is useful to notice that the subset of rows picked in \(\tilde{A}\) must be tight constraints.
Locality sensitive hashing

Given a radius \(r\), a distance function \(d\), an approximation factor \(\alpha \ge 1\), and probabilities \(p_{close},p_{far}\), we say a distribution over hash functions \(h:\mathbb{R}^d \to \mathbb{Z}\) is \(\rho\)-locality sensitive if:
* For all \(x,y \in \mathbb{R}^d\) with \(d(x,y) \le r\) we have \(\mathbb{P}[h(x)=h(y)] \ge p_{close}\)
* For all \(x,y \in \mathbb{R}^d\) with \(d(x,y) \ge \alpha r\) we have \(\mathbb{P}[h(x)=h(y)] \le p_{far}\)
for \(p_{close} > p_{far}\) with \(p_{close} = p_{far}^\rho\). 

By creating many copies of an LSH, it can be used to solve the radius-constrained approximate nearest neighbors problem, where you want to find a neighbor of distance \(\alpha r\) if one exists within distance \(r\). (This radius-constrained version can be in turn used to solve the approximate nearest neighbors problem.)
Markov's Inequality

Given a non-negative random variable \(X\), we have \(\mathbb{P}[X \ge k] \le \frac{\mathbb{E}[X]}{k}\). 

Another slightly more memorable way to write this is \(\mathbb{P}[X \ge k \cdot \mathbb{E}[X]] \le \frac{1}{k}\), or in words, the probability a non-negative random variable is at least \(k\) times its expectation is at most \(\frac{1}{k}\).
Metric space

A metric space \(M,d\) is defined by a set of elements \(M\) with a metric \(d: M \times M \to \mathbb{R}_{\ge 0}\) with the following four properties:

1. \(d(x,x) = 0\) for all \(x \in M\): all points have distance 0 to themselves
2. If \(x \not= y\) then \(d(x,y) > 0\) (although this can be made arbitrarily small so often we allow a distance of 0)
3. \(d(x,y) = d(y,x)\) for all \(x,y \in M\)
4. \(d\) obeys the triangle inequality: \(d(x,z) \le d(x,y) + d(y,z)\) for all \(x,y,z \in M\). 

A standard example of a metric space is Euclidean space, where points are tuples \((x_1,x_2) \in \mathbb{R}^2\) and \(d((x_1,x_2),(y_1,y_2)) = \sqrt{(x_1-y_1)^2 + (x_2-y_2)^2}\). 
Minimum spanning tree

Given a graph \(G=(V,E)\) with weights \(w_e\) on the edges, a minimum spanning tree \(T\) is a spanning tree the the smallest possible weight \(\sum_{e \in T} w_e\). 

A minimum spanning tree (MST) can be found using a greedy procedure called Kruskal's algorithm: keep adding the edge with the smallest weight to your MST as long as it does not create a cycle. Stop when every edge creates a cycle, at which point you must have a spanning tree if the original graph was connected.
Nearest neighbor problem and the naïve algorithm

In the nearest neighbor problem we have a set of \(n\) points \(S \subseteq \mathbb{R}^d\) and a distance function \(d: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}_{\ge 0}\). Given an input \(x\), the goal is to find the point \(y \in S\) which minimizes \(d(x,y)\). 

The naïve algorithm is to iterate over all \(n\) points and record the closest one. However this is too slow if \(n\) is huge.
PSD Matrix

A symmetric matrix \(A \in \mathbb{R}^{n \times n}\) is positive semidefinite, or PSD, if one of three equivalent conditions hold:
(1) \(A\) has no negative eigenvalues
(2) \(x^TAx \ge 0\) for all \(x \in \mathbb{R}^n\)
(3) \(A\) has a square root \(A^{1/2}\) so that \(A^{1/2}(A^{1/2})^T = A\).
PTAS, FPTAS, and APX

A polynomial time approximation scheme (PTAS) is a (\(1+\epsilon)\)-approximation for a problem that runs in polynomial time for every fixed \(\epsilon>0\). For example, a PTAS could run in time \(n^{1/\epsilon}\). 

A fully polynomial time approximation schme (FPTAS) is a \((1+\epsilon)\)-approximation for a problem that runs in time polynomial in the input size and \(\frac{1}{\epsilon}\). For example, \(\frac{n^3}{\epsilon}\). 

For maximization problems the \(1+\epsilon\) would be replaced with \(1-\epsilon\).

A problem is in APX ("approximable") if it has a constant-factor approximation algorithm. A problem is APX-Hard if it has no PTAS unless P=NP. 
Polyhedron and polytope and their matrix form

A polyhedron is the intersection of a finite number of halfspaces. A polytope is a bounded polyhedron.

We will typically write a polyhedron as the set of points \(x \in \mathbb{R}^n\) obeying the equations \(Ax \ge b\) for some matrix \(A \in \mathbb{R}^{m \times n}\) and vector \(b \in \mathbb{R}^m\). Each row of \(A\) represents one of the halfspaces of the polyhedron.
Primal dual algorithm

In a primal dual algorithm, you iteratively construct a primal integral solution \(x\). Whenever you add value to \(x\), you add the same amount of value to a dual solution \(y\). You then guarantee that at the end of the algorithm, \(\alpha y\) is a feasible dual solution. This ensures your algorithm is \(\alpha\) competitive (if it's an online problem) or \(\alpha\) approximate (for a normal approximation algorithm). 
Proof that every PSD matrix \(Y\) has a distribution with covariance matrix \(Y\)

Let \(r_1,\dots,r_n\) be independent random variables where \(r_i\) mean 0 and variance 1. Then, let \(x=Y^{1/2}r\) be your random vector. Then:
\[\mathbb{E}[xx^T] = \mathbb{E}[Y^{1/2}rr^TY^{1/2}]=Y^{1/2}\mathbb{E}[rr^T]Y^{1/2}=Y\]where we used that \(\mathbb{E}[rr^T] = I\). 
Prove that the probability a random point in \(B_r^d\) lies in \(B_{(1-\epsilon)}^d\) goes to 0 as \(d \to \infty\)

In other words, we are showing that as \(d \to \infty\), the mass of the ball moves towards its boundary. Here we use that the volume formula for a ball in \(d\) dimensions with radius \(r\) is a constant depending on \(d\) times \(r^d\). 

For the proof, we just need to show that the ratio of the volume of the \((1-\epsilon)r\) ball over the \(r\) ball goes to 0. And it does:
\[\frac{\text{Vol}(B_{(1-\epsilon)r}^d)}{\text{Vol}(B_r^d)} = \frac{((1-\epsilon)r)^d}{r^d} = (1-\epsilon)^d\]
Psuedopolynomial time

An algorithm runs in pseudopolynomial time if it runs in polynomial time in the input size when the numbers in the input are encoded in unary.
Rank-Nullity Theorem

Given a matrix \(A \in \mathbb{R}^{m \times n}\), 
\[\text{rank}(A) + \text{nullity}(A) = n\]where the rank of a matrix is the maximum number of linearly independent rows (or columns) and the nullity is the maximum number of linearly independent vectors \(x\) for which \(Ax=0\) (in other words, the rank of the null space of \(A\)). 
Relax and round framework

Start with an ILP whose feasible solutions are the feasible solutions for your optimization problem. Make sure the linear constraints of the ILP have a poly-time separation oracle. Then, relax the ILP to an LP, which is polynomial time solvable by the ellipsoid method. After solving this LP, round the resulting solution to a feasible integer solution, ideally without losing too much in cost (or whatever measure is of interest).

This can also be applied for more general convex programs.
Spectral Norm of a Symmetric Matrix

Given a symmetric matrix \(A \in \mathbb{R}^{n \times n}\), the spectral norm \(||A||_2\) is the value of the largest eigenvalue in absolute value. In other words \(||A||_2 = \max\{|\lambda_1|,|\lambda_n|\}\).

We also can show:
\[||A||_2 = \max_{x \in \mathbb{R}^n, x \not= 0} \frac{|x^TAx|}{x^Tx}\]so, for all \(x \in \mathbb{R}^n\), 
\[|x^TAx| \le ||A||_2 \cdot ||x||_2^2\]
Spectral Theorem

Given a symmetric matrix \(A\), there are eigenvalues \(\lambda_1,\dots,\lambda_n\) and corresponding orthonormal eigenvectors \(v_1,\dots,v_n\) such that 
\[A = \sum_{i=1}^n \lambda_i v_iv_i^T\]Where recall \(v_1,\dots,v_n\) is orthonormal if \(\langle v_i, v_j\rangle = 0\) for \(i \not= j\) and \(\langle v_i, v_i \rangle = 1\).
The 2-universal property of hash functions

Let \(h: U \to [k]\) be a random hash function. Then \(h\) is 2-universal if over the randomness of \(h\), for any \(x \not= y \in U\), we have \(\mathbb{P}[h(x)=h(y)] \le \frac{1}{k}\). 

This is weaker than the pairwise independence property, which would require that for any \(x \not=y \in U\) and any \(a,b \in [k]\) we have \(\mathbb{P}[h(x)=a,h(y)=b] = \frac{1}{k^2}\). 
The Secretary Problem

In the secretary problem, there are \(n\) candidates for a job. Each candidate \(i\) has some value \(v_i\), where we can assume the values are unique for simplicity. 

Upon interviewing a candidate, we can either hire them or permanently pass on their application. The candidates will arrive in a uniformly random order. The question is: with what probability can we ensure we select the best candidate? And the answer is \(1/e\), or about 37%, as \(n \to \infty\). The strategy that works is to wait until you have seen \(n/e\) candidates, and then take the first candidate that is the best of everyone you have seen so far.
Variance and Variance Properties

The variance of a random variable \(X\) is \(\mathbb{E}[(X-\mathbb{E}[X])^2]\). This is also equal to the often more convenient expression \(\mathbb{E}[X^2] - \mathbb{E}[X]^2\). 

We have \(\text{Var}(\sum_{i=1}^n X_i) = \sum_{i=1}^n \text{Var}(X_i)\) if \(X_1,\dots,X_n\) are pairwise independent. Also, \(\text{Var}(aX+b) = a^2\text{Var}(X)\) for \(a,b \in \mathbb{R}\).