E1 260: Optimization for Machine Learning and Data Science

Homework 2: Unconstrained optimization (gradient, subgradient, and acceleration methods)

This homework has two parts. In the first part, there are 5 questions on theoretical aspects of unconstrained optimization. In the second part, we apply the theory developed in this module to two popular problems, namely robust regression and piecewise-linear optimization.

Submission instructions:

  • Make a copy of the Colab .ipynb file and share the link of your Colab file with answers as instructed.
  • Use text cells to answer Part A of the homework. Type equations in LaTeX in a text cell.
  • For the programming part (Part B) of the homework, use code cells and text cells. The submitted code should run without any errors; do not erase the outputs.

In [ ]:
# Name: 
# SR number:
# Program:
# Registration type (Credit/Audit):

Part A

This part has 5 questions.

[10 pts] Question 1. Consider a simple example of minimizing

$$ f(x) = x^2 $$


with gradient descent. This is an $L$-smooth convex function.

  1. Suppose we choose a fixed step size $\eta = 1/L$. What is the upper bound on the suboptimality of the iterate at step $T$? In practice, how many iterations are required for the algorithm to converge?
  2. Now suppose we incorrectly estimate $L = 4$. What happens to the above bound? How many iterations are required to achieve $f(x_T) \leq \epsilon$?

Solution 1.

Enter your answer here

[10 pts] Question 2. Prove the following.

Suppose $f$ is $\mu$-strongly convex and $L$-smooth. Then gradient descent with $\eta_t = \eta = \frac{2}{\mu+L}$ satisfies

  • $\|{\bf x}_T - {\bf x}^\star\|_2 \leq \left(\frac{k-1}{k+1}\right)^T \|{\bf x}_0 - {\bf x}^\star\|_2$
  • $f({\bf x}_T) - f({\bf x}^\star) \leq \frac{L}{2}\left(\frac{k-1}{k+1}\right)^{2T} \|{\bf x}_0 - {\bf x}^\star\|^2_2$

where $k = L/\mu$ is the condition number.

Solution 2.

Enter your answer here

[10 pts] Question 3.

Explain how to find the subgradients of the following.

  1. $f({\bf x}) = \lambda_{\rm max}({\bf A}({\bf x}))$, where ${\bf A}({\bf x}) = {\bf A}_0 + x_1{\bf A}_1+ \cdots+ x_n{\bf A}_n$ with symmetric coefficient matrices ${\bf A}_i$.
  2. $f({\bf x}) = {\rm inf}_{{\bf y} \in \mathcal{C}} \|{\bf x} - {\bf y}\|_2$, where $\mathcal{C}$ is a closed convex set.

Find the subdifferential of the following

  1. $f({\bf x}) = \underset{i=1,\ldots,m}{\max} ({\bf a}_i^T{\bf x} + b_i)$

  2. $f({\bf x}) = \|{\bf A}{\bf x} + {\bf b}\|_1$

Solution 3.

Enter your answer here

[10 pts] Question 4. Use the subgradient method to find a point in the intersection of two closed convex sets $\mathcal{C}_1$ and $\mathcal{C}_2$ by solving

$$ \text{find} \quad {\bf x} \in \mathcal{C}_1 \cap \mathcal{C}_2 \quad \Leftrightarrow \quad {\rm minimize}_{\bf x} \,\, f({\bf x}) = {\max} (f_1({\bf x}),f_2({\bf x})) $$

where $f_i({\bf x}) = {\rm inf}_{{\bf y} \in \mathcal{C}_i} \|{\bf x} - {\bf y}\|_2$ is the Euclidean distance of ${\bf x}$ to $\mathcal{C}_i.$

Show that the subgradient method with Polyak's step size rule boils down to the alternating projection method.

Solution 4.

Enter your answer here

[10 pts] Question 5. Let $f: \mathbb{R}^n \rightarrow \mathbb{R}$ be convex and recall that its conjugate is $f^*({\bf y}) = {\rm sup}_{\bf x} \{{\bf y}^T{\bf x} - f({\bf x})\}$. Prove the following:

  1. For any ${\bf y}$, we have ${\bf y}^T{\bf x} \leq f({\bf x}) +f^*({\bf y})$.
  • We have ${\bf g}^T{\bf x} = f({\bf x}) + f^*({\bf g})$ if and only if ${\bf g} \in \partial f({\bf x})$.

Also, show that for an arbitrary norm $f({\bf x}) = \|{\bf x}\|$, we have ${\bf g} \in \partial f({\bf 0})$ for any ${\bf g}$ that satisfies $\|{\bf g}\|_* \leq 1.$ Here $\|\cdot\|_*$ is the dual norm. Determine $\partial f({\bf 0})$ for $f({\bf x}) = \|{\bf x}\|_\infty.$

Solution 5.

Enter your answer here.



Part B

This part has three programming questions on gradient descent, Huber regression, and piecewise-linear optimization.

[10 pts] Question 6. Gradient descent.

The aim of this question is to understand the geometric properties of gradient descent and the heavy-ball method for the simple two-dimensional quadratic form

$$f(\mathbf{x}) = 0.1 x_1^2 + 2 x_2^2.$$

Use ${\bf x}_0 = [4,6]^T$ as an initial point.

  1. Perform gradient descent with a fixed step size of $\eta = 0.4$ and display the first 20 iterations on a contour plot, e.g., as in the figure below.

(Example figure: contour plot of $f$ with the gradient-descent iterates overlaid.)

  2. Now perform gradient descent with a step size of $\eta = 0.5$ and, to compare, also implement the heavy-ball method ${\bf x}_{k+1} = {\bf x}_{k} - \eta {\bf g}_{k} + \beta({\bf x}_{k} - {\bf x}_{k-1})$ with $\eta = 0.5$ and $\beta = 0.3$. As before, display the iterations on two contour plots, one for gradient descent and one for the heavy-ball method.

Comment on your observations.

Optional: Check different values of $\eta$ (including exact line search) and $\beta$.
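
If it helps to get started, the sketch below (for reference only, not the required solution; the helper names, plotting grid, and defaults are illustrative choices) shows how the two update rules can be implemented and overlaid on a contour plot of $f$.

In [ ]:
# Reference sketch: gradient descent and heavy ball on f(x) = 0.1*x1^2 + 2*x2^2.
import numpy as np
import matplotlib.pyplot as plt

def grad(x):
    # gradient of f(x1, x2) = 0.1*x1^2 + 2*x2^2
    return np.array([0.2 * x[0], 4.0 * x[1]])

def gradient_descent(x0, eta, iters=20):
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(iters):
        xs.append(xs[-1] - eta * grad(xs[-1]))
    return np.array(xs)

def heavy_ball(x0, eta, beta, iters=20):
    # x_{k+1} = x_k - eta*g_k + beta*(x_k - x_{k-1}), with x_{-1} = x_0
    xs = [np.asarray(x0, dtype=float), np.asarray(x0, dtype=float)]
    for _ in range(iters):
        xs.append(xs[-1] - eta * grad(xs[-1]) + beta * (xs[-1] - xs[-2]))
    return np.array(xs[1:])

# Overlay the iterates on a contour plot of f
u, v = np.meshgrid(np.linspace(-6, 6, 200), np.linspace(-7, 7, 200))
plt.contour(u, v, 0.1 * u**2 + 2 * v**2, levels=20)
traj = gradient_descent([4.0, 6.0], eta=0.4)
plt.plot(traj[:, 0], traj[:, 1], "o-")
plt.show()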

In [ ]:
import numpy as np
import matplotlib.pyplot as plt

# Enter your code here

[20 pts] Question 7. Outlier-robust Huber regression.

In a Huber regression problem, we are given data $({\bf x}_i,y_i)\in \mathbb{R}^n \times \mathbb{R}$, $i=1,\ldots, m$, and we solve

$$ \underset{{\boldsymbol \beta}}{\text{minimize}} \sum\limits_{i=1}^m \phi(y_i - {\bf x}_i^T{\boldsymbol \beta}) $$

for the variable ${\boldsymbol \beta}$, where the Huber function is defined as

$$ \phi(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq M\\ 2M|u| - M^2 & |u| > M \end{array}\right. $$

with $M>0$ the Huber threshold.

Implementation

We generate data with $m=500$ measurements and $n=300$ regressors. We randomly choose ${\boldsymbol \beta}$ and ${\bf x}_i \sim \mathcal N(0,I)$. We set $y_i = {\bf x}_i^T{\boldsymbol \beta} + \epsilon_i$, where $\epsilon_i \sim \mathcal N(0,1)$. To introduce outliers, with probability $p$ we replace $y_i$ with $-y_i$. We generate a problem instance with $p= 0.01$.

  • Implement gradient descent to recover ${\boldsymbol \beta} \in \mathbb{R}^n$ from the measurements ${\bf y}\in \mathbb{R}^m$ by solving the Huber regression problem (a sketch of the gradient computation is given after this list).

  • Plot the suboptimality gap $$ f({\boldsymbol \beta}_k) - f^\star, $$ where you can use the solution from homework 1, i.e., sklearn.linear_model.HuberRegressor, to compute $f^\star$.

  • Use a fixed step size of $1/L$.

  • Suppose now you have an incorrect estimate of $L$ and use it to compute the step size. Based on the numerical results, comment on how critical it is to use the correct value of $L$.
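
For reference, the sketch below shows one way to compute the gradient of the Huber objective and a valid smoothness constant (a minimal sketch assuming ${\bf X}$ has shape $(m, n)$ with rows ${\bf x}_i^T$, as in the data-generation cell below; the helper names are illustrative).

In [ ]:
import numpy as np

def huber_grad(beta, X, Y, M=1.0):
    # Gradient of f(beta) = sum_i phi(y_i - x_i^T beta);
    # assumes X is (m, n) and beta, Y are column vectors.
    r = Y - X @ beta                                   # residuals y_i - x_i^T beta
    # phi'(u) = 2u for |u| <= M and 2M*sign(u) otherwise
    dphi = np.where(np.abs(r) <= M, 2.0 * r, 2.0 * M * np.sign(r))
    return -X.T @ dphi

def huber_smoothness(X):
    # Since phi'' <= 2 wherever it exists, L = 2*lambda_max(X^T X) is a
    # valid smoothness constant for the objective.
    return 2.0 * np.linalg.norm(X, 2) ** 2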

In [ ]:
## Solution 7.
import numpy as np
from sklearn.linear_model import HuberRegressor

## Generate data 
np.random.seed(1)
n = 300          # number of regressors
m = 500          # number of measurements (as described above)

beta_true = 5*np.random.normal(size=(n,1))
X = np.random.randn(m, n)                      # rows are the regressors x_i^T
# y_i = x_i^T beta + eps_i; with probability p, replace y_i by -y_i
p = 0.01
Y = X @ beta_true + np.random.normal(size=(m,1))
outliers = np.random.rand(m, 1) < p
Y = np.where(outliers, -Y, Y)

# Enter your code here

[20 pts] Question 8.

Consider the problem of minimizing a piecewise linear function

$$ {\rm minimize}_{\bf x} \,\, f({\bf x}) = \underset{i=1,\ldots,m}{\max} ({\bf a}_i^T{\bf x} + b_i) $$

with variable ${\bf x} \in \mathbb{R}^d$. We implement the subgradient method with $d=20$ variables and $m=100$ terms. We generate data ${\bf a}_i$ and $b_i$ from a unit normal distribution and initialize with ${\bf x}_0 = {\bf 0}$.

Implement a subgradient method to solve the above problem. Express the above problem as a linear program and use CVXPY to find $f^{\rm opt}$ (a sketch of the subgradient step and the LP reformulation is given after the list below).

Through simulations, plot the convergence of $f^{\rm best}_k - f^{\rm opt}$ for

  1. fixed step size $\eta = 0.01$
  2. Polyak's step size rule
  3. Heavy-ball method: ${\bf x}_{k+1} = {\bf x}_{k} - \eta_k {\bf g}_{k} + \beta({\bf x}_{k} - {\bf x}_{k-1})$, where $\eta_k = 1/k$ (also try Polyak's step size) and $\beta \in [0,1]$ is fixed. Try a few values of $\beta$ and report your observations.
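
For reference, the sketch below shows a subgradient of the piecewise-linear objective, Polyak's step size rule, and the epigraph (LP) reformulation solved with CVXPY (assuming CVXPY is the intended convex solver; the helper names and the epigraph variable $t$ are illustrative).

In [ ]:
import numpy as np
import cvxpy as cp

def pwl_subgradient(x, A, b):
    # f(x) = max_i (a_i^T x + b_i); a_j is a subgradient for any maximizing index j.
    # Assumes x is a column vector of shape (d, 1) and b has shape (m, 1).
    vals = A @ x + b
    j = int(np.argmax(vals))
    return A[j].reshape(-1, 1), float(vals[j, 0])

def polyak_step(f_x, f_opt, g):
    # Polyak's rule: eta_k = (f(x_k) - f_opt) / ||g_k||^2
    return (f_x - f_opt) / (np.linalg.norm(g) ** 2)

def solve_lp(A, b):
    # Epigraph form: minimize t  subject to  A x + b <= t * 1
    d = A.shape[1]
    x = cp.Variable((d, 1))
    t = cp.Variable()
    prob = cp.Problem(cp.Minimize(t), [A @ x + b <= t])
    prob.solve()
    return x.value, prob.value
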
In [ ]:
## Generate data 
np.random.seed(1)
d = 20      # number of variables
m = 100     # number of terms

A = np.random.randn(m, d)
b = np.random.randn(m, 1)


# Enter your code here