E1 260: Optimization for Machine Learning and Data Science

Homework 1: Theory of convex functions

This homework has two parts. In the first part, there are 10 questions on the theory of convex functions. In the second part, we will use the scikit-learn API, an off-the-shelf tool for machine learning and data science, to understand and solve three regression problems: logistic regression, softmax regression, and robust regression.

Submission instructions:

  • Make a copy of the Colab .ipynb file and share the link of your Colab file with answers as instructed.
  • Use text cells to answer Part A of the homework. Type equations in Latex in a text cell.
  • For the programming part (Part B) of the homework, use code cells and text cells. The submitted code should run without errors, and do not erase the outputs.

In [ ]:
# Name: 
# SR number:
# Program:
# Registration type (Credit/Audit):

Part A

This part has 10 questions.

[10 pts] Question 1. Show that the intersection of the solution set of a quadratic inequality

$$ \mathcal{C} = \{{\bf x} \in \mathbb{R}^n: {\bf x}^T{\bf A}{\bf x} + {\bf b}^T{\bf x} + c \leq 0\} $$


and a hyperplane defined by $${\bf g}^{T}{\bf x} + h = 0$$ is convex if there exists $\lambda \in \mathbb{R}$ such that

$${\bf A} + \lambda{\bf g}{\bf g}^T \succeq {\bf 0}.$$

Here, ${\bf A} \in \mathbb{S}^n$, ${\bf b} \in \mathbb{R}^n$, $c \in \mathbb{R}$, and ${\bf g} \neq {\bf 0}$.

Solution 1.

Enter your answer here

[10 pts] Question 2. Suppose $h: \mathbb{R}^k \rightarrow \mathbb{R}$ and $g: \mathbb{R}^n \rightarrow \mathbb{R}^k$. Then show that their composition $f = h \circ g$ is convex when

  • $h$ is convex and monotonically nondecreasing, and $g$ is convex.
  • $h$ is convex and monotonically nonincreasing, and $g$ is concave.

Show that the function $f({\bf x}) = e^{\beta{\bf x}^T{\bf Q}{\bf x}}$ is convex for $\beta \in \mathbb{R}_+$ and ${\bf Q} \succeq {\bf 0}.$

Solution 2.

Enter your answer here

[10 pts] Question 3. Show that the sum of $k$ largest eigenvalues of ${\bf X} \in \mathbb{S}^n$ is convex on $\mathbb{S}^n$.

Solution 3.

Enter your answer here

[10 pts] Question 4. Show that the solution set of a convex optimization problem

$$ \begin{aligned} \underset{\bf x}{\text{minimize}} \quad \quad &f_0({\bf x})\nonumber\\ \text{subject to} \quad &f_i({\bf x}) \leq 0, \, i=1,2,\cdots,m\nonumber\\ &{\bf a}_i^T{\bf x} = {b}_i, \, i=1,2,\cdots,p \end{aligned} $$

is convex.

Solution 4.

Enter your answer here

[10 pts] Question 5. Let $\mathcal{X} \subset \mathbb{R}^n$ be an open convex set. A convex differentiable function $f:\mathcal{X} \rightarrow \mathbb{R}$ is $L$-smooth if and only if $$ g({\bf x}) = f({\bf x}) - \frac{L}{2} \|{\bf x}\|^2 $$ is concave.

Prove the above proposition.

Solution 5.

Enter your answer here

[10 pts] Question 6. Show that there exists at most one global minimum of a strictly convex function $f$.

Solution 6.

Enter your answer here

[10 pts] Question 7. Prove the following proposition.

Strong convexity $\Rightarrow$ Strict convexity $\Rightarrow$ Convexity.

$\Rightarrow$ is read as "implies".

Solution 7.

Enter your answer here

[10 pts] Question 8.

A. Give one example of a function that is

  1. convex, but not strictly convex
  2. strictly convex, but not strongly convex

B. Is $f(x) = x^4$ $L$-smooth (i.e., is its gradient $L$-Lipschitz)?

Solution 8.

Enter your answer here

[5 pts] Question 9. Suppose $f(\cdot)$ is $\alpha$-strongly convex and $g(\cdot)$ is convex. Show that $h = f+g$ is strongly convex.

Solution 9.

Enter your answer here

[15 pts] Question 10. Show that if $f:\mathbb{R}^n \rightarrow \mathbb{R}$ is a closed and $\alpha$-strongly convex function, then its conjugate function $f^*$ is $\frac{1}{\alpha}$-smooth.

Hint: Assume $f^*$ is differentiable everywhere, with gradient $$ \nabla f^*({\bf y}) = \underset{{\bf x}}{\arg\max}\, ({\bf y}^{T}{\bf x} - f({\bf x})). $$

Find the conjugate function of

$$ f({\bf x})=\begin{cases} - \sqrt{1 - \|{\bf x}\|_2^2}, & \|{\bf x}\|_2 \leq 1\\ \infty & \text{else.} \end{cases} $$

Also, compute the strong convexity parameter of $f$ and smoothness parameter of $f^*$ to verify the above claim.

Solution 10.

Enter your answer here


Part B

This part has two programming questions: one on outlier-robust Huber regression and one on logistic regression (including its softmax extension).

Question 11. Outlier-robust Huber regression

In a linear regression problem, we are given data $({\bf x}_i,y_i)\in \mathbb{R}^n \times \mathbb{R}$, $i=1,\ldots, m$. We fit the linear model

$$\hat{y}_i = {\bf x}_i^T{\boldsymbol \beta}$$

for $i=1,\ldots, m$ with ${\boldsymbol \beta} \in \mathbb{R}^n$.

In standard linear regression, we choose ${\boldsymbol \beta}$ by solving $$ \underset{{\boldsymbol \beta}}{\text{minimize}} \quad \sum\limits_{i=1}^m r_i^2 = \sum\limits_{i=1}^m (y_i - {\bf x}_i^T{\boldsymbol \beta})^2. $$

Huber regression is a technique that is robust to outliers. In Huber regression, we replace the square function with the Huber function

$$ \phi(u) = \left\{ \begin{array}{ll} u^2 & |u| \leq M\\ 2M|u| - M^2 & |u| > M \end{array}\right. $$

where $M>0$ is the Huber threshold. That is, we solve

$$ \underset{{\boldsymbol \beta}}{\text{minimize}} \sum\limits_{i=1}^m \phi(y_i - {\bf x}_i^T{\boldsymbol \beta}) $$

for variable ${\boldsymbol \beta}.$
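For reference, the piecewise definition of $\phi$ above can be evaluated elementwise in NumPy. This is only an illustrative sketch (the function name huber and its vectorized form are our own choices, not part of the assignment):

In [ ]:
import numpy as np

def huber(u, M=1.0):
    # Huber function: quadratic for |u| <= M, linear (slope 2M) beyond M.
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= M, u**2, 2*M*np.abs(u) - M**2)

# Example: huber([-3.0, 0.5, 2.0], M=1.0) -> array([5.  , 0.25, 3.  ])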

[10 pts] Question 11a. Is $\phi(\cdot)$ $L$-smooth? If yes, what is the smoothness parameter? Show that $\phi(\cdot)$ is a convex function.

Solution 11a. Enter your answer here

[5 pts] Question 11b. Plot the Huber function $\phi(u)$ for different values of $M \in \{1,5, 1000\}$ with $u\in [-10,10]$. Compare with the plot of the quadratic function $u^2$.

In [ ]:
# Write your code here

import numpy as np
from matplotlib import pyplot as plt

# Define the Huber loss
#def Phi(u, M):

[25 pts] Question 11c.

We generate data with $m=450$ measurements and $n=300$ regressors. We randomly choose ${\boldsymbol \beta}$ and draw ${\bf x}_i \sim \mathcal N({\bf 0},{\bf I})$. We set $y_i = {\bf x}_i^T{\boldsymbol \beta} + \epsilon_i$, where $\epsilon_i \sim \mathcal N(0,1)$. To introduce outliers, with probability $p$ we replace $y_i$ with $-y_i$. We generate $50$ problem instances, with $p$ varying from $0$ to $0.15$.

  • Recover ${\boldsymbol \beta} \in \mathbb{R}^n$ from the measurements ${\bf y}\in \mathbb{R}^m$ and compare three approaches: linear regression, Huber regression, and clairvoyant regression. In clairvoyant regression, we know which measurements had their signs flipped (and can undo the flips before fitting).

  • Plot the relative error $$ \frac{\|\hat{\boldsymbol \beta} - {\boldsymbol \beta}\|_2}{\|{\boldsymbol \beta}\|_2} $$ as a function of $p$, where $\hat{\boldsymbol \beta}$ is the recovered estimate of ${\boldsymbol \beta}$.

  • Comment on your observations.

Use sklearn.linear_model for Huber regression and linear regression. A minimal usage sketch is given below.
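As a usage sketch on tiny made-up data (not the data generated below): LinearRegression and HuberRegressor share the usual fit/coef_ interface, and in HuberRegressor the epsilon parameter plays a role analogous to the Huber threshold $M$ (scikit-learn also estimates a scale jointly, so the correspondence is not exact; epsilon must be at least 1).

In [ ]:
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

# Tiny 1-D example with a single gross outlier.
rng = np.random.default_rng(0)
x = np.linspace(0.1, 1.0, 20).reshape(-1, 1)
y = 3.0 * x.ravel() + 0.05 * rng.standard_normal(20)
y[-1] = -10.0                                   # corrupt one measurement

ols = LinearRegression(fit_intercept=False).fit(x, y)
hub = HuberRegressor(epsilon=1.35, fit_intercept=False).fit(x, y)

print("OLS slope:  ", ols.coef_)    # pulled toward the outlier
print("Huber slope:", hub.coef_)    # typically much closer to 3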

In [ ]:
## Solution 11c.
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

## Generate data 
np.random.seed(1)
n = 300
m = 450
TESTS = 50

beta_true = 5*np.random.normal(size=(n,1))
X = np.random.randn(n, m)   # column i of X is the feature vector x_i, so X has shape (n, m)
Y = np.zeros((m,1))

p_vals = np.linspace(0,0.15, num=TESTS)

# MODIFY and ENTER YOUR CODE HERE

for tst, p in enumerate(p_vals):
    # Generate the sign changes: each entry is -1 with probability p, +1 otherwise.
    factor = 2*np.random.binomial(1, 1-p, size=(m,1)) - 1
    # y_i = x_i^T beta + eps_i with eps_i ~ N(0,1), then apply the sign flips.
    Y = factor*(X.T.dot(beta_true) + np.random.normal(size=(m,1)))

    # Solve standard regression


    # Solve clairvoyant regression



    # Solve Huber regression with M=1
In [ ]:
# Plot the relative error for the three methods
# for p_vals

Question 12. Logistic regression

We work with a famous dataset that contains the sepal and petal length and width of 150 iris flowers of three different species: Iris setosa, Iris versicolor, and Iris virginica.

[Figure: classification.png]

In [ ]:
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()

print(iris.DESCR)
.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Our aim is to build a classifier to detect the Iris virginica type based on the petal width and length features.

We are given data $({\bf x}_i,y_i)$, $i=1,\ldots, m$, where the ${\bf x}_i \in \mathbb{R}^n$ are feature vectors and $y_i \in \{0, 1\}$ indicates the associated Boolean class.

Our goal is to construct a linear classifier determined by the halfspace ${\bf x}^T{\boldsymbol \beta}>0$. Specifically, we predict that the $i$th instance belongs to class $1$ if ${\bf x}_i^T{\boldsymbol \beta} >0$ and to class $0$ otherwise. We find ${\boldsymbol \beta}$ by solving the maximum-likelihood problem

$$ \underset{\boldsymbol \beta}{\text{maximize}} \quad f({\boldsymbol \beta}) = \sum_{i=1}^{m} y_i {\bf x}_i^T{\boldsymbol \beta} - \log(1 + \exp ({\bf x}_i^T{\boldsymbol \beta})). $$
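To see where this objective comes from (a standard derivation, included here only for reference): modeling $\Pr(y_i=1 \mid {\bf x}_i) = \sigma({\bf x}_i^T{\boldsymbol \beta})$ with the logistic function $\sigma(t) = 1/(1+e^{-t})$, the log-likelihood of the $m$ independent observations is

$$ \sum_{i=1}^{m} \Bigl[ y_i \log \sigma({\bf x}_i^T{\boldsymbol \beta}) + (1-y_i)\log\bigl(1-\sigma({\bf x}_i^T{\boldsymbol \beta})\bigr) \Bigr] = \sum_{i=1}^{m} y_i\,{\bf x}_i^T{\boldsymbol \beta} - \log\bigl(1 + \exp({\bf x}_i^T{\boldsymbol \beta})\bigr), $$

which is exactly $f({\boldsymbol \beta})$ above.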

[10 pts] Question 12a. Show that $-f$ is a convex function of ${\boldsymbol \beta}$.

Suppose we would like to regularize the logistic regression model with an $\ell_2$ penalty, i.e., maximize the objective $g({\boldsymbol \beta})=f({\boldsymbol \beta}) - \lambda \|{\boldsymbol \beta}\|_2^2$ with $\lambda > 0$. Is the regularized problem a convex optimization problem?

Solution 12a.

Enter your solution here

[10 pts] Question 12b. Use sklearn.linear_model to train a logistic regression model.

Plot the linear decision boundary in the plane of the two features, petal length and petal width, along with the instances (i.e., Iris virginica in one halfspace and not Iris virginica in the other).
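As a pointer (not a complete solution, which should also show the instances): the fitted boundary can be read off the coef_ and intercept_ attributes of a scikit-learn LogisticRegression model; it is the line on which the linear score is zero. A minimal sketch, with all plotting choices ours:

In [ ]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X = iris["data"][:, (2, 3)]               # petal length, petal width
y = (iris["target"] == 2).astype(int)     # 1 if Iris virginica, else 0

clf = LogisticRegression().fit(X, y)

# Boundary: intercept_ + coef_[0,0]*x1 + coef_[0,1]*x2 = 0, solved for x2.
w, b = clf.coef_[0], clf.intercept_[0]
x1 = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)
x2 = -(b + w[0] * x1) / w[1]
plt.plot(x1, x2, "k--", label="decision boundary")
plt.xlabel("petal length (cm)"); plt.ylabel("petal width (cm)")
plt.legend(); plt.show()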

In [ ]:
from sklearn.linear_model import LogisticRegression
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = (iris["target"] == 2).astype(int) # 1 if Iris virginica, else 0


# Enter your code here

[20 pts] Question 12c. The extension of logistic regression that handles multiple classes is referred to as multinomial logistic regression or softmax regression.

We are given data $({\bf x}_i, {\bf y}_i)$, $i=1,2,\ldots,m$, where the one-hot vector ${\bf y}_i \in \{0,1\}^K$ indicates the class of the $i$th instance. Formulate the softmax regression problem for $K$ classes with the parameter matrix ${\boldsymbol \Theta} = [{\boldsymbol \beta}_1,{\boldsymbol \beta}_2,\cdots,{\boldsymbol \beta}_K] \in \mathbb{R}^{n \times K}$.

Is this a convex optimization problem?

[20 pts] Question 12d. Use sklearn.linear_model to train a softmax regression model.

Show the resulting decision regions with background colors. Note that the decision boundaries between any two classes are linear.

In [ ]:
from sklearn.linear_model import LogisticRegression
X = iris["data"][:, (2, 3)]  # petal length, petal width
y = iris["target"] # Unlike the previous question, there are three classes


# Enter your code here