E1 260: Optimization for Machine Learning and Data Science

Homework 4: Stochastic gradient methods

This homework has two parts. In the first part, there is a question on theoretical aspects of stochastic gradient descent and iterate averaging. The second part has two questions: one on understanding gradient descent, stochastic gradient descent, and its mini-batch version, and one on implementing SGD and SVRG for softmax regression on the Fashion-MNIST dataset.

Submission instructions:

  • Make a copy of the Colab .ipynb file and share the link of your Colab file with answers as instructed.
  • Use text cells to answer Part A of the homework. Type equations in LaTeX in a text cell.
  • For the programming part (Part B) of the homework, use code cells and text cells. The submitted code should run without errors; do not erase the outputs.

In [ ]:
# Name: 
# SR number:
# Program:
# Registration type (Credit/Audit):

Part A

[50 pts] Question 1. We wish to minimize the quadratic function

$$ f({\bf x}) = \frac{1}{2}{\bf x}^T{\bf P}{\bf x} - {\bf q}^T{\bf x} $$

where ${\bf P} \in \mathbb{R}^{d \times d}$ and ${\bf q} \in \mathbb{R}^d$ with ${\bf x}^\star$ being the global minimum.

Suppose we have access to a noisy oracle for the gradient $$ \tilde{\bf g}_t = \nabla f({\bf x}_t) + {\bf e}_t $$ where we assume the noise ${\bf e}_t$ is zero mean and uncorrelated, with bounded variance $\mathbb{E}[{\bf e}_t{\bf e}_t^T] \leq \sigma^2{\bf I}$. Here $\sigma^2 \geq 0$.

  1. Give the stochastic gradient iteration with a fixed step size $\eta$.
  2. Define ${\bf a}_t = {\bf x}_t - {\bf x}^\star$. Express ${\bf a}_t$ in terms of the initial error ${\bf a}_0$, ${\bf P}$, $\eta$, and the noise terms ${\bf e}_k$, $k < t$.
  3. Suppose $\bar{\bf a}_t = \frac{1}{t}\sum\limits_{k=0}^{t-1}{\bf a}_k$. Show that
$$ \mathbb{E}\{\bar{\bf a}_t^T{\bf P}\bar{\bf a}_t\} \leq \frac{1}{\eta t}\|{\bf a}_0\|^2 + \frac{\sigma^2 d}{t}. $$

Hint: You might need the inequality $\frac{(1 - (1-b)^t)^2}{bt} \leq 1$ for all $b \in [0,1]$.
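
For intuition (this is not part of the required derivation), the cell below is a minimal numerical sketch that simulates the noisy-gradient iteration on a small quadratic and compares the averaged quantity from part 3 against the claimed bound. The choices of ${\bf P}$, ${\bf q}$, $\eta$, $\sigma$, and the horizon are arbitrary, and the average is taken over the first $t$ iterates.

In [ ]:
# Minimal sketch (illustrative only): Monte-Carlo check of the bound in part 3.
# P, q, eta, sigma, T below are arbitrary choices; eta satisfies eta * lambda_max(P) <= 1.
import numpy as np

rng = np.random.default_rng(0)
d = 2
P = np.diag([1.0, 0.1])                      # a positive definite P
q = rng.standard_normal(d)
x_star = np.linalg.solve(P, q)               # global minimum of f
eta, sigma, T, n_trials = 0.5, 0.1, 200, 500

lhs = 0.0
for _ in range(n_trials):
    x = np.zeros(d)                          # x_0 = 0, so a_0 = -x_star
    a_sum = np.zeros(d)
    for t in range(T):
        a_sum += x - x_star                  # accumulate a_t = x_t - x_star
        g = P.dot(x) - q + sigma * rng.standard_normal(d)   # noisy gradient oracle
        x = x - eta * g                      # stochastic gradient step (part 1)
    a_bar = a_sum / T                        # average of the first T iterates
    lhs += a_bar.dot(P.dot(a_bar)) / n_trials

bound = np.linalg.norm(x_star) ** 2 / (eta * T) + sigma ** 2 * d / T
print("empirical E[a_bar^T P a_bar]:", lhs, "  bound:", bound)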

Solution 1.

Enter your answer here



Part B

This part has two programming questions: one on understanding stochastic gradient descent and one on its application to softmax regression.

[20 pts] Question 2.

The aim of this question is to compare gradient descent, stochastic gradient descent, and mini-batch gradient descent on the linear regression problem

$$ \underset{{\bf x}}{\text{minimize}} \quad f({\bf x}) = \frac{1}{m}\sum\limits_{i=1}^m ({\bf a}_i^T{\bf x} - {b}_i)^2. $$

Consider a simple setting with ${\bf x} \in \mathbb{R}^2$ and $m=100$.
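
For reference, with the rows ${\bf a}_i^T$ stacked in a matrix ${\bf A}$ and the entries $b_i$ collected in a vector ${\bf b}$ (named y in the code below), the full gradient of this objective is

$$ \nabla f({\bf x}) = \frac{2}{m}{\bf A}^T({\bf A}{\bf x} - {\bf b}); $$

stochastic and mini-batch gradient descent replace the average over all $m$ rows with an average over a single sampled row or a sampled batch, respectively.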

In [8]:
## Solution 2.
import numpy as np
import matplotlib.pyplot as plt

np.random.seed(42)
# data generation
A_ = 2 * np.random.rand(100, 1)
y = 4 + 3 * A_ + np.random.randn(100, 1)
A = np.c_[np.ones((100, 1)), A_]  # add x0 = 1 to each instance


# plotting data

plt.plot(A_, y, "b.")
plt.xlabel("$x_1$", fontsize=18)
plt.ylabel("$x_2$", rotation=0, fontsize=18)
plt.axis([0, 2, 0, 15])
plt.show()


# closed-form least-squares solution, for reference
np.linalg.pinv(A).dot(y)
Out[8]:
array([[4.21509616],
       [2.77011339]])

Display the iterates (e.g., see the picture below) of

  1. Gradient descent with a fixed stepsize of 0.1, for 100 iterations.
  2. Stochastic gradient descent with a varying stepsize. Use the stepsize_schedule function with eta0 = 5, eta1 = 50, and 50 iterations.
  3. Mini-batch gradient descent with a varying stepsize. Use the stepsize_schedule function with eta0 = 200, eta1 = 1000, a batch size of 20, and 50 iterations (a minimal sketch of such a loop is given after the starter code cell below).

[Example figure: iterates of the three methods plotted in the parameter plane.]

In [ ]:
# Enter your code here

def stepsize_schedule(t, eta0, eta1):
    # decaying stepsize at iteration t: eta_t = eta0 / (t + eta1)
    return eta0 / (t + eta1)
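
For reference, the cell below is a minimal sketch (not a complete solution) of how stepsize_schedule can be wired into a mini-batch loop; it assumes A and y from the data-generation cell above, uses the parameters suggested in item 3, and records the iterates so they can be plotted as in the example figure.

In [ ]:
# Minimal sketch (illustrative only): mini-batch gradient descent on
#   f(x) = (1/m) * sum_i (a_i^T x - b_i)^2
# with the decaying stepsize eta_t = eta0 / (t + eta1).
# Assumes A (m x 2) and y (m x 1) from the data-generation cell above.
m = A.shape[0]
batch_size, n_iters = 20, 50
eta0, eta1 = 200, 1000

x = np.random.randn(2, 1)       # initial iterate
path = [x.ravel().copy()]       # record the trajectory for plotting
for t in range(n_iters):
    idx = np.random.randint(m, size=batch_size)                       # sample a mini-batch
    grad = (2.0 / batch_size) * A[idx].T.dot(A[idx].dot(x) - y[idx])  # mini-batch gradient
    x = x - stepsize_schedule(t, eta0, eta1) * grad
    path.append(x.ravel().copy())
path = np.array(path)           # shape (n_iters + 1, 2); plot path[:, 0] vs path[:, 1]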

[30 pts] Question 3. Softmax regression.

Implement stochastic gradient descent, stochastic gradient descent with momentum, and stochastic variance reduced gradient (SVRG) for softmax regression (refer to the softmax regression problem from Homework 1).

We will work with the Fashion-MNIST dataset.

In [5]:
import tensorflow as tf
from tensorflow import keras
import matplotlib.pyplot as plt

fashion_mnist = keras.datasets.fashion_mnist
(X_train_full, y_train_full), (X_test, y_test) = fashion_mnist.load_data()

#Split the full training set into a validation set and a (smaller) training set. 
#We also scale the pixel intensities down to the 0-1 range and convert them to floats, by dividing by 255.

X_valid, X_train = X_train_full[:5000] / 255., X_train_full[5000:] / 255.
y_valid, y_train = y_train_full[:5000], y_train_full[5000:]
X_test = X_test / 255.

#convert the vector of class indices into a matrix containing a one-hot vector for each instance
num_classes = 10
class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
y_train_onehot = tf.keras.utils.to_categorical(y_train, num_classes)
y_test_onehot = tf.keras.utils.to_categorical(y_test, num_classes)

# Sample dataset
n_rows = 4
n_cols = 10
plt.figure(figsize=(n_cols * 1.2, n_rows * 1.2))
for row in range(n_rows):
    for col in range(n_cols):
        index = n_cols * row + col
        plt.subplot(n_rows, n_cols, index + 1)
        plt.imshow(X_train[index], cmap="binary", interpolation="nearest")
        plt.axis('off')
        plt.title(class_names[y_train[index]], fontsize=12)
plt.subplots_adjust(wspace=0.2, hspace=0.5)
plt.show()
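
Before the implementation below, it helps to fix the softmax regression loss and its gradient in one place. The cell below is a minimal sketch under the following assumptions: the images are flattened to 784-dimensional rows, the one-hot labels built above are used, the weight matrix W has shape (784, 10), and there is no bias term or regularization. The names X_train_flat, softmax, and loss_and_grad are illustrative, not prescribed by the assignment.

In [ ]:
# Minimal sketch (illustrative only): softmax regression loss and gradient on a batch.
import numpy as np

X_train_flat = X_train.reshape(len(X_train), -1)   # (55000, 784)
X_test_flat = X_test.reshape(len(X_test), -1)      # (10000, 784)

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)            # shift for numerical stability
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

def loss_and_grad(W, X, Y):
    """Average cross-entropy loss on (X, Y) and its gradient with respect to W."""
    Pr = softmax(X.dot(W))                          # predicted class probabilities
    loss = -np.mean(np.sum(Y * np.log(Pr + 1e-12), axis=1))
    grad = X.T.dot(Pr - Y) / len(X)                 # shape (784, 10)
    return loss, grad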

Implementation

Show the loss and accuracy as a function of the iteration count for

  1. Stochastic gradient descent
  2. Stochastic gradient descent with momentum
  3. Stochastic variance reduced gradient (SVRG); a structural sketch of the SVRG loop is given after these instructions.

Choose the stepsize and momentum coefficients to obtain meaningful learning curves.

Also, compare the test accuracy of the above three optimizers with the sklearn.linear_model solver that you have used in Homework 1.

Comment on your observations.
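
For item 3, the cell below is a minimal structural sketch of the SVRG outer/inner loop, written against the hypothetical loss_and_grad helper and the flattened data sketched above. The stepsize eta, the number of outer iterations, and the inner epoch length are placeholders to be tuned; this is not intended as the complete solution (momentum and the loss/accuracy tracking asked for above still need to be added).

In [ ]:
# Minimal sketch (illustrative only): SVRG loop for softmax regression.
# Assumes X_train_flat, y_train_onehot and loss_and_grad from the sketch above;
# eta, n_outer and epoch_len are placeholders to be tuned.
import numpy as np

def svrg(X, Y, eta=0.1, n_outer=10, epoch_len=2000, seed=0):
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    W = np.zeros((X.shape[1], Y.shape[1]))
    for s in range(n_outer):
        W_snap = W.copy()
        _, mu = loss_and_grad(W_snap, X, Y)              # full gradient at the snapshot
        for _ in range(epoch_len):                       # inner loop (commonly O(n) steps)
            i = rng.integers(n)
            xi, yi = X[i:i + 1], Y[i:i + 1]
            _, g = loss_and_grad(W, xi, yi)              # stochastic gradient at current W
            _, g_snap = loss_and_grad(W_snap, xi, yi)    # same sample at the snapshot
            W = W - eta * (g - g_snap + mu)              # variance-reduced update
        loss, _ = loss_and_grad(W, X, Y)
        print(f"outer iteration {s}: full training loss = {loss:.4f}")
    return W

# Example usage (hypothetical names):
# W_svrg = svrg(X_train_flat, y_train_onehot, eta=0.1, n_outer=10)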

In [ ]:
## Solution 3.
import numpy as np


## Generate data 
np.random.seed(1)
n = 3000
m = 20


# Enter your code here