ML Crash Course Notes

March 1, 2019 • ML

Training and Loss

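Training a model means finding weights and a bias that minimize loss over the labeled examples. In supervised learning, this process is called empirical risk minimization.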

Mean square error (MSE) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$ MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2$


  • $(x, y)$ is an example in which
    • $x$ is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
    • $y$ is the example's label (for example, temperature).
  • $prediction(x)$ is a function of the weights and bias in combination with the set of features.
  • $D$ is a data set containing many labeled examples, which are $(x, y)$ pairs.
  • $N$ is the number of examples in $D$.

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.
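The MSE formula can be sketched in a few lines of Python. This is a minimal illustration, not a library implementation: the names `mean_squared_error`, `toy_examples`, and `toy_predict` are made up for this note, and the toy data and the linear model's weight and bias are invented values.

```python
def mean_squared_error(examples, predict):
    """Average squared difference between labels and predictions.

    examples: list of (x, y) pairs, playing the role of the data set D.
    predict:  function mapping features x to a predicted label.
    """
    if not examples:
        raise ValueError("MSE is undefined for an empty data set")
    squared_losses = [(y - predict(x)) ** 2 for x, y in examples]
    return sum(squared_losses) / len(examples)


# Hypothetical toy data: x = chirps/minute, y = temperature.
toy_examples = [(10.0, 15.0), (20.0, 21.0), (30.0, 24.0)]


# Hypothetical linear model: prediction(x) = weight * x + bias.
def toy_predict(x, weight=0.5, bias=10.0):
    return weight * x + bias


toy_mse = mean_squared_error(toy_examples, toy_predict)
```

Here the predictions are 15, 20, and 25, so the squared losses are 0, 1, and 1, and the MSE is 2/3.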

Reducing Loss

An Iterative Approach

The cycle of moving from features and labels to models and predictions.

Figure 1. An iterative approach to training a model.

Gradient Descent


The gradient of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

$\nabla f$

For instance, if:

$f(x,y) = e^{2y}\sin(x)$

then:

$\nabla f(x,y) = \left(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y)\right) = (e^{2y}\cos(x), 2e^{2y}\sin(x))$
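One way to sanity-check a hand-derived gradient like this is to compare it with a finite-difference approximation. The sketch below does that at a single point. The helper name `numeric_grad`, the step size `h`, and the test point `(1.0, 0.5)` are arbitrary choices for illustration.

```python
import math


def f(x, y):
    return math.exp(2 * y) * math.sin(x)


def grad_f(x, y):
    """Analytic gradient taken from the formula above."""
    return (math.exp(2 * y) * math.cos(x), 2 * math.exp(2 * y) * math.sin(x))


def numeric_grad(func, x, y, h=1e-6):
    """Central-difference approximation of the two partial derivatives."""
    df_dx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    df_dy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (df_dx, df_dy)


analytic = grad_f(1.0, 0.5)
approx = numeric_grad(f, 1.0, 0.5)
```

The two tuples should agree to several decimal places. A large gap usually points to a mistake in the derivation.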

Note the following:

  • $\nabla f$ points in the direction of greatest increase of the function.
  • $-\nabla f$ points in the direction of greatest decrease of the function.

The number of dimensions in the vector is equal to the number of variables in the formula for $f$; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function $f(x, y)$:

$f(x,y) = 4 + (x - 2)^2 + 2y^2$

when viewed in three dimensions with $z = f(x,y)$ looks like a valley with a minimum at $(2, 0, 4)$:

A three-dimensional plot of $z = 4 + (x - 2)^2 + 2y^2$, which produces a paraboloid with a minimum at $(2, 0, 4)$.

To determine the next point along the loss function curve, the gradient descent algorithm moves from the starting point in the direction of the negative gradient, by some fraction of the gradient's magnitude. The learning rate sets that fraction. The following figure shows one such step:

A second point on the U-shaped curve, this one a little closer to the minimum point.

Figure 5. A gradient step moves us to the next point on the loss curve.
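As a rough sketch of repeated gradient steps, the code below runs gradient descent on the valley function $f(x,y) = 4 + (x - 2)^2 + 2y^2$ from earlier in these notes. The starting point, the learning rate of 0.1, and the step count are arbitrary illustrative values, and the function names are made up for this note.

```python
def grad_valley(x, y):
    """Gradient of f(x, y) = 4 + (x - 2)^2 + 2y^2."""
    return (2 * (x - 2), 4 * y)


def gradient_descent(start, learning_rate=0.1, steps=100):
    x, y = start
    for _ in range(steps):
        gx, gy = grad_valley(x, y)
        # Step against the gradient, scaled by the learning rate.
        x -= learning_rate * gx
        y -= learning_rate * gy
    return x, y


x_min, y_min = gradient_descent(start=(-3.0, 4.0))
```

With these settings the point ends up very close to $(2, 0)$, the location of the minimum.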

Same U-shaped curve. Lots of points are very close to each other and their trail is making extremely slow progress towards the bottom of the U.

Figure 6. Learning rate is too small.

Same U-shaped curve. This one contains very few points. The trail of points jumps clean across the bottom of the U and then jumps back over again.

Figure 7. Learning rate is too large.

Same U-shaped curve. The trail of points gets to the minimum point in about eight steps.

Figure 8. Learning rate is just right.
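The three regimes in Figures 6 to 8 can be reproduced on a toy problem. The sketch below assumes a hypothetical one-dimensional loss, `loss(w) = w**2`, and three hand-picked learning rates. None of these values come from the course.

```python
def descend(learning_rate, start=5.0, steps=8):
    """Gradient descent on the hypothetical loss(w) = w**2; returns the trail of w values."""
    trail = [start]
    w = start
    for _ in range(steps):
        gradient = 2 * w  # derivative of w**2
        w -= learning_rate * gradient
        trail.append(w)
    return trail


too_small = descend(0.001)   # barely moves away from the start
too_large = descend(0.95)    # jumps across the minimum on every step
about_right = descend(0.3)   # close to the minimum within eight steps
```

Printing each trail shows the same behavior as the figures: tiny steps, a sign flip on every step, and quick convergence.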

The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).

The ideal learning rate for 2 or more dimensions is the inverse of the Hessian (matrix of second partial derivatives).
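To see why the inverse Hessian is called ideal, consider the quadratic valley $f(x,y) = 4 + (x - 2)^2 + 2y^2$ from earlier. Its Hessian is constant and diagonal, so the sketch below can invert it entry by entry. That is an assumption specific to this function; a general Hessian would require solving a linear system. On an exactly quadratic function, one such step reaches the minimum.

```python
def grad_valley(x, y):
    """Gradient of f(x, y) = 4 + (x - 2)^2 + 2y^2."""
    return (2 * (x - 2), 4 * y)


# The Hessian of this f is constant and diagonal: [[2, 0], [0, 4]].
# Its inverse is therefore [[1/2, 0], [0, 1/4]].
INV_HESSIAN_DIAGONAL = (1 / 2, 1 / 4)


def newton_step(x, y):
    """One step scaled by the inverse Hessian instead of a scalar learning rate."""
    gx, gy = grad_valley(x, y)
    return (x - INV_HESSIAN_DIAGONAL[0] * gx, y - INV_HESSIAN_DIAGONAL[1] * gy)


after_one_step = newton_step(-3.0, 4.0)
```

Starting from $(-3, 4)$, the gradient is $(-10, 16)$, so the step moves the point by $(5, -4)$ and lands on $(2, 0)$. For non-quadratic losses the step is only locally ideal.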

Mini-batch stochastic gradient descent (mini-batch SGD) is a compromise between full-batch iteration and SGD, which uses a single example per iteration. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch iteration.
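A minimal sketch of mini-batch SGD for a one-feature linear model with squared loss is shown below. The function name, the hyperparameters (batch size 10, learning rate 0.05, 2,000 steps), and the noiseless toy data drawn from $y = 3x + 1$ are all assumptions made for illustration.

```python
import random


def minibatch_sgd(examples, batch_size=10, learning_rate=0.05, steps=2000, seed=0):
    """Fit prediction(x) = weight * x + bias by mini-batch SGD on squared loss."""
    rng = random.Random(seed)
    weight, bias = 0.0, 0.0
    for _ in range(steps):
        # Draw a random mini-batch instead of using the full data set.
        batch = rng.sample(examples, batch_size)
        grad_w = grad_b = 0.0
        for x, y in batch:
            error = (weight * x + bias) - y
            grad_w += 2 * error * x / batch_size
            grad_b += 2 * error / batch_size
        weight -= learning_rate * grad_w
        bias -= learning_rate * grad_b
    return weight, bias


# Hypothetical noiseless data from y = 3x + 1, with x in [0, 2).
toy_data = [(i / 50, 3 * (i / 50) + 1) for i in range(100)]
fitted_weight, fitted_bias = minibatch_sgd(toy_data)
```

Each step looks at only 10 of the 100 examples, yet the fitted weight and bias should end up close to 3 and 1.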

Introducing TensorFlow

The following figure shows the current hierarchy of TensorFlow toolkits:

Hierarchy of TensorFlow toolkits. Estimator API is at the top.

  • Estimator (tf.estimator): High-level, OOP API.
  • tf.layers/tf.losses/tf.metrics: Libraries for common model components.
  • TensorFlow: Lower-level APIs.
Last Modified: September 20, 2019