# Training and Loss

**Training** a model means finding weight and bias values that yield low loss across all labeled examples; this process is called **empirical risk minimization**.

**Mean square error** (**MSE**) is the average squared loss per example over the whole dataset. To calculate MSE, sum up all the squared losses for individual examples and then divide by the number of examples:

$ MSE = \frac{1}{N} \sum_{(x,y)\in D} (y - prediction(x))^2$

where:

- $(x, y)$ is an example in which
  - $x$ is the set of features (for example, chirps/minute, age, gender) that the model uses to make predictions.
  - $y$ is the example's label (for example, temperature).

- $prediction(x)$ is a function of the weights and bias in combination with the set of features.
- $D$ is a data set containing many labeled examples, which are $(x, y)$ pairs.
- $N$ is the number of examples in $D$.
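
As a concrete illustration, here is a minimal sketch of the MSE calculation in plain Python; the `predict` function and the small chirps/temperature dataset are made-up placeholders, not part of the original text:

```python
def predict(x, weight, bias):
    """A simple linear model: one feature, one weight, one bias."""
    return weight * x + bias

def mse(features, labels, weight, bias):
    """Mean squared error: average squared loss over all examples."""
    squared_losses = [
        (y - predict(x, weight, bias)) ** 2
        for x, y in zip(features, labels)
    ]
    return sum(squared_losses) / len(squared_losses)

# Hypothetical dataset: chirps/minute as the feature, temperature as the label.
chirps = [20.0, 16.0, 19.8, 18.4]
temps = [88.6, 71.6, 93.3, 84.3]

print(mse(chirps, temps, weight=3.0, bias=25.0))
```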

Although MSE is commonly used in machine learning, it is neither the only practical loss function nor the best loss function for all circumstances.

# Reducing Loss

#### An Iterative Approach

**Figure 1. An iterative approach to training a model.**
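
Figure 1 is not reproduced here; roughly, the loop it depicts (make predictions, measure the loss, adjust the parameters, repeat until the loss stops improving) might be sketched like this, with the loss function and update rule left as placeholders since the figure does not pin them down:

```python
def train(features, labels, compute_loss, update_parameters, initial_params,
          max_iterations=1000, tolerance=1e-6):
    """Generic iterative training loop: predict, measure loss, update, repeat.

    `compute_loss` and `update_parameters` stand in for whatever loss function
    and update rule (for example, gradient descent) the model actually uses.
    """
    params = initial_params
    previous_loss = float("inf")
    for _ in range(max_iterations):
        loss = compute_loss(features, labels, params)
        # Stop once the loss has (approximately) converged.
        if abs(previous_loss - loss) < tolerance:
            break
        params = update_parameters(features, labels, params)
        previous_loss = loss
    return params
```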

#### Gradient Descent

**Gradients**

The **gradient** of a function, denoted as follows, is the vector of partial derivatives with respect to all of the independent variables:

$\nabla f$

For instance, if:

$f(x,y) = e^{2y}\sin(x)$

then:

$\nabla f(x,y) = \left(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y)\right) = (e^{2y}\cos(x), 2e^{2y}\sin(x))$
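
As an optional check, these partial derivatives can be verified numerically with central differences; the evaluation point and step size `h` below are arbitrary choices for illustration:

```python
import math

def f(x, y):
    return math.exp(2 * y) * math.sin(x)

def numerical_gradient(x, y, h=1e-6):
    """Central-difference approximation of (df/dx, df/dy)."""
    df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return df_dx, df_dy

x, y = 1.0, 0.5
print(numerical_gradient(x, y))
# Analytic gradient for comparison: (e^{2y} cos x, 2 e^{2y} sin x)
print(math.exp(2 * y) * math.cos(x), 2 * math.exp(2 * y) * math.sin(x))
```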

Note the following:

| Expression | Meaning |
|---|---|
| $\nabla f$ | Points in the direction of greatest increase of the function. |
| $- \nabla f$ | Points in the direction of greatest decrease of the function. |

The number of dimensions in the vector is equal to the number of variables in the formula for $f$; in other words, the vector falls within the domain space of the function. For instance, the graph of the following function $f(x, y)$:

$f(x,y) = 4 + (x - 2)^2 + 2y^2$

when viewed in three dimensions with $z = f(x,y)$ looks like a valley with a minimum at $(2, 0, 4)$.
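
Differentiating this function gives its gradient, a two-dimensional vector (one component per variable) that vanishes at the minimum $(2, 0)$:

$\nabla f(x,y) = \left(\frac{\partial f}{\partial x}(x,y), \frac{\partial f}{\partial y}(x,y)\right) = (2(x - 2), 4y)$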

To determine the next point along the loss curve, the gradient descent algorithm moves from the starting point by some fraction of the gradient's magnitude, stepping in the direction of the negative gradient, as shown in the following figure:

**Figure 5. A gradient step moves us to the next point on the loss curve.**
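
A minimal sketch of that step for a single weight; the quadratic loss, its gradient, and the learning rate below are illustrative choices, not part of the original text:

```python
def gradient_descent_step(weight, gradient, learning_rate):
    """Move the weight a fraction of the gradient in the negative direction."""
    return weight - learning_rate * gradient

# Example: the one-dimensional loss L(w) = (w - 3)^2 has gradient 2(w - 3).
def loss_gradient(w):
    return 2 * (w - 3)

w = 0.0
learning_rate = 0.1  # arbitrary choice for illustration
for _ in range(25):
    w = gradient_descent_step(w, loss_gradient(w), learning_rate)
print(w)  # approaches the minimum at w = 3
```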

**Figure 6. Learning rate is too small.**

**Figure 7. Learning rate is too large.**

**Figure 8. Learning rate is just right.**

The ideal learning rate in one dimension is $\frac{1}{f''(x)}$ (the inverse of the second derivative of $f(x)$ at $x$).

The ideal learning rate for two or more dimensions is the inverse of the Hessian (the matrix of second partial derivatives).
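
For example, for the quadratic loss $L(w) = (w - 3)^2$ sketched above, the second derivative is $L''(w) = 2$, so the ideal learning rate is $\frac{1}{2}$; a single step from any starting point lands exactly on the minimum:

$w_{\text{new}} = w - \frac{1}{L''(w)}\,L'(w) = w - \frac{1}{2} \cdot 2(w - 3) = 3$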

**Mini-batch stochastic gradient descent** (**mini-batch SGD**) is a compromise between full-batch iteration and SGD. A mini-batch is typically between 10 and 1,000 examples, chosen at random. Mini-batch SGD reduces the amount of noise in SGD but is still more efficient than full-batch.
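
A rough sketch of how mini-batch SGD differs from full-batch iteration; the batch size, step count, and the `compute_gradient` function are illustrative placeholders (the parameters and gradient are assumed to be numbers or NumPy arrays):

```python
import random

def minibatch_sgd(examples, compute_gradient, params, learning_rate=0.01,
                  batch_size=100, num_steps=1000):
    """Each step estimates the gradient from a small random sample of examples."""
    for _ in range(num_steps):
        batch = random.sample(examples, batch_size)
        gradient = compute_gradient(batch, params)
        params = params - learning_rate * gradient
    return params
```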

# Introducing TensorFlow

The following figure shows the current hierarchy of TensorFlow toolkits:

| Toolkit(s) | Description |
|---|---|
| Estimator (tf.estimator) | High-level, OOP API. |
| tf.layers/tf.losses/tf.metrics | Libraries for common model components. |
| TensorFlow | Lower-level APIs. |
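
As a hedged illustration of the Estimator layer, assuming the TensorFlow 1.x Estimator API that this hierarchy describes; the feature name and training data below are made up:

```python
import numpy as np
import tensorflow as tf

# Define the feature columns the model will use.
feature_columns = [tf.feature_column.numeric_column("chirps_per_minute")]

# High-level, OOP API: a prebuilt linear regression Estimator.
estimator = tf.estimator.LinearRegressor(feature_columns=feature_columns)

# Hypothetical training data fed through a NumPy input function (TF 1.x).
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"chirps_per_minute": np.array([20.0, 16.0, 19.8, 18.4])},
    y=np.array([88.6, 71.6, 93.3, 84.3]),
    batch_size=4,
    num_epochs=None,
    shuffle=True,
)

estimator.train(input_fn=train_input_fn, steps=100)
```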

Copyright © 2019 Weslie. All rights reserved.