This is just a small selection of optimizers (i.e. the algorithms that update the weights based on a loss). Nevertheless, [Adam]( or [SGD]( will typically be the first algorithm you try.
"torch.optim.lr_scheduler provides several methods to adjust the learning rate based on the number of epochs."
Why would you want to reduce the learning rate? Typically you start with a large learning rate so that the optimizer can jump over local minima. Later you want to anneal the learning rate, because otherwise the optimizer will keep jumping over or oscillating around the minimum.
A non-representative selection is:
| | |
|---|---|
|[lr_scheduler.StepLR](| Decays the learning rate of each parameter group by gamma every step_size epochs.|
|[lr_scheduler.MultiStepLR](| Decays the learning rate of each parameter group by gamma once the number of epochs reaches one of the milestones.|
|[lr_scheduler.ConstantLR](| Decays the learning rate of each parameter group by a small constant factor until the number of epochs reaches a pre-defined milestone: total_iters.|
|[lr_scheduler.LinearLR](| Decays the learning rate of each parameter group by a linearly changing small multiplicative factor until the number of epochs reaches a pre-defined milestone: total_iters.|
|[lr_scheduler.ExponentialLR](| Decays the learning rate of each parameter group by gamma every epoch.|
|[lr_scheduler.ReduceLROnPlateau](| Reduces the learning rate when a metric has stopped improving.|
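
To make the table a bit more concrete, here is a minimal sketch of how StepLR changes the learning rate over the epochs. The dummy parameter and the values for `lr`, `step_size` and `gamma` are assumptions chosen only for illustration; with them the learning rate is halved every two epochs.

```python
import torch

# A single dummy parameter is enough to drive the optimizer (illustration only).
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.SGD([param], lr=0.1)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.5)

learning_rates = []
for epoch in range(6):
    # Record the learning rate that is used during this epoch.
    learning_rates.append(optimizer.param_groups[0]["lr"])
    optimizer.step()  # in real training this follows loss.backward()
    scheduler.step()  # advance the schedule once per epoch

print(learning_rates)
```

The scheduler is stepped after the optimizer, once per epoch. The other epoch-based schedulers in the table are used in the same way.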
However, typically I only use [lr_scheduler.ReduceLROnPlateau](
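
A minimal sketch of how ReduceLROnPlateau reacts to a stalled metric. The loss values are made up for illustration, and `factor` and `patience` are assumed settings, not recommendations. Note that `.step()` gets the monitored value as an argument.

```python
import torch

# Dummy parameter so that the optimizer has something to manage (illustration only).
param = torch.nn.Parameter(torch.zeros(1))
optimizer = torch.optim.Adam([param], lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=2
)

# Hypothetical per-epoch losses: improvement at first, then a plateau.
fake_losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7]

lr_history = []
for loss_value in fake_losses:
    optimizer.step()            # in real training this follows loss.backward()
    scheduler.step(loss_value)  # the scheduler watches the metric
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

With these settings the learning rate stays at its initial value while the loss improves and is multiplied by `factor` once the loss has failed to improve for more than `patience` epochs.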
We want to monitor our progress and will use TensorBoard for this.
In the beginning we need to open a TensorBoard session:
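```python
import os

# Suppress TensorFlow's C++ log output (only relevant if TensorFlow is installed).
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"

from torch.utils.tensorboard import SummaryWriter

# Open the session; by default the event files are written to ./runs/
tb = SummaryWriter()
```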
Afterwards we need to close the TensorBoard session again.
During learning we can flush the information. This allows us to observe the development in parallel in the viewer (a viewer that is built into **VS Code**, I might add...).
We can [add scalars]( (e.g. performance or loss values).
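
The steps so far (open, add scalars, flush, close) can be sketched together as follows. The temporary log directory, the tag name "Train Loss" and the fake loss values are assumptions for illustration; in a real run the values come from the training loop.

```python
import math
import tempfile

from torch.utils.tensorboard import SummaryWriter

# Hypothetical log location; SummaryWriter() without arguments writes to ./runs/
log_dir = tempfile.mkdtemp()
tb = SummaryWriter(log_dir=log_dir)

for epoch in range(5):
    fake_loss = math.exp(-epoch)  # stand-in for a real training loss
    tb.add_scalar("Train Loss", fake_loss, epoch)
    tb.flush()  # write pending events to disk so the viewer can show them

tb.close()
```

Flushing after every epoch is a choice made here for live monitoring; it is not required for the data to be stored, because closing the writer flushes as well.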
We can also add [images](, [matplotlib figures](, [videos](, [audio](, [text](, [graph data](, and other stuff. Just because we can doesn't mean that we want to...
We can use the [event_accumulator]( to retrieve the stored information.
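
A minimal sketch of reading scalars back with the event_accumulator. It first writes a few made-up values to a temporary directory (an assumption for illustration) so that the example is self-contained, then loads them again.

```python
import tempfile

from tensorboard.backend.event_processing import event_accumulator
from torch.utils.tensorboard import SummaryWriter

# Write a few hypothetical loss values so there is something to read.
log_dir = tempfile.mkdtemp()
with SummaryWriter(log_dir=log_dir) as tb:
    for step, value in enumerate([0.9, 0.5, 0.3]):
        tb.add_scalar("Train Loss", value, step)

# Load the event files from the log directory.
acc = event_accumulator.EventAccumulator(log_dir)
acc.Reload()

scalar_tags = acc.Tags()["scalars"]  # names of all stored scalar series
events = acc.Scalars("Train Loss")   # entries with wall_time, step and value
steps = [event.step for event in events]
values = [event.value for event in events]

print(scalar_tags, steps, values)
```

The values come back as 32-bit floats, so they can differ slightly from the numbers that were written.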
## MNIST with Adam, ReduceLROnPlateau, cross-entropy on CPU
Operations you will see that are not explained yet:
| | |
|---|---|
|[network.train()](| : "Sets the module in training mode."|
|[optimizer.zero_grad()](| : "Sets the gradients of all optimized [torch.Tensor]( to zero." For every mini batch we (need to) clear the gradient which is used for training the parameters. |
|[optimizer.step()](| : "Performs a single optimization step (parameter update)."|
|[loss.backward()](| : "Computes the gradient of current tensor w.r.t. graph leaves."|
|[lr_scheduler.step(train_loss)](| : After an epoch the learning rate might be changed. For other [Learning rate scheduler]( classes, .step() might take no argument.|
|[network.eval()](| : "Sets the module in evaluation mode."|
|[with torch.no_grad():](| : "Context-manager that disables gradient calculation."|
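
To show where these operations sit relative to each other, here is a minimal sketch of a training loop. It uses random tensors as a stand-in for MNIST, and the tiny network, batch size and number of epochs are assumptions chosen only to keep the example short.

```python
import torch

torch.manual_seed(0)

# Synthetic stand-in for MNIST: 64 flattened 28x28 "images" with 10 classes.
inputs = torch.randn(64, 28 * 28)
targets = torch.randint(0, 10, (64,))
batch_size = 16

network = torch.nn.Sequential(
    torch.nn.Linear(28 * 28, 32),
    torch.nn.ReLU(),
    torch.nn.Linear(32, 10),
)
loss_function = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(network.parameters(), lr=1e-3)
lr_scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)

train_losses = []
for epoch in range(5):
    network.train()  # training mode
    train_loss = 0.0
    for start in range(0, inputs.shape[0], batch_size):
        optimizer.zero_grad()  # clear the gradients of the previous mini batch
        output = network(inputs[start : start + batch_size])
        loss = loss_function(output, targets[start : start + batch_size])
        loss.backward()   # compute the gradients
        optimizer.step()  # update the parameters
        train_loss += loss.item()
    lr_scheduler.step(train_loss)  # may reduce the learning rate
    train_losses.append(train_loss)

network.eval()  # evaluation mode
with torch.no_grad():  # no gradients needed for testing
    predictions = network(inputs).argmax(dim=1)
    accuracy = (predictions == targets).float().mean().item()

print(train_losses, accuracy)
```

Here the accuracy is measured on the training data itself, which is only acceptable for a sketch; a real experiment evaluates on a separate test set.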