Learning Rate Decay
Description
Learning Rate Decay gradually reduces the learning rate \(\alpha\) (written \(\eta\) in the formulas below) during training, so the model can move more smoothly and precisely toward a minimum of the loss.
Two terms that appear in every schedule below:
- Each epoch means one full pass over all the mini-batches.
- The decay rate is a number that indirectly controls how much \(\alpha\) decreases in each epoch.
Varieties
With an Exponential Schedule, the learning rate is multiplied by a constant factor \(\gamma\) at fixed intervals (usually every epoch).
For instance, if \(\gamma = 0.9\):
- After 10 epochs the learning rate is about 35% of the initial value.
- After 20 epochs, it is about 12%.
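These percentages are just \(0.9^{10}\) and \(0.9^{20}\); a quick check in plain Python (the initial learning rate of 1e-3 is only an illustrative value):

gamma = 0.9
initial_lr = 1e-3
for t in (10, 20):
    lr = initial_lr * gamma ** t
    print(f"epoch {t}: lr = {lr:.2e} ({gamma ** t:.0%} of the initial value)")
# epoch 10: lr = 3.49e-04 (35% of the initial value)
# epoch 20: lr = 1.22e-04 (12% of the initial value)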
With Cosine Annealing, the learning rate follows a smooth cosine curve that gradually decreases from a maximum value to a minimum value over a fixed number of epochs (or steps).
For instance:
- At the halfway point (\(t = T/2\)), the learning rate sits exactly midway between \(\eta_{\max}\) and \(\eta_{\min}\).
- Near the end (\(t \approx T\)), the learning rate becomes very small and approaches \(\eta_{\min}\) smoothly, without sudden drops.
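Both observations follow directly from the cosine formula given further down; a minimal sketch, assuming \(\eta_{\max} = 10^{-3}\), \(\eta_{\min} = 10^{-5}\), and \(T = 20\) (illustrative values only):

import math

eta_max, eta_min, T = 1e-3, 1e-5, 20

def cosine_lr(t):
    # eta_min + half the range, scaled by a cosine that goes from 1 (t=0) to -1 (t=T)
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T))

print(cosine_lr(0))       # 1e-3   (maximum at the start)
print(cosine_lr(T // 2))  # ~5.05e-4 (exact midpoint of the range)
print(cosine_lr(T))       # 1e-5   (minimum at the end, reached smoothly)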
Performance Scheduling, also called Adaptive Scheduling, tracks a chosen metric during training (typically the validation loss) and, if this metric stops improving for some time, multiplies the learning rate by a fixed factor smaller than 1.
1cycle Scheduling increases the learning rate from a lower starting value to a high maximum value during the first part of training, then decreases it back to a very small value during the second part.
This schedule is designed to:
- Help the model escape sharp or poor minima early (using a high LR),
- Then converge smoothly at the end (using a very low LR).
It typically follows three phases:
- Warm-up: \(\eta\) increases linearly from \(\eta_{\min}\) to \(\eta_{\max}\) over the first half of the cycle.
- Cool-down: \(\eta\) decreases linearly from \(\eta_{\max}\) back to \(\eta_{\min}\) over the second half.
- Final Annealing: A very short final phase where \(\eta\) drops far below \(\eta_{\min}\) to help the model settle.
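A minimal sketch of this three-phase shape in plain Python; the exact phase boundaries and the final floor of \(\eta_{\min}/100\) are illustrative assumptions, not fixed by the method:

def one_cycle_lr(t, T, eta_min=1e-4, eta_max=1e-2, final_frac=0.1):
    # Warm-up and cool-down share the first (1 - final_frac) of training.
    t_anneal = int(T * (1 - final_frac))
    t_peak = t_anneal // 2
    if t <= t_peak:      # warm-up: eta_min -> eta_max
        return eta_min + (eta_max - eta_min) * t / t_peak
    if t <= t_anneal:    # cool-down: eta_max -> eta_min
        return eta_max - (eta_max - eta_min) * (t - t_peak) / (t_anneal - t_peak)
    # final annealing: eta_min -> eta_min / 100, far below eta_min
    return eta_min - (eta_min - eta_min / 100) * (t - t_anneal) / (T - t_anneal)

lrs = [one_cycle_lr(t, T=100) for t in range(101)]  # rises to 1e-2, falls to 1e-4, then drops to 1e-6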
Formula
\(\eta_{t} = \eta_{\max} \cdot \gamma^{t}\)
- \(\eta_{t}\): learning rate of the current epoch
- \(t\): the current epoch
- \(\eta_{\max}\): the maximum (initial) learning rate
- \(\gamma\): the decay factor applied every epoch
Info
You usually choose \(\gamma < 1\) but close to 1 so the learning rate decreases slowly.
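To see why \(\gamma\) should stay close to 1, compare how many epochs it takes for the learning rate to halve (a quick calculation):

import math

for gamma in (0.9, 0.99):
    half_life = math.log(0.5) / math.log(gamma)
    print(f"gamma = {gamma}: lr halves roughly every {half_life:.0f} epochs")
# gamma = 0.9 halves the lr about every 7 epochs; gamma = 0.99 about every 69 epochs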
\(\eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{t\,\pi}{T_{\max}}\right)\right)\)
- \(\eta_{t}\): learning rate of the current epoch
- \(t\): the current epoch
- \(\eta_{\max}\): the maximum (initial) learning rate
- \(\eta_{\min}\): the minimum (final) learning rate
- \(T_{\max}\): the total number of epochs for one cosine cycle
Info
At the start (\(t = 0\)), the learning rate is at its maximum. As training progresses, the cosine function smoothly reduces the learning rate until it reaches \(\eta_{\min}\) at \(t = T_{\max}\).
If the metric does not improve for \(\text{patience}\) epochs:
\(\eta_{\text{new}} = \eta_{\text{old}} \times \text{factor}\)
- \(\eta_{\text{new}}\): the new learning rate
- \(\eta_{\text{old}}\): the previous learning rate
- \(\text{factor}\): multiplier < 1 (e.g., 0.1)
- \(\text{patience}\): epochs without improvement
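A minimal sketch of this rule in plain Python, for a metric that should increase; it mirrors the idea behind ReduceLROnPlateau while ignoring details such as cooldown and thresholds, and metric_history stands for the per-epoch validation metric values:

best_metric = float("-inf")
epochs_without_improvement = 0
lr, factor, patience = 1e-3, 0.1, 2

for val_metric in metric_history:  # one value per epoch
    if val_metric > best_metric:
        best_metric = val_metric
        epochs_without_improvement = 0
    else:
        epochs_without_improvement += 1
        if epochs_without_improvement > patience:
            lr = lr * factor  # eta_new = eta_old * factor
            epochs_without_improvement = 0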
Example
import torch.optim as optim

model = ...
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        ...  # The rest of the training loop
    scheduler.step()
import torch.optim as optim

model = ...
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=20, eta_min=1e-5)  # eta_min well below the initial lr

for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        ...  # The rest of the training loop
    scheduler.step()
Info
Cosine annealing requires choosing \(T_{\max}\) and \(\eta_{\min}\) up front, and the right training length is hard to predict, so many practitioners prefer performance-driven LR schedulers instead.
import torch
import torch.optim as optim
import torchmetrics
from torch.utils.data import DataLoader
# =========================
# Init
# =====
torch.manual_seed(42)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# =========================
# Load Data
# =====
train_loader = DataLoader(...)
valid_loader = DataLoader(...)
# =========================
# Model
# =====
model = ...
# =========================
# Evaluation Function
# =====
def evaluate_func(model, data_loader, metric):
    model.eval()
    metric.reset()  # Reset the metric at the beginning
    with torch.no_grad():
        for X_batch, y_batch in data_loader:
            X_batch = X_batch.to(device)
            y_batch = y_batch.to(device)
            y_pred = model(X_batch)
            metric.update(y_pred, y_batch)  # Update at each iteration
    return metric.compute()  # Compute the final result at the end
# =========================
# Training Loop
# =====
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max", patience=2, factor=0.1)
accuracy_metric = torchmetrics.Accuracy(task="multiclass", num_classes=10).to(device)
n_epochs = 20
for epoch in range(n_epochs):
    model.train()  # Switch back to training mode (evaluate_func leaves the model in eval mode)
    for X_batch, y_batch in train_loader:
        ...  # The rest of the training loop
    val_metric = evaluate_func(model, valid_loader, accuracy_metric).item()
    scheduler.step(val_metric)
Info
Hyperparameters:
- mode: Use "max" when the metric should increase (e.g., accuracy); use "min" when it should decrease (e.g., loss).
- patience: How many epochs to wait without improvement before lowering the learning rate.
- factor: The multiplier applied to reduce the learning rate when no improvement is seen.
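Since ReduceLROnPlateau changes the learning rate only when the metric stalls, it can help to log the current value each epoch to see exactly when the drops happen. A small addition placed inside the epoch loop above, right after scheduler.step(val_metric):

current_lr = optimizer.param_groups[0]["lr"]  # lr of the first parameter group
print(f"epoch {epoch + 1}: val accuracy = {val_metric:.4f}, lr = {current_lr:.1e}")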
import torch.optim as optim

model = ...
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
n_epochs = 20
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=0.01,
    epochs=n_epochs,
    steps_per_epoch=len(train_loader),
    pct_start=0.3,  # 30% of steps for warm-up
)
for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        ...  # The rest of the training loop
        scheduler.step()  # Note: 1cycle scheduling is applied per batch/step, not per epoch!
Info
Hyperparameters:
- max_lr: The highest learning rate in the cycle.
- total_steps: The total number of steps/iterations in the training run; this defines the cycle length (equivalently, pass epochs and steps_per_epoch, as in the example above).
- pct_start: The percentage of the total steps used for the warm-up phase (increasing \(\eta\)).
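To visualize the warm-up and cool-down phases, you can record the learning rate after every step with get_last_lr(); a minimal sketch reusing the optimizer and scheduler from the example above:

lr_history = []
for epoch in range(n_epochs):
    for X_batch, y_batch in train_loader:
        ...  # The rest of the training loop
        scheduler.step()
        lr_history.append(scheduler.get_last_lr()[0])  # current lr of the first parameter group
# lr_history rises toward max_lr for the first ~30% of steps, then decays to a very small value.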
