AdamW Algorithm
Description
AdamW is an optimization algorithm that builds on Adam by decoupling weight decay from the gradient update.
While Adam combines momentum and adaptive per-parameter learning rates to converge efficiently, it traditionally folds weight decay into the gradients as an L2 penalty. AdamW instead applies the decay directly to the parameters, which helps prevent overfitting and often yields better generalization when training deep neural networks.
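To make the decoupling concrete, here is a minimal scalar sketch of a single AdamW step in the decoupled form described by the AdamW paper. The function name, scalar state variables, and default hyperparameters are illustrative assumptions, not PyTorch's internal implementation.

import math

def adamw_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    # Moment estimates with bias correction, exactly as in Adam.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: subtracted from the parameter directly,
    # not added to the gradient before the adaptive update.
    param = param - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * param)
    return param, m, v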
Example
import torch.nn as nn
from torch.optim import AdamW

model = nn.Linear(10, 1)  # stand-in for any nn.Module
optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
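Continuing from the example above, a sketch of how the optimizer is typically driven inside a training loop; the random tensors stand in for real data and are only for illustration.

import torch

inputs = torch.randn(32, 10)   # dummy batch matching the model above
targets = torch.randn(32, 1)
loss_fn = nn.MSELoss()

optimizer.zero_grad()                    # clear gradients from the previous step
loss = loss_fn(model(inputs), targets)
loss.backward()                          # compute gradients
optimizer.step()                         # AdamW update with decoupled weight decay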
Info
The weight_decay parameter controls the strength of the decoupled weight decay. Unlike the weight_decay argument of plain Adam, which acts as an L2 (Ridge) penalty added to the gradients, AdamW applies this decay directly to the parameters at each step.
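As an illustration of tuning this parameter, the sketch below uses PyTorch's per-parameter-group options to exempt bias terms from decay; the grouping criterion is a common convention assumed here, not a requirement of the API.

# Continuing from the example above: give weights and biases separate settings.
decay, no_decay = [], []
for name, p in model.named_parameters():
    (no_decay if name.endswith("bias") else decay).append(p)

optimizer = AdamW(
    [{"params": decay, "weight_decay": 0.01},
     {"params": no_decay, "weight_decay": 0.0}],
    lr=1e-3,
)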