Overview
Description
Interpretability in neural networks (NNs) refers to the degree to which humans can understand and explain how a model processes its inputs and produces its outputs.
Varieties:
- Attention visualization techniques (see the first sketch after this list)
- Probing methods (second sketch)
- Explaining LLM predictions with attribution methods (third sketch)
- Interpretability in transformer-based LLMs
- Mechanistic interpretability (fourth sketch)
- Trade-offs between interpretability and performance
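To make the attention-visualization item concrete, here is a minimal sketch that pulls per-head attention weights out of a Hugging Face transformer and plots the last layer as a heatmap. The checkpoint name and input sentence are placeholder choices, not part of the original notes.

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; any attention-based model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = outputs.attentions[-1][0]      # last layer, first batch item
avg_attn = attn.mean(dim=0)           # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(avg_attn.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("Averaged last-layer attention")
plt.show()
```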
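Probing methods train a small supervised classifier on frozen hidden states to test whether some property is decodable from a given layer. A minimal linear-probe sketch follows; the two-example `texts`/`labels` dataset is a hypothetical stand-in, and a real probe needs a proper labeled corpus with held-out evaluation.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

texts = ["I love this movie.", "I hate this movie."]  # placeholder data
labels = [1, 0]                                       # placeholder labels

features = []
with torch.no_grad():
    for text in texts:
        outputs = model(**tokenizer(text, return_tensors="pt"))
        # hidden_states: one (batch, seq_len, hidden) tensor per layer;
        # mean-pool the tokens of a chosen layer as the probe's input.
        layer = outputs.hidden_states[6]
        features.append(layer.mean(dim=1).squeeze(0).numpy())

probe = LogisticRegression(max_iter=1000).fit(features, labels)
# High accuracy on held-out data would suggest the layer encodes the
# property; these two examples only check that the shapes line up.
print(probe.score(features, labels))
```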
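Attribution methods score input tokens by their contribution to a prediction. Below is a sketch of one of the simplest variants, gradient-times-input saliency, on a sentiment classifier; the SST-2 checkpoint is only an example, and more robust methods such as integrated gradients follow the same gradient-through-embeddings pattern.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

inputs = tokenizer("A genuinely moving film.", return_tensors="pt")
# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
pred = outputs.logits.argmax(dim=-1).item()
outputs.logits[0, pred].backward()

# Gradient x input, summed over the embedding dimension, gives one
# relevance score per token for the predicted class.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, scores.tolist()):
    print(f"{tok:>12} {score:+.4f}")
```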
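Mechanistic interpretability studies which internal components causally drive a behavior. One of its basic tools is activation patching: cache a layer's activation from a run on a "clean" prompt, splice it into a run on a "corrupted" prompt, and check whether the prediction shifts back toward the clean one. A rough sketch on GPT-2, where the layer index, patched position, and prompts are all arbitrary illustrative choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Two prompts that differ only in the subject's surname.
clean = tokenizer("Michelle Obama is married to", return_tensors="pt")
corrupt = tokenizer("Michelle Smith is married to", return_tensors="pt")

layer_idx, position = 6, 1  # arbitrary middle layer; position of the surname
store = {}

def save_hook(module, args, output):
    # Cache this block's hidden states from the clean run.
    store["clean"] = output[0].detach()

def patch_hook(module, args, output):
    # Splice the clean activation into one position of the corrupted run.
    hidden = output[0].clone()
    hidden[:, position, :] = store["clean"][:, position, :]
    return (hidden,) + output[1:]

block = model.transformer.h[layer_idx]

handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits
handle.remove()

with torch.no_grad():
    baseline = model(**corrupt).logits

# If this layer/position carries the subject's identity, the patched
# next-token prediction should move toward the clean continuation.
print("corrupted:", tokenizer.decode(baseline[0, -1].argmax().item()))
print("patched:  ", tokenizer.decode(patched[0, -1].argmax().item()))
```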