Overview
Description
Interpretability in neural networks (NNs) refers to the degree to which humans can understand and explain how a model processes its inputs and produces its outputs.
Varieties:
- Attention visualization techniques (see the first sketch after this list)
- Probing methods (second sketch)
- Explaining LLM predictions with attribution methods (third sketch)
- Interpretability in transformer-based LLMs
- Mechanistic interpretability (fourth sketch)
- Trade-offs between interpretability and performance
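To make the attention-visualization item concrete, here is a minimal sketch that pulls per-head attention weights out of a Hugging Face transformer and plots the last layer as a heatmap. The checkpoint name and input sentence are placeholder choices, not part of the original notes.

```python
import matplotlib.pyplot as plt
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; any attention-based model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one tensor per layer,
# each of shape (batch, num_heads, seq_len, seq_len).
attn = outputs.attentions[-1][0]      # last layer, first batch item
avg_attn = attn.mean(dim=0)           # average over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

plt.imshow(avg_attn.numpy(), cmap="viridis")
plt.xticks(range(len(tokens)), tokens, rotation=90)
plt.yticks(range(len(tokens)), tokens)
plt.title("Averaged last-layer attention")
plt.show()
```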
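Probing methods train a small supervised classifier on frozen hidden states to test whether some property is decodable from a given layer. A minimal linear-probe sketch follows; the two-example `texts`/`labels` dataset is a hypothetical stand-in, and a real probe needs a proper labeled corpus with held-out evaluation.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_hidden_states=True)

texts = ["I love this movie.", "I hate this movie."]  # placeholder data
labels = [1, 0]                                       # placeholder labels

features = []
with torch.no_grad():
    for text in texts:
        outputs = model(**tokenizer(text, return_tensors="pt"))
        # hidden_states: one (batch, seq_len, hidden) tensor per layer;
        # mean-pool the tokens of a chosen layer as the probe's input.
        layer = outputs.hidden_states[6]
        features.append(layer.mean(dim=1).squeeze(0).numpy())

probe = LogisticRegression(max_iter=1000).fit(features, labels)
# High accuracy on held-out data would suggest the layer encodes the
# property; these two examples only check that the shapes line up.
print(probe.score(features, labels))
```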
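Attribution methods score input tokens by their contribution to a prediction. Below is a sketch of one of the simplest variants, gradient-times-input saliency, on a sentiment classifier; the SST-2 checkpoint is only an example, and more robust methods such as integrated gradients follow the same gradient-through-embeddings pattern.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name).eval()

inputs = tokenizer("A genuinely moving film.", return_tensors="pt")
# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

outputs = model(inputs_embeds=embeds, attention_mask=inputs["attention_mask"])
pred = outputs.logits.argmax(dim=-1).item()
outputs.logits[0, pred].backward()

# Gradient x input, summed over the embedding dimension, gives one
# relevance score per token for the predicted class.
scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, scores.tolist()):
    print(f"{tok:>12} {score:+.4f}")
```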
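Mechanistic interpretability studies which internal components causally drive a behavior. One of its basic tools is activation patching: cache a layer's activation from a run on a "clean" prompt, splice it into a run on a "corrupted" prompt, and check whether the prediction shifts back toward the clean one. A rough sketch on GPT-2, where the layer index, patched position, and prompts are all arbitrary illustrative choices:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Two prompts that differ only in the subject's surname.
clean = tokenizer("Michelle Obama is married to", return_tensors="pt")
corrupt = tokenizer("Michelle Smith is married to", return_tensors="pt")

layer_idx, position = 6, 1  # arbitrary middle layer; position of the surname
store = {}

def save_hook(module, args, output):
    # Cache this block's hidden states from the clean run.
    store["clean"] = output[0].detach()

def patch_hook(module, args, output):
    # Splice the clean activation into one position of the corrupted run.
    hidden = output[0].clone()
    hidden[:, position, :] = store["clean"][:, position, :]
    return (hidden,) + output[1:]

block = model.transformer.h[layer_idx]

handle = block.register_forward_hook(save_hook)
with torch.no_grad():
    model(**clean)
handle.remove()

handle = block.register_forward_hook(patch_hook)
with torch.no_grad():
    patched = model(**corrupt).logits
handle.remove()

with torch.no_grad():
    baseline = model(**corrupt).logits

# If this layer/position carries the subject's identity, the patched
# next-token prediction should move toward the clean continuation.
print("corrupted:", tokenizer.decode(baseline[0, -1].argmax().item()))
print("patched:  ", tokenizer.decode(patched[0, -1].argmax().item()))
```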