Overview

Description

Interpretability in neural networks (NNs) refers to the degree to which humans can understand and explain how a model processes its inputs and arrives at its outputs.

Varieties:

  • Attention visualization techniques (a minimal sketch follows this list)
  • Probing methods
  • Attribution methods for explaining LLM predictions (see the second sketch below)
  • Interpretability in transformer-based LLMs
  • Mechanistic interpretability
  • Trade-offs between interpretability and performance
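
As a concrete illustration of attention visualization, here is a minimal sketch using PyTorch and the Hugging Face `transformers` library. The checkpoint `bert-base-uncased` and the example sentence are arbitrary choices for illustration, not prescribed by this overview.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Load a small encoder; "bert-base-uncased" is an illustrative choice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
model.eval()

inputs = tokenizer("Interpretability research matters.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions holds one tensor per layer, each of shape
# (batch, num_heads, seq_len, seq_len).
last_layer = outputs.attentions[-1][0]   # first (only) batch item
avg_heads = last_layer.mean(dim=0)       # average attention over heads
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

# Print a simple text "heatmap": each row shows how strongly one token
# attends to every other token in the sequence.
for i, tok in enumerate(tokens):
    row = " ".join(f"{w:.2f}" for w in avg_heads[i].tolist())
    print(f"{tok:>15}: {row}")
```

In practice these weights are usually rendered as a heatmap (e.g. with matplotlib or BertViz) rather than printed, but the extraction step above is the model-specific part.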
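A similarly minimal sketch of an attribution method, here gradient × input saliency over token embeddings. The SST-2 sentiment checkpoint is again an illustrative assumption, not a recommendation.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf sentiment classifier; the checkpoint is illustrative.
name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")

# Embed the tokens ourselves so we can take gradients w.r.t. the embeddings.
embeds = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeds.requires_grad_(True)

logits = model(inputs_embeds=embeds,
               attention_mask=inputs["attention_mask"]).logits
predicted = logits.argmax(dim=-1).item()
logits[0, predicted].backward()

# Gradient x input, summed over the embedding dimension, yields one
# attribution score per token for the predicted class.
scores = (embeds.grad * embeds).sum(dim=-1).abs()[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, score in zip(tokens, scores.tolist()):
    print(f"{tok:>10}: {score:.4f}")
```

Gradient × input is one of the simplest attribution methods; the same gradient-through-embeddings pattern underlies more robust variants such as integrated gradients.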