Integrated Gradients [NLP] [Attribution]

Description

Integrated gradients is an attribution method used to explain the predictions of neural networks by quantifying the contribution of each input feature to the model's output. It computes feature attributions by integrating the gradients of the model's output with respect to the input along a straight-line path from a baseline to the actual input.
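
As a worked illustration of the path integral, here is a minimal sketch (independent of the NLP example further down) that approximates the integral with a Riemann sum on a toy quadratic function standing in for the network; all names here are illustrative:

import torch

# Toy scalar-valued "model": F(x) = sum(x_i^2).
def f(x):
    return (x ** 2).sum()

def integrated_gradients_toy(fn, x, baseline, steps=50):
    # Riemann-sum approximation of the integral of the gradient along the
    # straight line from the baseline to the input.
    alphas = torch.linspace(0, 1, steps)
    total_grad = torch.zeros_like(x)
    for alpha in alphas:
        point = (baseline + alpha * (x - baseline)).requires_grad_()
        fn(point).backward()
        total_grad += point.grad
    # Scale the averaged gradient by (input - baseline).
    return (x - baseline) * total_grad / steps

x = torch.tensor([1.0, 2.0, 3.0])
baseline = torch.zeros_like(x)
print(integrated_gradients_toy(f, x, baseline))  # [1., 4., 9.], i.e. x**2

For F(x) = sum(x_i^2) with a zero baseline, the exact attributions are x_i^2, which the sketch reproduces.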

Warning

Keep in mind that gradient-based methods in LLMs can be noisy due to sensitivity to input perturbations, mini-batch variance, and gradient saturation, which affects both training stability and interpretability. In optimization, this noise can cause oscillations or suboptimal convergence; in interpretability, methods such as integrated gradients can produce inconsistent attributions across runs or across similar inputs, which reduces trust in the resulting insights. Techniques such as gradient smoothing, averaging, and second-order optimization help mitigate the noise but add computational overhead, creating a trade-off between efficiency and precision in LLM development.
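
As a rough sketch of the averaging idea mentioned above (SmoothGrad-style: recompute attributions on several noise-perturbed copies of a continuous input such as token embeddings, then average; attribution_fn and the noise scale are placeholders, not part of the example below):

import numpy as np

def averaged_attributions(attribution_fn, inputs, noise_scale=0.05, n_runs=10):
    # Re-run the attribution method on noise-perturbed copies of the input
    # (e.g. token embeddings) and average the results.
    # This reduces run-to-run noise at n_runs times the compute cost.
    runs = []
    for _ in range(n_runs):
        noisy_inputs = inputs + np.random.normal(scale=noise_scale, size=inputs.shape)
        runs.append(attribution_fn(noisy_inputs))
    return np.mean(runs, axis=0)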

Example

import torch
from transformers import BertTokenizer, BertForSequenceClassification
import numpy as np
import matplotlib.pyplot as plt

def integrated_gradients(model, tokenizer, text, target_class, steps=50):
    model.eval()
    input_ids = tokenizer.encode(text, return_tensors="pt")
    attention_mask = torch.ones_like(input_ids)

    # Gradients cannot flow through discrete token IDs, so interpolate in the
    # embedding space between an all-zero baseline and the input embeddings.
    input_embeds = model.bert.embeddings.word_embeddings(input_ids).detach()
    baseline_embeds = torch.zeros_like(input_embeds)
    delta = input_embeds - baseline_embeds

    alphas = torch.linspace(0, 1, steps)
    accumulated_grads = torch.zeros_like(input_embeds)

    for alpha in alphas:
        interpolated = (baseline_embeds + alpha * delta).requires_grad_()

        outputs = model(inputs_embeds=interpolated, attention_mask=attention_mask)
        pred = outputs.logits[0, target_class]

        model.zero_grad()
        pred.backward()

        accumulated_grads += interpolated.grad

    # Average the gradients along the path, scale by (input - baseline), and
    # sum over the embedding dimension to obtain one attribution per token.
    attributions = (delta * accumulated_grads / steps).sum(dim=-1)
    return attributions.squeeze().detach().numpy()

model_name = "bert-base-uncased"
model = BertForSequenceClassification.from_pretrained(model_name)
tokenizer = BertTokenizer.from_pretrained(model_name)

text = "This movie was fantastic!"
target_class = 1  # Assuming 1 is the positive sentiment class

attributions = integrated_gradients(model, tokenizer, text, target_class)

# Visualize attributions
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(text))
plt.figure(figsize=(10, 5))
plt.bar(range(len(tokens)), attributions)
plt.xticks(range(len(tokens)), tokens, rotation=45)
plt.title("Integrated Gradients Attribution")
plt.show()

Info

After defining the model and tokenizer, the code runs the attribution method on an example text and displays the results as a bar plot, where each bar corresponds to a token and shows its importance for the target class (the token's embedding-level attributions summed over the embedding dimension).