This repository collects all relevant resources about interpretability in LLMs
Updated Nov 1, 2024
MICCAI 2022 (Oral): Interpretable Graph Neural Networks for Connectome-Based Brain Disorder Analysis
[KDD'22] Source codes of "Graph Rationalization with Environment-based Augmentations"
(ICML 2023) Discover and Cure: Concept-aware Mitigation of Spurious Correlation
Official code for the CVPR 2022 (oral) paper "OrphicX: A Causality-Inspired Latent Variable Model for Interpreting Graph Neural Networks."
[ICCV 2023] Learning Support and Trivial Prototypes for Interpretable Image Classification
[TPAMI 2025] Mixture of Gaussian-distributed Prototypes with Generative Modelling for Interpretable and Trustworthy Image Recognition
[CVPR 2025] Concept Bottleneck Autoencoder (CB-AE): efficiently transforms any pretrained (black-box) image generative model into an interpretable generative concept bottleneck model (CBM) with minimal concept supervision, while preserving image quality
Codebase for the paper "The Remarkable Robustness of LLMs: Stages of Inference?"
hopwise: A Python Library for Explainable Recommendation based on Path Reasoning over Knowledge Graphs, ACM CIKM '25
TraceFL is a novel mechanism for Federated Learning that achieves interpretability by tracking neuron provenance. It identifies clients responsible for global model predictions, achieving 99% accuracy across diverse datasets (e.g., medical imaging) and neural networks (e.g., GPT).
Explainable AI: From Simple Rules to Complex Generative Models
Layer-wise Semantic Dynamics (LSD) is a model-agnostic framework for hallucination detection in Large Language Models (LLMs). It analyzes the geometric evolution of hidden-state semantics across transformer layers, using contrastive alignment between model activations and ground-truth embeddings to detect factual drift and semantic inconsistency.
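The repository's exact method is not reproduced here; as a rough illustration of the idea of tracking semantic drift across transformer layers, one might compare each layer's pooled hidden state against a reference embedding (the real LSD framework uses contrastive alignment, not the raw cosine similarity sketched below, and `layerwise_drift` is a hypothetical name):

```python
import numpy as np

def layerwise_drift(layer_states, reference):
    """Cosine similarity of each layer's mean-pooled hidden state
    against a reference (ground-truth) embedding. A sharp drop in
    similarity across layers would be read as semantic drift.
    Illustrative sketch only, not the LSD implementation."""
    sims = []
    for h in layer_states:                     # h: (seq_len, hidden_dim)
        pooled = h.mean(axis=0)                # mean-pool over tokens
        sims.append(float(
            pooled @ reference
            / (np.linalg.norm(pooled) * np.linalg.norm(reference) + 1e-9)
        ))
    return sims

# toy usage: 4 layers of activations, seq_len 5, hidden_dim 8
rng = np.random.default_rng(0)
states = [rng.normal(size=(5, 8)) for _ in range(4)]
ref = rng.normal(size=8)
scores = layerwise_drift(states, ref)          # one similarity per layer
```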
This repository contains the official code of the paper: "Comprehensive Information Bottleneck for Unveiling Universal Attribution to Interpret Vision Transformers", which is published in CVPR 2025.
Explainable Boosting Machines
Semi-supervised Concept Bottleneck Models (SSCBM)
Code for locating "critical neurons" in LLMs. We show that masking as few as 3 neurons can cripple a model's capabilities (ICLR 2026).
Explainable Speaker Recognition
An interpretable model for survival prediction in competing-risks settings. Check out our blog! https://vectorinstitute.github.io/crisp-nam/blog/
Build a neural net from scratch, without Keras or PyTorch, using only NumPy for the math and pandas for data loading.
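A minimal sketch of the kind of from-scratch NumPy network that description suggests: a two-layer net trained with manual backpropagation on toy data (all names and the toy task below are illustrative, not taken from the repository):

```python
import numpy as np

# Toy data: XOR-like binary labels from 2-D inputs.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float)[:, None]

# Two-layer network: 2 -> 8 (tanh) -> 1 (sigmoid).
W1, b1 = rng.normal(scale=0.5, size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.5

for _ in range(500):
    # forward pass
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))   # sigmoid output

    # backward pass (gradient of binary cross-entropy w.r.t. logits)
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = (dz2 @ W2.T) * (1.0 - h ** 2)         # tanh derivative
    dW1, db1 = X.T @ dh, dh.sum(axis=0)

    # plain gradient-descent updates
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

accuracy = float(((p > 0.5) == y).mean())
```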