Explaining Deep Neural Networks Through Inverse Classification

Abstract

Deep neural networks (DNNs) have achieved impressive performance across a wide range of tasks, yet their lack of interpretability remains a fundamental barrier to trust, safety, and deployment in critical applications. This talk addresses this challenge through the lens of inverse classification—the task of determining minimal, semantically meaningful changes to an input that would alter a model’s prediction—thereby uncovering the inner mechanisms and vulnerabilities of DNNs. In particular, the work contributes novel mathematical frameworks and algorithms for generating and learning from structured, sparse, and plausible counterfactuals and adversarial examples, offering new perspectives on explainability and robustness.

The first part of the talk introduces group-wise sparse adversarial attacks that exploit structured perturbations to fool classifiers while remaining interpretable. By leveraging a two-phase optimization algorithm—combining a non-convex ℓ₁/₂-quasinorm proximal step with projected Nesterov’s accelerated gradient descent—this method crafts perturbations that concentrate in semantically meaningful regions. The resulting adversarial examples are not only effective and computationally efficient but also provide visual explanations of model vulnerabilities.

Building on this, the second part focuses on sparse, plausible counterfactual explanations (S-CFEs)—perturbations that minimally modify inputs while remaining aligned with the data manifold. Addressing the inherent complexity of this non-convex optimization problem, the proposed approach employs accelerated proximal gradient methods capable of handling non-smooth ℓₚ (0 ≤ p < 1) sparsity-inducing regularizers. This yields counterfactuals that are actionable, model-agnostic, and efficient to compute, enabling deeper insights into model decision boundaries through human-interpretable examples.

Finally, the talk explores how these plausible counterfactuals (p-CFEs) can be used to improve model robustness. By training classifiers on p-CFEs labeled with induced incorrect target classes, it is shown that models can learn to ignore spurious correlations and generalize better—even outperforming models trained on adversarial examples in mitigating bias. This novel training paradigm demonstrates that counterfactuals are not only diagnostic tools but also constructive elements for improving fairness and reliability.

Together, these contributions advance the field of explainable Artificial Intelligence by tightly integrating methods for generating, analyzing, and learning from inverse classifications. The proposed frameworks provide practical tools for interpreting deep learning systems and lay a foundation for future work on training more transparent and trustworthy models.
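
To make the optimization flavor concrete, the sketch below (illustrative only, not the exact algorithms presented in the talk) shows a FISTA-style accelerated proximal-gradient loop that crafts a sparse targeted perturbation for a PyTorch classifier. The ℓ₁/₂ penalty is handled with the known closed-form half-thresholding operator, and the projection step uses an ℓ∞ box; `model`, `x`, `target`, and all hyperparameters are placeholders, not values from the talk.

```python
# Illustrative sketch only: a generic accelerated proximal-gradient attack,
# not the speaker's group-wise method. `model`, `x`, `target`, and all
# hyperparameters are assumed placeholders.
import math
import torch
import torch.nn.functional as F

def prox_l_half(v, lam):
    """Component-wise proximal operator of lam * |.|^(1/2) (half-thresholding);
    entries whose magnitude falls below the threshold are set to zero."""
    thresh = (54.0 ** (1.0 / 3.0)) / 4.0 * lam ** (2.0 / 3.0)
    arg = (lam / 8.0) * (v.abs().clamp(min=1e-12) / 3.0) ** (-1.5)
    phi = torch.acos(arg.clamp(max=1.0))
    out = (2.0 / 3.0) * v * (1.0 + torch.cos(2.0 * math.pi / 3.0 - (2.0 / 3.0) * phi))
    return torch.where(v.abs() > thresh, out, torch.zeros_like(v))

def sparse_targeted_perturbation(model, x, target, step=0.01, lam=1e-3,
                                 eps=0.1, iters=200):
    """Gradient step on the attack loss, prox step on the l_{1/2} penalty,
    projection onto an l_inf box of radius eps, with Nesterov momentum."""
    delta = torch.zeros_like(x)      # current sparse perturbation
    z, t = delta.clone(), 1.0        # extrapolation point and momentum scalar
    for _ in range(iters):
        z = z.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x + z), target)  # pull prediction toward `target`
        grad, = torch.autograd.grad(loss, z)
        with torch.no_grad():
            delta_new = prox_l_half(z - step * grad, step * lam).clamp(-eps, eps)
            t_new = 0.5 * (1.0 + math.sqrt(1.0 + 4.0 * t * t))
            z = delta_new + (t - 1.0) / t_new * (delta_new - delta)  # extrapolation
            delta, t = delta_new, t_new
    return delta
```

In the same spirit, replacing the cross-entropy attack loss with a distance-to-input term plus a data-manifold (plausibility) penalty would turn this loop into a rough sketch of sparse counterfactual generation, though the S-CFE formulation in the talk may differ in its exact objective and constraints.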

Date
Oct 13, 2025 11:00 AM — 1:00 PM
Event
Technische Universität Berlin
Location
Berlin, Germany
Shpresim Sadiku
PhD in Mathematics @TUBerlin
MSc in Data Science @TUM