Research Scientist · Frontier Model Evals, Alignment Robustness & Trustworthy AI
I work on stress-testing model behavior beyond average-case evaluation: residual knowledge after unlearning, prompt-specific reward tail risk, predictive multiplicity, and hidden failure modes in generative systems. My goal is to turn uncertainty, privacy, and robustness failures into operational evals for model behavior, alignment, and intervention verification.
I care about building reliable evaluation methods for frontier models and AI systems: tests that expose where models are brittle, unstable, or unsafe even when aggregate metrics look strong. Current and previous threads include:
AlignmentTailBench: Lightweight eval harness for ranking prompts by reward variance, CVaR, lower-tail reward, and failure examples. (Repository coming soon.)
Stress test for unlearning and model editing: finds cases where models pass forgetting checks but recover knowledge under perturbations. (Repository coming soon.)
Demo for surfacing unstable individual decisions across near-equivalent models or checkpoints with similar aggregate performance. (Repository coming soon.)
Shows that models can appear to forget under standard tests while retaining residual knowledge under perturbed deletion queries.
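As a minimal sketch of the tail-risk ranking idea behind AlignmentTailBench: given repeated reward samples per prompt, rank prompts by lower-tail CVaR rather than mean reward. Function names and the dictionary-of-samples input format here are illustrative assumptions, not the harness's actual API.

```python
import numpy as np

def prompt_tail_stats(rewards, alpha=0.1):
    """Tail-risk summary for one prompt's sampled rewards.

    rewards: 1-D array of reward scores from repeated generations
             for the same prompt (illustrative input format).
    alpha:   tail fraction; CVaR_alpha is the mean reward over the
             worst alpha-fraction of samples.
    """
    r = np.sort(np.asarray(rewards, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))  # size of the lower tail
    tail = r[:k]
    return {
        "mean": float(r.mean()),
        "variance": float(r.var()),
        "cvar": float(tail.mean()),  # conditional value-at-risk (lower tail)
        "worst": float(r[0]),        # single worst sampled reward
    }

def rank_prompts_by_tail_risk(reward_samples, alpha=0.1):
    """Rank prompts from riskiest (lowest CVaR) to safest."""
    stats = {p: prompt_tail_stats(r, alpha) for p, r in reward_samples.items()}
    return sorted(stats.items(), key=lambda kv: kv[1]["cvar"])
```

The point of ranking by CVaR instead of the mean: a prompt whose rewards are high on average but occasionally collapse sorts to the top of the risk list, which is exactly the failure mode average-case evaluation hides.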
Studies watermarking behavior in low-entropy LLM regimes where naive watermarking may be brittle or distortive.
Detects hybrid human/LLM text rather than assuming documents are fully human- or machine-written.
Develops unlearning methods for generative image-to-image models.
Analyzes conflicting individual predictions among similarly accurate gradient boosting models.
Uses dropout-based exploration to estimate model multiplicity more efficiently.
Studies how group-level fairness constraints can induce individual-level arbitrariness.
Introduces a metric for quantifying how much near-optimal models disagree on individual decisions.
Frames prompt-specific reward distributions and tail outcomes as signals for inference-time alignment risk.
Protects private attributes while preserving downstream utility through stochastic data substitution.
Uses information-theoretic transformations to suppress multiple protected attributes while retaining utility.
Develops fairness interventions for multi-class prediction using information projection.
Extends correspondence analysis for representation and machine-learning applications.
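To make the predictive-multiplicity thread above concrete, here is a sketch of two common disagreement measures over a set of near-optimal models: the fraction of examples on which the models do not all agree, and the worst pairwise disagreement rate. The function names and the hard-label matrix input are assumptions for illustration, not the metric definitions from any specific paper above.

```python
import numpy as np

def ambiguity(predictions):
    """Fraction of examples where at least two near-optimal models
    disagree on the predicted label.

    predictions: (n_models, n_examples) array of hard labels from
    models with comparable aggregate accuracy (illustrative setup).
    """
    preds = np.asarray(predictions)
    # An example is 'ambiguous' if the models' labels are not all equal.
    disagree = (preds != preds[0]).any(axis=0)
    return float(disagree.mean())

def discrepancy(predictions):
    """Maximum pairwise disagreement rate between any two models."""
    preds = np.asarray(predictions)
    m = len(preds)
    return max(
        float((preds[i] != preds[j]).mean())
        for i in range(m) for j in range(i + 1, m)
    )
```

Two models can share, say, 90% aggregate accuracy while these metrics reveal that a substantial fraction of individual decisions would flip depending on which model happened to be deployed.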
Leading work on adversarial evaluation, unlearning verification, reward and decision instability, and model safety governance.
Developed evaluation methods for residual knowledge, LLM watermarking, partial-LLM text detection, hallucination probing, and predictive multiplicity.
Worked on information-theoretic tools for fairness, privacy, representation learning, and decision uncertainty.
Thesis: Information-Theoretic Tools for Machine Learning Beyond Accuracy — fairness, privacy, and decision uncertainty.
Best Master’s Thesis Award and National Young Scholar Best Paper Award.