Research Engineer · IBM Research, India
RL Post-Training · Agentic Reasoning · LLM Evaluation · Foundation Models for Structured Data
I am a Research Engineer at IBM Research, India, with about a decade of work across applied research and engineering. I am currently exploring sample efficiency in GRPO-style RL fine-tuning for reasoning models. Generation dominates the compute budget of these methods, and a large fraction of the groups they produce are degenerate: the completions in a group either all succeed or all fail, so every reward equals the group-mean baseline and every advantage collapses to zero. The question, then, is what a training procedure should estimate, per prompt and per current policy, so that its rollouts yield groups whose advantages are non-zero and informative. A working paper on this is in preparation.
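To make the degenerate-group problem concrete, here is a minimal sketch of group-relative advantages. It assumes the commonly used mean-and-standard-deviation normalisation with binary rewards; the function name `group_advantages` and the `eps` constant are illustrative choices, not taken from the working paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Illustrative only: assumes the common GRPO-style normalisation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# A mixed group: some rollouts succeed, some fail, so advantages are non-zero.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Degenerate groups: every reward equals the group mean, so every
# advantage is exactly zero and the group contributes no gradient signal.
all_fail = group_advantages([0.0, 0.0, 0.0, 0.0])
all_pass = group_advantages([1.0, 1.0, 1.0, 1.0])
```

In the two degenerate cases the generation cost has been paid in full, yet the policy update receives nothing from the group, which is the inefficiency the paragraph above describes.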
I also work on foundation models for structured (tabular) data. My most recent paper treats tables as a modality and pulls apart two problems that table-understanding work tends to bundle together: what the model has to learn about a table itself, despite header drift and cell-level noise, and how it should attend to the small subset of cells, and the surrounding text, that a given question actually depends on. The system is a structure-aware encoder fused end to end with an 8B-parameter decoder-only LLM, pretrained with a mixture of corruption-aware denoisers and aligned in fine-tuning to passages of linked external text (AISTATS 2026).
A separate strand of work covers LLM-based agents for root-cause diagnosis of ETL pipeline failures, now deployed in IBM's data integration product (SIGMOD 2026 Demo). A companion benchmark for evaluating this class of agents, DataBench, is under review at KDD 2026. Earlier work covered formal verification of individual fairness, concept-based ante-hoc explainability, AutoML pipeline configuration as constrained optimisation, MCMC-based synthesis of probabilistic programs, and automated test-input generation for ML systems.
Email is the best way to reach me.
Psychometric measurement for LLM evaluation and training. Past work used 2PL IRT to rank 91 vision models from 10 calibrated ImageNet items, with Kendall τ = 0.85 against the full-benchmark ranking (ICML DMLR 2024). I am currently extending this measurement framework to GRPO-style RL post-training, with a working paper in preparation.
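For readers unfamiliar with 2PL IRT, the sketch below shows the standard two-parameter logistic item response function and the corresponding Fisher information of an item. The function names and the parameter values in the usage lines are hypothetical and are not drawn from the paper.

```python
import math


def p_correct_2pl(theta, a, b):
    """2PL item response function: P(correct) = sigmoid(a * (theta - b)).

    theta: ability of the model being measured
    a:     item discrimination
    b:     item difficulty
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * p * (1 - p)."""
    p = p_correct_2pl(theta, a, b)
    return a * a * p * (1.0 - p)


# Hypothetical item: discrimination 2.0, difficulty 0.0.
p_at_difficulty = p_correct_2pl(theta=0.0, a=2.0, b=0.0)      # ability equals difficulty
info_at_difficulty = item_information(theta=0.0, a=2.0, b=0.0)
info_far_away = item_information(theta=4.0, a=2.0, b=0.0)      # item is far too easy
```

An item is most informative when the ability being measured sits near its difficulty, which is why a small set of well-calibrated items can recover a ranking that would otherwise need the full benchmark.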
The premise is that tables are a modality in their own right. The most recent paper separates two problems: teaching the model what a table is despite header drift and cell-level noise (via a mixture of corruption-aware denoisers over a structure-aware encoder), and getting it to attend to the cells and surrounding text that a question depends on (via end-to-end fusion with an 8B-parameter decoder-only LLM). AISTATS 2026.
Verifying and explaining what ML models actually do. Formal verification of individual fairness for tabular classifiers (UAI 2020), concept-based ante-hoc explanations (CVPR 2022), adversarial robustness via cascaded defenses (IJCNN 2019), and automated test-input generation for ML systems.
MCMC-based synthesis of probabilistic programs from observation traces (PLDI 2015). ADMM-based formulations of the Combined Algorithm Selection and Hyperparameter optimisation (CASH) problem for AutoML pipeline configuration (AAAI 2020).
I maintain two open-source projects.
rl-experiments is a small PyTorch sandbox that compares RL post-training update rules on
bandit and sequence tasks. It studies what those update rules do to a policy
when reward is sparse, noisy, delayed, or vector-valued: which samples and
tokens an update should weight, how stale reused off-policy data can be, and
what drives entropy collapse. Companion setups cover on-policy distillation,
multi-objective optimization under vector-valued rewards, and GRPO for
tool-using LMs that recursively call themselves, where credit must propagate
over a rollout tree rather than a flat sequence.
minilab runs the full training pipeline (pretraining, SFT, preference optimization,
RLVR) end to end on a single consumer GPU. At small scale, SFT and preference
tuning shift response format faster than task accuracy, since re-weighting
cannot add capability the base never learned, and GRPO produces no gradient
on zero-variance groups, where every rollout earns the same reward. minilab
also includes a masked-diffusion track that repeats the same four stages with
diffusion-native objectives, including diffusion analogues of DPO and GRPO.
Full publication list on Google Scholar and DBLP.