TABULA-R²: A Reproducible Tabular Reasoning Benchmark for Local LLMs
Independent research with Prof. Philipp Koehn (Johns Hopkins University)
Time. 2025
Affiliation. Johns Hopkins University (remote) + The University of Hong Kong
Role. Independent researcher; end-to-end designer & sole implementer
Tagline. Fully reproducible PLAN/END benchmark for evaluating multi-table reasoning in locally deployed LLMs.
Summary. TABULA-R² separates genuine reasoning ability from formatting tricks by requiring models to emit executable reasoning plans before answering, evaluated on realistic data from Our World in Data.
Highlights.
- Curated and cleaned 129+ tables spanning health, economics, environment, demographics, and education, grouped into coherent multi-table collections.
- Authored and validated 263 human-written questions (single-table, multi-table, distractor variants) covering arithmetic, conditional logic, language inference, and entity alignment.
- Designed a PLAN/END prompting protocol plus a lightweight DSL (filter, aggregate, join, align) so models must output reasoning plans that a Python executor can run deterministically (see the executor sketch after this list).
- Implemented a robust executor and a unified validator that tolerate minor formatting drift and can fall back to a local LLM judge for True/False adjudication (see the validator sketch after this list).
- Produced 80+ pages of behavioral analysis: multi-table contexts halve success rates, distractors help only when they clarify the schema, and strict output protocols can let zero-shot prompts beat few-shot and chain-of-thought prompting.
- Shipped as a complete artifact with data, code, and documentation for reproducible benchmarking.
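
To make the PLAN/END protocol concrete, here is a minimal sketch of how a model-emitted plan could be executed deterministically. The operation names (filter, aggregate, join) come from the highlights above, but the plan syntax, table and column names, and the executor API are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a PLAN/END-style plan executor (illustrative only).
# The operation names come from the benchmark's DSL; the plan structure,
# table/column names, and helper logic below are assumptions for this example.
import pandas as pd

# A hypothetical plan a model might emit between PLAN and END markers:
# each step names a DSL operation and its arguments.
example_plan = [
    {"op": "filter", "table": "life_expectancy", "where": ("year", "==", 2019)},
    {"op": "join", "left": "_prev", "right": "gdp_per_capita", "on": "country"},
    {"op": "aggregate", "table": "_prev", "column": "life_expectancy", "how": "mean"},
]

def execute_plan(plan, tables):
    """Run DSL steps deterministically over a dict of pandas DataFrames."""
    result = None
    for step in plan:
        if step["op"] == "filter":
            df = tables[step["table"]]
            col, cmp, val = step["where"]
            if cmp == "==":
                result = df[df[col] == val]
        elif step["op"] == "join":
            left = result if step["left"] == "_prev" else tables[step["left"]]
            result = left.merge(tables[step["right"]], on=step["on"])
        elif step["op"] == "aggregate":
            df = result if step["table"] == "_prev" else tables[step["table"]]
            result = getattr(df[step["column"]], step["how"])()
        else:
            raise ValueError(f"unknown DSL op: {step['op']}")
    return result
```

Executing the plan rather than parsing free-form answers is what makes the protocol deterministic: two runs of the same plan over the same tables always produce the same result.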
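
The validator highlight above can likewise be illustrated with a short sketch of drift-tolerant answer checking with an optional LLM-judge fallback. The normalization rules, numeric tolerance, and judge interface here are assumptions made for illustration; the actual tolerances and fallback prompt live in the repository.

```python
# Minimal sketch of a drift-tolerant answer validator (illustrative only).
# The normalization rules and judge interface are assumptions, not the
# benchmark's exact implementation.
import re

def normalize(ans: str) -> str:
    """Lower-case, trim, and collapse whitespace/trailing punctuation."""
    return re.sub(r"\s+", " ", ans.strip().lower().rstrip(". "))

def answers_match(pred: str, gold: str, rel_tol: float = 1e-3) -> bool:
    """Exact match after normalization, or numeric match within a relative tolerance."""
    if normalize(pred) == normalize(gold):
        return True
    try:
        p = float(normalize(pred).replace(",", ""))
        g = float(normalize(gold).replace(",", ""))
        return abs(p - g) <= rel_tol * max(1.0, abs(g))
    except ValueError:
        return False

def validate(pred: str, gold: str, llm_judge=None) -> bool:
    """Deterministic check first; optionally defer True/False items to a local judge."""
    if answers_match(pred, gold):
        return True
    if llm_judge is not None and normalize(gold) in {"true", "false"}:
        # llm_judge is any callable returning "true"/"false"; its prompt
        # and underlying local model are left to the caller.
        return normalize(llm_judge(pred, gold)) == normalize(gold)
    return False
```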
Keywords. LLMs, tabular reasoning, benchmark, DSL, evaluation, reproducibility.
Links. GitHub repository • Technical report (PDF)