Spider2-V: Multimodal Data-Science Web Agent Evaluation Platform
Research assistant & task designer (NeurIPS 2024 D&B Spotlight)
Time. 2023–2024
Affiliation. HKU XLANG Lab
Role. Contributing researcher.
Tagline. Benchmark multimodal web agents on GUI-heavy, real-world data-science workflows.
Summary. Spider2-V pairs enterprise-style tools (dashboards, notebooks, file operations) with multimodal perception to stress-test agent planning, grounding, and GUI control across end-to-end data-science workflows.
Contributions.
- Co-designed the Airflow-based tool/task graphs and prompt templates that let agents orchestrate multi-step ETL and visualization workflows (an illustrative DAG sketch follows this list).
- Helped define the Thought → Action → Observation scaffold, adding pyautogui control hooks and DOM-representation ablations (accessibility tree vs. raw DOM) for fair comparisons; a stripped-down loop sketch appears below.
- Implemented evaluation scripts and metrics covering correctness, efficiency, and GUI robustness (see the aggregation sketch below); also contributed to Dockerized environment initialization and dataset packaging.
- Explored multimodal reasoning trade-offs with BLIP-2 and MiniGPT variants, finetuned via Hugging Face Transformers with LoRA adapters (config sketch below).
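The ETL-plus-visualization task graphs can be pictured as ordinary Airflow DAGs. The sketch below is a minimal illustration, not a task definition from the benchmark: the DAG id, the callables (extract_sales, transform_sales, plot_report), and the schedule are hypothetical, and Airflow 2.4+ is assumed.

```python
# Minimal, illustrative Airflow DAG: a three-step ETL + visualization pipeline.
# Task names and callables are hypothetical; Spider2-V's actual task graphs
# live in the benchmark's task definitions, not here.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_sales(**_):
    ...  # e.g. pull raw CSVs from object storage


def transform_sales(**_):
    ...  # e.g. clean and aggregate into a reporting table


def plot_report(**_):
    ...  # e.g. render the chart the agent must later verify in the GUI


with DAG(
    dag_id="sales_etl_and_viz",
    start_date=datetime(2024, 1, 1),
    schedule=None,  # triggered manually during task setup (Airflow 2.4+ syntax)
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_sales)
    transform = PythonOperator(task_id="transform", python_callable=transform_sales)
    visualize = PythonOperator(task_id="visualize", python_callable=plot_report)

    extract >> transform >> visualize  # linear tool/task graph
```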
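The Thought → Action → Observation scaffold reduces to a short control loop once the agent is asked to emit executable pyautogui snippets as actions. This is a minimal sketch under that assumption: query_agent, the response format, and the in-process exec are placeholders for the benchmark's VM-sandboxed controller.

```python
# Illustrative Thought -> Action -> Observation loop. The agent is assumed to
# return a short pyautogui snippet as its action; query_agent() and the
# record format are placeholders, not the benchmark's actual interface.
import io

import pyautogui


def query_agent(instruction: str, history: list[dict]) -> dict:
    """Placeholder for a multimodal LLM call; returns {'thought': str, 'action': str}."""
    raise NotImplementedError


def run_episode(instruction: str, max_steps: int = 15) -> list[dict]:
    history: list[dict] = []
    for _ in range(max_steps):
        step = query_agent(instruction, history)

        if step["action"].strip() == "DONE":  # agent signals task completion
            history.append(step)
            break

        # Execute the predicted GUI action, e.g. "pyautogui.click(412, 230)".
        # Spider2-V runs this inside an isolated VM; exec() here is a stand-in.
        exec(step["action"], {"pyautogui": pyautogui})

        # Observation: a fresh screenshot of the desktop after the action.
        screenshot = pyautogui.screenshot()
        buf = io.BytesIO()
        screenshot.save(buf, format="PNG")

        step["observation"] = buf.getvalue()  # raw PNG bytes for the next turn
        history.append(step)
    return history
```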
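On the evaluation side, the headline numbers are simple aggregates over per-task checker outcomes. The aggregator below is a hypothetical sketch, assuming each episode has already been reduced to a (task_id, success, steps) record by its task-specific checker; the record schema is illustrative, not Spider2-V's on-disk format.

```python
# Hypothetical aggregation of per-episode results into headline metrics:
# success rate, plus average step count on solved episodes as a rough
# efficiency signal. The EpisodeResult schema is illustrative only.
from dataclasses import dataclass


@dataclass
class EpisodeResult:
    task_id: str
    success: bool  # did the task-specific checker pass?
    steps: int     # number of Thought -> Action -> Observation turns used


def summarize(results: list[EpisodeResult]) -> dict[str, float]:
    total = len(results)
    solved = [r for r in results if r.success]
    return {
        "success_rate": len(solved) / total if total else 0.0,
        "avg_steps_when_solved": (
            sum(r.steps for r in solved) / len(solved) if solved else float("nan")
        ),
    }
```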
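Finally, the LoRA finetuning setup follows the standard Hugging Face transformers + peft recipe. A minimal sketch, assuming the public Salesforce/blip2-opt-2.7b checkpoint; the rank, alpha, dropout, and target modules are illustrative defaults rather than the project's tuned values.

```python
# Sketch of attaching LoRA adapters to a BLIP-2 checkpoint with Hugging Face
# transformers + peft. Hyperparameters and the choice of target modules
# ("q_proj"/"v_proj" in the OPT language model) are illustrative assumptions.
import torch
from peft import LoraConfig, get_peft_model
from transformers import Blip2ForConditionalGeneration, Blip2Processor

model_name = "Salesforce/blip2-opt-2.7b"
processor = Blip2Processor.from_pretrained(model_name)
model = Blip2ForConditionalGeneration.from_pretrained(
    model_name, torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections of the LM
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters remain trainable
```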
Keywords. Multimodal agents, data-science automation, GUI control, evaluation, NeurIPS Spotlight.
Links. Paper, code, project page.