Yeqiao Fu

Student, University of Hong Kong

u3597466@connect.hku.hk

Agent Tool Calling Ability Evaluation Framework

Independent prototype on top of Arkitect (Volcengine Ark Group)

Time. Summer 2025
Affiliation. ByteDance Volcengine Ark Group
Role. Agent Systems Intern; sole designer & implementer of the prototype

Tagline. Making Arkitect agents measurable and diagnosable via modular tool-calling tests and transparent logs.

Summary.
This project extends Arkitect from a “tool-using SDK” into one that can also evaluate tool use. I designed and implemented a prototype framework that registers heterogeneous tools, runs them through configurable test suites, validates outcomes with both rule-based and LLM-as-judge checks, and emits human-readable logs so that failures are explainable rather than opaque. It is intended as a research-style foundation for future tool-calling benchmarks in the Arkitect ecosystem.
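
The register / run / validate / log loop described above can be illustrated with a minimal sketch. Every name here (ToolCase, ToolEvalHarness, stub_judge) is hypothetical and chosen for illustration; none of it is Arkitect's API or the prototype's actual code, and the judge is a local stub standing in for a real LLM call.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional

Tool = Callable[..., Any]
Judge = Callable[[str, Any], bool]  # (criterion, tool output) -> verdict


@dataclass
class ToolCase:
    """One test case: which tool to call, with what arguments, and how to check it."""
    name: str
    tool: str
    args: Dict[str, Any]
    rule: Callable[[Any], bool]      # deterministic rule-based check
    criterion: Optional[str] = None  # if set, also checked by the judge


class ToolEvalHarness:
    """Hypothetical harness: registers tools, runs cases, records readable logs."""

    def __init__(self, judge: Judge) -> None:
        self.tools: Dict[str, Tool] = {}
        self.judge = judge
        self.log: List[str] = []

    def register(self, name: str, fn: Tool) -> None:
        self.tools[name] = fn

    def run(self, suite: List[ToolCase]) -> Dict[str, bool]:
        return {case.name: self._run_case(case) for case in suite}

    def _run_case(self, case: ToolCase) -> bool:
        fn = self.tools.get(case.tool)
        if fn is None:
            self.log.append(f"[FAIL] {case.name}: tool '{case.tool}' is not registered")
            return False
        try:
            output = fn(**case.args)
        except Exception as exc:  # record the failure instead of aborting the suite
            self.log.append(
                f"[FAIL] {case.name}: {case.tool} raised {type(exc).__name__}: {exc}"
            )
            return False
        if not case.rule(output):
            self.log.append(f"[FAIL] {case.name}: rule check rejected output {output!r}")
            return False
        if case.criterion and not self.judge(case.criterion, output):
            self.log.append(f"[FAIL] {case.name}: judge rejected '{case.criterion}'")
            return False
        self.log.append(f"[PASS] {case.name}: output {output!r}")
        return True


# Two toy tools and a stub judge (assumption: a real judge would query an LLM).
def add(a: float, b: float) -> float:
    return a + b


def divide(a: float, b: float) -> float:
    return a / b


def stub_judge(criterion: str, output: Any) -> bool:
    return isinstance(output, (int, float))


harness = ToolEvalHarness(judge=stub_judge)
harness.register("add", add)
harness.register("divide", divide)

suite = [
    ToolCase("add_basic", "add", {"a": 2, "b": 3},
             rule=lambda out: out == 5, criterion="The output is a number."),
    ToolCase("divide_by_zero", "divide", {"a": 1, "b": 0}, rule=lambda out: True),
    ToolCase("unknown_tool", "search", {"q": "x"}, rule=lambda out: True),
]

results = harness.run(suite)
for line in harness.log:
    print(line)
```

In this sketch each case produces exactly one log line naming the stage that failed (lookup, execution, rule check, or judge check), which is one simple way to keep failures explainable rather than opaque.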

Highlights.

Keywords. LLM agents, tool calling, evaluation, Arkitect, MCP, logging, observability.

Links. Built on Arkitect within Volcengine (internal codebase; no public repository).