Agent Evals: Task completion rate, trajectory evaluation, GAIA, SWE-bench
