Farkesli
2026-05-04
Programming

Automating Intellectual Toil: How Agent-Driven Development Transformed Copilot Applied Science

An AI researcher automated analysis of coding agent trajectories using GitHub Copilot, creating eval-agents to eliminate intellectual toil and empower team collaboration.

In software engineering, the urge to automate repetitive tasks often leads to building new systems—and sometimes, to completely redefining one's role. This is exactly what an AI researcher on the Copilot Applied Science team experienced. By creating the eval-agents tool, they eliminated the tedious analysis of coding agent trajectories and empowered their peers to do the same. Below, we dive into the key questions about this transformation.

What sparked the creation of the eval-agents tool?

The researcher's daily work involved analyzing coding agent performance on benchmarks like TerminalBench2 and SWEBench-Pro. Each task in these benchmarks produces a trajectory: a JSON file detailing the agent's thoughts and actions. With dozens of tasks per benchmark and multiple runs each day, the total data often exceeded hundreds of thousands of lines. Initially, the researcher used GitHub Copilot to surface patterns, reducing the reading load to just a few hundred lines. However, the repetitive loop of "use Copilot, spot patterns, investigate" grew tiresome. The idea of automating that intellectual toil itself led to the birth of eval-agents.

Automating Intellectual Toil: How Agent-Driven Development Transformed Copilot Applied Science
Source: github.blog

What were the main design goals for the eval-agents project?

The guiding principle was that engineering and science teams work better together. The implementation strategy had three specific goals:

  • Make agents easy to share and use – leveraging GitHub's collaborative DNA to ensure low friction.
  • Make it easy to author new agents – so team members could create custom solutions quickly.
  • Make coding agents the primary vehicle for contributions – encouraging everyone to build and share agent-based solutions.

These goals built on the researcher's experience as an OSS maintainer on the GitHub CLI, which emphasized shareability and simplicity.

How did the researcher use GitHub Copilot before automating?

Before eval-agents, the researcher relied on GitHub Copilot to analyze benchmark trajectories. Copilot was used to surface patterns in the data, allowing the researcher to identify areas of interest without reading every line. This reduced the workload from hundreds of thousands of lines to only a few hundred. The process was iterative: run Copilot analysis, investigate the patterns, adjust, repeat. While effective, this manual loop was still repetitive and consumed time that could be spent on deeper creative work, motivating the push toward full automation.
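The kind of repetitive pass the researcher was making by hand can be sketched as a small script. This is purely illustrative: the directory layout, file schema, and the failure patterns searched for are all assumptions, since the article does not publish the real benchmark formats.

```python
import json
from collections import Counter
from pathlib import Path

# Hypothetical failure patterns a reviewer might grep for across runs.
PATTERNS = ["timeout", "test failure", "import error"]

def scan_run(run_dir: str) -> Counter:
    """Tally occurrences of known failure patterns across all trajectory
    files in one benchmark run, so a human skims a summary instead of
    reading hundreds of thousands of lines."""
    hits = Counter()
    for path in Path(run_dir).glob("*.json"):
        # Flatten the trajectory to searchable lowercase text.
        text = json.dumps(json.loads(path.read_text())).lower()
        for pattern in PATTERNS:
            if pattern in text:
                hits[pattern] += 1
    return hits
```

Automating exactly this sort of "scan, tally, investigate" pass, and making the scripts shareable, is what eval-agents set out to do.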

What is a 'trajectory' in coding agent evaluation?

A trajectory is a detailed record of a coding agent's actions and thought processes while solving a task from an evaluation benchmark. Each task in datasets like TerminalBench2 generates its own trajectory file, typically JSON running to hundreds of lines. These files capture every step: decisions, code modifications, and reasoning. Analyzing trajectories helps researchers understand agent behavior, identify failure modes, and measure performance. However, with many tasks and multiple runs, the total trajectory data can be enormous, often hundreds of thousands of lines.
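To make the idea concrete, here is a minimal sketch of what such a file might contain and how one could be summarized. The schema below is invented for illustration; the real TerminalBench2 and SWEBench-Pro trajectory formats are not described in the article.

```python
import json

# Hypothetical trajectory: interleaved "thought" and "action" steps.
trajectory_json = """
{
  "task_id": "example-task-001",
  "steps": [
    {"type": "thought", "content": "Inspect the failing test first."},
    {"type": "action",  "tool": "shell", "content": "pytest tests/ -x"},
    {"type": "thought", "content": "The import path is wrong; patch it."},
    {"type": "action",  "tool": "edit", "content": "fix import in app/main.py"}
  ]
}
"""

def summarize(trajectory: dict) -> dict:
    """Count step types so a reviewer can skim instead of reading every line."""
    counts = {}
    for step in trajectory["steps"]:
        counts[step["type"]] = counts.get(step["type"], 0) + 1
    return counts

trajectory = json.loads(trajectory_json)
print(summarize(trajectory))  # {'thought': 2, 'action': 2}
```

Even this toy summary shows why automation pays off: real trajectories have hundreds of such steps per task, multiplied across dozens of tasks and daily runs.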


How did the researcher's role change after automating the analysis?

The researcher jokingly said, “I may have just automated myself into a completely different job.” By building eval-agents, they replaced their own intellectual toil with automated processes. Instead of manually analyzing trajectories, they now maintain and improve the tool so that peers on the Copilot Applied Science team can also automate their analyses. The researcher shifted from being a primary analyst to a tool builder and enabler, helping others work faster and more creatively. This reflects a common pattern where automation changes job responsibilities toward more strategic roles.

What lessons did the researcher learn about collaboration with GitHub Copilot?

Through this project, the researcher discovered key lessons:

  • Copilot is not just a code generator—it's a pattern recognizer that can surface insights from large datasets.
  • Building shareable agents that are easy to use encourages team-wide adoption.
  • Making it simple to author new agents empowers every team member to solve their own unique problems.

The development loop became incredibly fast: with eval-agents, the team could iterate on analysis scripts quickly, and new agents could be created in minutes. This collaborative approach unlocked innovation and reduced repetitive work for the whole team.