How to Automate Agent Trajectory Analysis with GitHub Copilot
Introduction
Software engineers often automate repetitive tasks to focus on creative work. As an AI researcher, I automated my intellectual toil using GitHub Copilot, creating a tool called eval-agents that streamlines analysis of coding agent trajectories. This guide walks you through building your own automated analysis workflow, enabling you and your team to collaborate efficiently on evaluating agent performance.

What You Need
- GitHub Copilot (installed in your code editor, e.g., VS Code)
- Evaluation benchmark datasets (e.g., TerminalBench2 or SWEBench-Pro) with agent trajectories in JSON format
- Basic understanding of Python or JavaScript for scripting agents
- Git repository for version control and sharing
- Time and patience for iterative development
Step-by-Step Guide
Step 1: Identify Repetitive Analysis Patterns
Start by examining your current workflow. When you review agent trajectories, note the recurring steps: opening JSON files, searching for specific actions, comparing success/failure outcomes, and summarizing patterns. Document these steps – they form the core of the automation.
For example, you might repeatedly look for the number of steps an agent took, the tools it called (e.g., code search, file read), or the final answer format. List these on a simple checklist.
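To make the later code sketches concrete, it helps to pin down an assumed trajectory shape. The record below is purely illustrative: field names like task_id, steps, and tool are placeholders, and your benchmark's JSON will almost certainly differ.

```python
# A minimal trajectory record assumed by the examples in this guide.
# Real benchmark formats (e.g., TerminalBench2, SWEBench-Pro) use
# different field names and nesting; treat this as a placeholder schema.
example_trajectory = {
    "task_id": "demo-001",
    "success": True,
    "steps": [
        {"action": "tool_call", "tool": "file_read", "input": "src/main.py"},
        {"action": "tool_call", "tool": "code_search", "input": "def parse"},
        {"action": "final_answer", "content": "Fixed the off-by-one error."},
    ],
}
```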
Step 2: Use GitHub Copilot to Surface Initial Patterns
Before building a full agent, use GitHub Copilot interactively to analyze a few trajectories. In your editor, open one JSON file and type comments describing what you want to extract. Copilot will suggest code snippets for parsing and summarizing. Accept and refine these suggestions. This gives you a feel for the data structure and helps you design the automation.
Tip: Use the Copilot Chat feature to ask, “How do I count the number of times an agent calls a function?” and integrate the answer.
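For instance, asking that question (or writing it as a comment and letting Copilot complete the code) might produce a snippet along these lines. This is a sketch assuming the placeholder schema from Step 1; adjust the keys to match your benchmark's actual JSON layout.

```python
import json
from collections import Counter

# Count how often the agent calls each tool in a single trajectory.
# Assumes the placeholder schema from Step 1 (a "steps" list whose
# tool-call entries carry "action" and "tool" fields).
with open("trajectory.json") as f:
    trajectory = json.load(f)

tool_counts = Counter(
    step["tool"]
    for step in trajectory["steps"]
    if step.get("action") == "tool_call"
)
print(f"{len(trajectory['steps'])} steps, tool usage: {dict(tool_counts)}")
```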
Step 3: Design a Shareable Agent Framework
Your goal is to create agents that are easy to share and extend. Define a simple interface: each agent takes a trajectory (or a set of trajectories) as input, performs analysis, and returns a structured output (e.g., a report in Markdown or a CSV file). Store the agents in a shared GitHub repository, and use clear naming conventions.
For example, create a BaseAgent class in base_agent.py with methods like analyze(self, trajectory) and report(self). Document the interface in a README file and with inline comments.
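A minimal sketch of that interface might look like the following. Only analyze and report come from the description above; the class name, the results list, and the Markdown rendering are illustrative assumptions, not a prescribed API.

```python
# base_agent.py - a minimal interface sketch for shareable analysis agents.
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Base class for trajectory-analysis agents."""

    def __init__(self):
        self.results = []  # accumulated per-trajectory findings

    @abstractmethod
    def analyze(self, trajectory: dict) -> dict:
        """Analyze one trajectory and return a structured finding."""

    def report(self) -> str:
        """Render the accumulated findings as a Markdown report."""
        lines = [f"# {type(self).__name__} report", ""]
        lines += [f"- {finding}" for finding in self.results]
        return "\n".join(lines)
```

Keeping the interface this small lowers the barrier for colleagues to contribute: a new agent only has to implement analyze.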
Step 4: Implement Your First Agent for Automated Analysis
Now code the agent that automates your most common analysis tasks. Using Copilot to speed up development, write functions to:
- Load multiple JSON trajectories from a folder.
- Extract key metrics: number of steps, success rate, tool usage frequency.
- Generate a visual or textual summary.
Leverage Copilot’s suggestions to handle edge cases, like missing fields or large files. Test the agent on a small set of trajectories first, then scale up.
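Putting those pieces together, a first agent might look like the sketch below. It builds on the BaseAgent sketch from Step 3 and the placeholder schema from Step 1; the metric names and folder layout are assumptions.

```python
# metrics_agent.py - an illustrative first agent built on the BaseAgent
# sketch from Step 3; field names follow the assumed schema from Step 1.
import json
from collections import Counter
from pathlib import Path

from base_agent import BaseAgent


class MetricsAgent(BaseAgent):
    def analyze(self, trajectory: dict) -> dict:
        steps = trajectory.get("steps", [])  # tolerate missing fields
        finding = {
            "task_id": trajectory.get("task_id", "unknown"),
            "num_steps": len(steps),
            "success": trajectory.get("success", False),
            "tool_counts": Counter(
                s["tool"] for s in steps if s.get("action") == "tool_call"
            ),
        }
        self.results.append(finding)
        return finding


def run_on_folder(folder: str) -> str:
    """Load every JSON trajectory in a folder and return a report."""
    agent = MetricsAgent()
    for path in sorted(Path(folder).glob("*.json")):
        with open(path) as f:
            agent.analyze(json.load(f))
    successes = sum(r["success"] for r in agent.results)
    print(f"Success rate: {successes}/{len(agent.results)}")
    return agent.report()


if __name__ == "__main__":
    print(run_on_folder("trajectories/"))
```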

Step 5: Test and Refine with Team Collaboration
Share your agent with colleagues. Let them test it on their own evaluation runs. Gather feedback: what patterns are missing? Is the output easy to understand? Use GitHub Issues to track improvements. Iterate by adding new analysis functions or improving performance. This collaborative loop ensures the tool meets real needs.
Encourage team members to contribute new agents by following your framework. This spreads the automation load and fosters a culture of shared productivity.
Step 6: Scale and Maintain
As your toolkit grows, maintain it by:
- Writing automated tests for each agent (a minimal pytest sketch follows this list).
- Setting up a CI/CD pipeline to run agents on new benchmark data.
- Documenting usage in a team wiki.
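For the tests, a hand-written micro-trajectory is usually enough to pin down an agent's behavior. A minimal pytest sketch, assuming the illustrative MetricsAgent from Step 4:

```python
# test_metrics_agent.py - a minimal pytest sketch for the illustrative
# MetricsAgent from Step 4; the fixture trajectory is hand-written.
from metrics_agent import MetricsAgent


def test_counts_steps_and_tools():
    agent = MetricsAgent()
    finding = agent.analyze({
        "task_id": "t1",
        "success": True,
        "steps": [
            {"action": "tool_call", "tool": "file_read", "input": "a.py"},
            {"action": "final_answer", "content": "done"},
        ],
    })
    assert finding["num_steps"] == 2
    assert finding["tool_counts"] == {"file_read": 1}
    assert finding["success"] is True
```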
Consider creating a dashboard that aggregates results from multiple agents. This turns your manual analysis into a continuous, scalable process.
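A lightweight way to start such a dashboard is a script that runs every registered agent over the same folder of trajectories and stitches their reports into one page. The sketch below assumes the example agents from the earlier steps.

```python
# dashboard.py - an illustrative aggregator that runs several agents over
# the same trajectories and concatenates their Markdown reports.
import json
from pathlib import Path

from metrics_agent import MetricsAgent

AGENTS = [MetricsAgent]  # extend with your team's agent classes


def build_dashboard(folder: str) -> str:
    trajectories = [
        json.loads(p.read_text()) for p in sorted(Path(folder).glob("*.json"))
    ]
    sections = []
    for agent_cls in AGENTS:
        agent = agent_cls()
        for trajectory in trajectories:
            agent.analyze(trajectory)
        sections.append(agent.report())
    return "\n\n".join(sections)


if __name__ == "__main__":
    Path("dashboard.md").write_text(build_dashboard("trajectories/"))
```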
Tips for Success
- Start small: Automate just one recurring task before expanding.
- Use Copilot as a pair programmer: it accelerates coding and often surfaces patterns you would have missed.
- Embrace sharing: A shared repository with clear documentation multiplies the tool’s value.
- Iterate based on real use: Your first agent won’t be perfect – team feedback is gold.
- Don’t over-automate: Keep the human in the loop for interpreting nuanced results.
By following these steps, you can transition from manually analyzing hundreds of thousands of lines of trajectory data to having a team of automated agents that reveal insights in minutes. This transforms your role from a data analyst to a tool builder and enabler – a change that benefits everyone.