How to Automate Agent Trajectory Analysis with GitHub Copilot: A Step-by-Step Guide

Introduction

If you've ever spent hours sifting through hundreds of JSON files to understand how an AI coding agent behaved during benchmark testing, you know it's tedious work. Multiply that by dozens of tasks and repeated runs, and you're facing hundreds of thousands of lines of data. As an AI researcher, I automated this exact process using GitHub Copilot—and so can you. This guide walks you through creating your own analysis agents, from recognizing the repetitive patterns to building and sharing a tool that saves your whole team time. By the end, you'll have a custom system that lets you focus on insights instead of manual data crunching.

Source: github.blog

What You Need

GitHub Copilot (active subscription and IDE integration, e.g., VS Code)
Access to agent trajectory data – typically JSON files from benchmark runs (e.g., TerminalBench2, SWEBench-Pro)
Familiarity with JSON and basic Python or JavaScript – Copilot will generate your analysis code in one of these languages
A code editor or IDE that supports Copilot chat and inline suggestions
A shared repository (e.g., GitHub repo) to store your agents and collaborate
Command-line / terminal experience for running scripts and tests

Step-by-Step Instructions

Step 1: Identify Repetitive Analysis Patterns

Before writing code, take a close look at the trajectory files. The same questions tend to come up on every run: Which steps failed most often? How many times did the agent retry after an error? What was the average time per task? Open a sample trajectory in your IDE and ask Copilot Chat to summarize it. Then, in a scratch script, type a comment such as // Extract all error messages from this trajectory (or the # equivalent in Python) and let Copilot suggest the code. Keep notes on your own analysis over a few runs and list the queries you repeat; this list becomes your feature roadmap.
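
As a rough illustration of what such a suggestion might look like, here is a minimal Python sketch for the "extract all error messages" question. The trajectory layout is an assumption, not something any particular benchmark guarantees: it supposes a top-level "steps" list whose failed steps carry a non-empty "error" string. The function name extract_errors is likewise hypothetical.

```python
import json
from typing import Any

# ASSUMPTION: a trajectory is a JSON object with a "steps" list, and a failed
# step carries a non-empty "error" string. Adjust the keys to your own schema.
def extract_errors(trajectory: dict[str, Any]) -> list[str]:
    """Return the error message of every step that reported one, in order."""
    errors = []
    for step in trajectory.get("steps", []):
        message = step.get("error")
        if message:  # skips missing, null, and empty-string errors
            errors.append(str(message))
    return errors

# A tiny hand-written sample standing in for a real trajectory file.
sample = json.loads("""
{"task": "demo", "steps": [
  {"action": "run_tests", "error": "ImportError: no module named foo"},
  {"action": "edit_file"},
  {"action": "run_tests", "error": null}
]}
""")

print(extract_errors(sample))
```

Each question on your list can become one small function of this shape, which makes the later steps largely a matter of assembly.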

Step 2: Use GitHub Copilot to Explore Trajectories

Now, write a quick exploratory script using Copilot's inline suggestions. Start with a blank Python or JavaScript file, describe your intent in comments, and accept Copilot's code completions. For instance, a comment like # Load all JSON files from the 'trajectories' folder should generate the file reading logic. Then, ask Copilot to count error types, extract action sequences, or visualize time distributions. Use Copilot Chat to refine the code without leaving your editor. By iterating this way, you turn your manual inspection into reproducible code snippets.
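
The exploratory script described above could end up looking something like the following sketch. It assumes the same hypothetical schema (a "steps" list with optional "error" strings) and treats the text before the first colon of an error message as its type, which is a simplification rather than a rule your data necessarily follows. The demo writes throwaway files to a temporary directory so it runs without any real benchmark data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path

def load_trajectories(folder):
    """Load every *.json file under `folder`, skipping files that fail to parse."""
    trajectories = []
    for path in sorted(Path(folder).rglob("*.json")):
        try:
            data = json.loads(path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            print(f"skipping unparseable file: {path.name}")
            continue
        if isinstance(data, dict):  # ASSUMPTION: one JSON object per trajectory
            trajectories.append(data)
    return trajectories

def count_error_types(trajectories):
    """Count errors by the text before the first colon, e.g. 'TimeoutError'."""
    counts = Counter()
    for trajectory in trajectories:
        for step in trajectory.get("steps", []):
            message = step.get("error")
            if message:
                counts[str(message).split(":", 1)[0].strip()] += 1
    return counts

# Demo on throwaway files so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "run1.json").write_text(json.dumps(
        {"steps": [{"error": "TimeoutError: command exceeded 60s"}, {"action": "edit"}]}))
    Path(tmp, "run2.json").write_text(json.dumps(
        {"steps": [{"error": "TimeoutError: command exceeded 60s"},
                   {"error": "SyntaxError: invalid syntax"}]}))
    Path(tmp, "broken.json").write_text("{not json")
    demo_counts = count_error_types(load_trajectories(tmp))

for error_type, n in demo_counts.most_common():
    print(f"{error_type}: {n}")
```

To point it at real data, replace the temporary directory with your own folder, for example load_trajectories("trajectories").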

Step 3: Build a Custom Analysis Agent

Once you have a collection of useful scripts, combine them into a single program—your analysis agent. Structure the agent as a command-line tool that takes a benchmark folder as input and outputs a summary report. Write a function for each pattern you identified in Step 1. Use Copilot to generate a main dispatcher that runs all analyses. For example, a comment like // Orchestrate all analysis modules and print results will get you started. Test the agent on a small subset of trajectories to ensure each module works correctly. Name your project something memorable, like eval-agents, and push it to a GitHub repository.
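
One possible shape for such a dispatcher is sketched below, using Python's standard argparse module. The two analysis functions, the ANALYSES table, and the trajectory schema ("steps" with optional "error") are illustrative assumptions; your own modules from Step 1 would take their place.

```python
import argparse
import json
from pathlib import Path

# ASSUMPTION: each trajectory is a dict with a "steps" list; failed steps have "error".
def count_failed_steps(trajectories):
    """Total number of steps that reported an error."""
    return sum(1 for t in trajectories for s in t.get("steps", []) if s.get("error"))

def average_steps_per_task(trajectories):
    """Mean number of steps per trajectory (0.0 when there are none)."""
    if not trajectories:
        return 0.0
    return sum(len(t.get("steps", [])) for t in trajectories) / len(trajectories)

# Hypothetical registry: one entry per repeated question from Step 1.
ANALYSES = {
    "failed_steps": count_failed_steps,
    "avg_steps_per_task": average_steps_per_task,
}

def run_all(trajectories):
    """Run every registered analysis and collect the results by name."""
    return {name: analysis(trajectories) for name, analysis in ANALYSES.items()}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Summarize agent trajectories.")
    parser.add_argument("folder", type=Path, help="folder containing trajectory JSON files")
    args = parser.parse_args(argv)
    trajectories = [
        json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(args.folder.rglob("*.json"))
    ]
    report = run_all(trajectories)
    print(json.dumps(report, indent=2))
    return report
```

In a real script you would call main() under an if __name__ == "__main__": guard and run it as, say, python analyze.py trajectories/ (the file name is made up). Keeping run_all separate from file loading makes it easy to test each module on in-memory data.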

Step 4: Automate and Share with Your Team

To maximize impact, make your agent easy for colleagues to use and extend. Add a README.md with setup instructions and usage examples. Use GitHub Actions to run the agent automatically on each new benchmark submission. Encourage teammates to submit pull requests with new analysis modules. Document the architecture so others can author their own agents without reading your entire codebase. As contributions grow, you shift from being the sole engineer to a facilitator—your team becomes self-sufficient in analyzing agent performance.
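
If the agent runs inside GitHub Actions, one convenient way to surface its report is the job summary: Actions exposes a file path in the GITHUB_STEP_SUMMARY environment variable, and Markdown appended to that file is shown on the run's summary page. The sketch below is one way this might be wired up; the report dictionary and function names are hypothetical, and the workflow file that invokes the script is not shown.

```python
import os
import sys

def render_markdown(report):
    """Render a {metric: value} report as a small Markdown table."""
    lines = ["## Trajectory analysis", "", "| Metric | Value |", "| --- | --- |"]
    for name, value in report.items():
        lines.append(f"| {name} | {value} |")
    return "\n".join(lines) + "\n"

def publish(report, env=os.environ):
    """Append the report to the Actions job summary, or print it when run locally."""
    markdown = render_markdown(report)
    summary_path = env.get("GITHUB_STEP_SUMMARY")
    if summary_path:
        # Append rather than overwrite: other steps may have written to it already.
        with open(summary_path, "a", encoding="utf-8") as handle:
            handle.write(markdown)
    else:
        sys.stdout.write(markdown)
    return markdown

# Local demo with made-up numbers; an empty env forces the stdout path.
publish({"failed_steps": 3, "avg_steps_per_task": 12.5}, env={})
```

Because the function falls back to standard output, teammates can run the same script on their own machines and see the same table.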

Tips for Success

Keep it modular – Design each analysis as an independent plugin so teammates can add new insights without breaking existing ones.
Test on sample data – Always run your agent on a small, known dataset first; catching bugs early saves hours of confusion.
Leverage Copilot's chat for documentation – Ask Copilot to generate docstrings and markdown summaries to maintain clarity.
Foster collaboration – Create a lightweight template for new agent modules and share it in your repository's wiki.
Iterate based on real use – Pay attention to which analysis questions your colleagues ask most; those are prime candidates for new features.
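
The "keep it modular" and "test on sample data" tips can be combined in a small template like the one below. It is only a sketch: the analysis decorator, the ANALYSES registry, and the retry heuristic (a step that repeats the previous step's action immediately after that step failed) are assumptions made for illustration, as is the "steps" / "action" / "error" schema.

```python
from typing import Callable

# Hypothetical plugin registry: analysis name -> function over a list of trajectories.
ANALYSES: dict[str, Callable[[list[dict]], object]] = {}

def analysis(name):
    """Decorator that registers a function as a named analysis module."""
    def register(func):
        if name in ANALYSES:
            raise ValueError(f"analysis {name!r} is already registered")
        ANALYSES[name] = func
        return func
    return register

# --- Template for a new module: copy, rename, and replace the body. ---
@analysis("retry_count")
def retry_count(trajectories):
    """Count steps that repeat the previous step's action right after it failed."""
    retries = 0
    for trajectory in trajectories:
        steps = trajectory.get("steps", [])
        for previous, current in zip(steps, steps[1:]):
            action = previous.get("action")
            if previous.get("error") and action is not None and current.get("action") == action:
                retries += 1
    return retries

# --- A small, known dataset with a hand-checked answer. ---
KNOWN_TRAJECTORIES = [{"steps": [
    {"action": "run_tests", "error": "AssertionError: 1 != 2"},
    {"action": "run_tests"},   # same action right after a failure: one retry
    {"action": "edit_file"},
]}]

assert ANALYSES["retry_count"](KNOWN_TRAJECTORIES) == 1
print(sorted(ANALYSES))
```

Registering by name means a new module only needs to be imported to become available, and the duplicate-name check catches two contributors accidentally choosing the same label.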

You've now automated your intellectual toil. Your next job is to maintain the system, but that maintenance becomes a creative challenge rather than repetitive drudgery. Welcome to agent-driven development.