Cloud bills get scrutinized to the cent, but AI coding tool spend rarely does: token-based pricing, overlapping subscriptions, and quick pilots hide the total cost until finance asks about it.
This tutorial closes that gap.
By the end, you'll have a CSV inventory with named owners and a defensible metric to bring to budget reviews.
What You’ll Build
A CSV report that details each AI software development tool used by teams, including ownership, monthly cost, usage level, and the amount of code surviving after sixty days. The first six columns come from a Python tool you run; the rest are added manually.
Success looks like being able to answer “what does AI cost the engineering org this month, and what are they getting for it?” at a glance.
What You’ll Learn
- How to run a Python tool that pulls cost, seat, and token data from major AI software development tool vendors.
- How to attribute spend to a single named owner per tool.
- How to compute a token-cost-per-merged-pull-request (PR) metric that ignores throwaway code.
- How to run the first monthly review so it produces decisions.
Time estimate: 1 to 2 hours. Most of that involves collecting admin API keys from vendors.
Difficulty: Intermediate. Assumes comfort with the command line, Python, and reading vendor admin docs.
Prerequisites
Required:
- Admin (or billing-admin) access to every AI vendor account the org pays for: Anthropic, OpenAI, GitHub Copilot, Cursor, AWS, Google.
- `git` and a Unix-like shell.
- `uv` (Python package manager). Install with `brew install uv` or per the uv install guide.
- About 15 minutes of the finance partner’s time.
Not required:
- A FinOps platform. The tool writes a CSV that opens in any spreadsheet.
- A custom MLOps stack. The tool calls vendor admin endpoints directly.
- Buy-in from leadership before starting. Numbers earn buy-in.
Setup
Step 1: Generate the cost data
The ai-software-tool-usage-report pulls monthly cost, seat, and token data from every major AI coding vendor and writes a CSV. Its README walks through cloning, installing with uv sync, configuring vendor-specific credentials (Anthropic and OpenAI admin keys, Copilot PAT, Cursor team key, Gemini ADC, AWS profile), and running the report. Follow it end-to-end before continuing.
The output is ai-tool-usage-YYYY-MM.csv for the current calendar month, with one row per vendor.
Checkpoint: the CSV exists. Configured vendors have a non-empty monthly_cost_usd; skipped vendors say so in the notes column.
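If you'd rather script that check than open the file, here is a minimal sketch; the vendor column name is an assumption, so match it to the headers the tool actually writes:

```python
# skipped_vendors.py -- list vendors with no cost data and the reason from the notes column.
# Usage: uv run python skipped_vendors.py ai-tool-usage-YYYY-MM.csv
import csv
import sys

with open(sys.argv[1], newline="") as f:
    for row in csv.DictReader(f):
        if not (row.get("monthly_cost_usd") or "").strip():
            # "vendor" is an assumed column name; "notes" comes from the tool's output.
            print(f"{row.get('vendor', '?')}: {row.get('notes', '')}")
```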
Tutorial Steps
Step 1: Review the CSV
Open ai-tool-usage-YYYY-MM.csv in a spreadsheet. The tool’s README explains each column. The key thing to look for is the seat-versus-usage delta on rows where pricing_model = seat: a 200-seat license with 60 active users is paying for 140 ghost seats, which makes the cancel-or-resize decision straightforward.
Checkpoint: The total of monthly_cost_usd matches finance’s number within five percent. A larger gap indicates missing vendors, often pilot accounts, marketplace add-ons, or personal API keys charged to a corporate card.
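To script the two checks in this step (total versus finance, ghost seats on seat-priced rows), here is a minimal sketch; the vendor and seat-count column names are assumptions, so match them to the headers in your file:

```python
# check_totals.py -- sum monthly_cost_usd and flag ghost seats on seat-priced rows.
# Usage: uv run python check_totals.py ai-tool-usage-YYYY-MM.csv 42000
# The second argument is the total from your finance partner, in USD.
import csv
import sys

csv_path, finance_total = sys.argv[1], float(sys.argv[2])

total = 0.0
with open(csv_path, newline="") as f:
    for row in csv.DictReader(f):
        cost = (row.get("monthly_cost_usd") or "").strip()
        if not cost:
            continue  # skipped vendor; the reason is in the notes column
        total += float(cost)

        if row.get("pricing_model") == "seat":
            seats = int(float(row.get("seats") or 0))        # assumed column name
            active = int(float(row.get("active_users") or 0))
            if seats > active:
                print(f"{row.get('vendor', '?')}: {seats - active} paid seats with no active user")

delta_pct = abs(total - finance_total) / finance_total * 100
print(f"CSV total ${total:,.2f} vs finance ${finance_total:,.2f} ({delta_pct:.1f}% apart)")
```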
Step 2: Assign one human owner per tool
The CSV does not include an owner column, so add one manually. Assign a single named human per row: no team alias, no Slack channel, no “TBD.” The owner updates the CSV monthly and reports the numbers at the review scheduled in Step 4.
If a tool has no plausible owner, that is the answer for that tool. Cancel it at the next renewal. Write CANCEL in the notes column and move on.
You should see: every row has a person’s name in Owner. Some rows have CANCEL in notes.
What just happened: AI spend balloons most often because no one owns a single line item. Adding a name to each row is the simplest, most effective change in this tutorial.
Checkpoint: Every row has an owner you could DM today to confirm last month’s spend.
Step 3: Build a token-cost-per-merged-PR baseline
Velocity numbers are gameable. The goal here is a metric tied to code that has proven to survive.
The formula:
$$ CPD = \frac{C}{N} $$
Where:
- $CPD$ = cost per durable PR
- $C$ = monthly tool cost (USD)
- $N$ = durable PR count (merged 30-60 days ago, excluding reverts)
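To make the formula concrete with illustrative numbers: a tool costing $3,000 a month against 60 durable PRs in the window works out to $50 per durable PR.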
The 30-to-60-day window matters: a PR that gets reverted within its first month was noise, not output, so it should not count.
Use gh to gather the merged PR list, then filter out reverts and follow-up patches. Run it from the repo root:
```sh
# Merged PRs from the window 30 to 60 days back, excluding obvious reverts.
# Note: date -v-60d is BSD/macOS syntax; on GNU/Linux use date -d '60 days ago' +%Y-%m-%d.
GH_DEBUG= gh pr list \
  --state merged \
  --search "merged:$(date -v-60d +%Y-%m-%d)..$(date -v-30d +%Y-%m-%d)" \
  --limit 1000 \
  --json number,title,mergeCommit \
  2>/dev/null \
  | jq '[.[] | select(.title | test("^Revert|^revert"; "i") | not)] | length'
```

Divide each tool’s monthly_cost_usd by that count and write the result in a new column called Cost per durable PR (USD).
You should see: a per-tool number you can defend to finance. It will be unflattering for at least one tool. That is the point.
What just happened: You built the metric the finance partner can actually read. The 60-day survival rule is what makes the number defensible: it counts code that survived, not code that shipped and then got reverted.
Checkpoint: The reasoning for excluding reverts can be explained in two sentences without jargon.
Step 4: Schedule the first monthly review
Create a recurring 30-minute meeting on the first Tuesday of every month. The agenda is one page:
- Each `Owner` reads their row’s `monthly_cost_usd`, `active_users`, and `Cost per durable PR`.
- Anything that moved more than 20 percent gets a one-sentence reason.
- Anything still marked `CANCEL` gets a cancellation date.
- New tools added that month get an owner before the meeting ends.
If a senior leader skips two months in a row, pause that tool’s spend at the next renewal. Write that rule into the meeting invite description so everyone can see it.
You should see: a recurring calendar event with the agenda in the body and every tool owner invited.
Checkpoint: The first meeting is on the calendar with a written agenda. Start this month, not next quarter.
Verification
Done when all of these are true:
- `uv run python index.py` writes a CSV with one row per vendor the org pays.
- Every row has an `Owner` name and a `Cost per durable PR` value.
- Sum of `monthly_cost_usd` matches finance’s number within five percent.
- The first monthly review is on the calendar with all owners invited.
Finish every item before sharing the dashboard. Half-built dashboards lose credibility fast.
A Starting Point, Not the Finish Line
A mature team would automate every step here: scheduled CSV pulls, owner attribution from an HRIS, durable-PR counts streamed from GitHub webhooks, alerts when Cost per durable PR crosses a threshold, and a Grafana board replacing the spreadsheet. That is the right destination.
This tutorial is the starting point. A manual CSV with named owners and a defendable metric is enough to win the first budget review, surface the loudest waste, and earn the political capital to automate the rest. Build the pipeline once the workflow has proven its worth.
Troubleshooting
For tool-side problems (a row reads SKIPPED, AWS Kiro reports zero, Gemini cost stays blank), see the tool README’s Caveats. The cases below are about the workflow, not the tool.
Problem: The token-cost-per-PR metric looks great, but the team feels slower
Symptoms: The metric drops month over month, but PR review time, defect escape rate, or staging incidents climb.
Solution: add a Code breakage rate column. Count PRs from the same 30-to-60-day window that triggered an incident, a hotfix, or a revert after 30 days, and divide by total durable PRs. A speed gain that ships defects to QA shifts costs downstream.
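In the same notation as Step 3, with $N_{broken}$ as the count of those incident-linked PRs:

$$ B = \frac{N_{broken}}{N} $$

where $B$ is the code breakage rate and $N$ is the durable PR count.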
If that does not work: cross-reference incident reports with Author and look for a correlation between AI-heavy PRs and incidents. Adding ai-suggested:high|low PR labels (via a pre-commit hook that scans for AI signatures) makes this much easier.
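A heuristic sketch of that labeling idea, assuming the “AI signature” you scan for is a Co-authored-by trailer naming the assistant; the patterns, script name, and base branch are assumptions, and because commit trailers are not visible to a literal pre-commit hook, this version is meant to run in CI or before opening the PR, with the printed label fed to gh pr edit --add-label:

```python
# label_ai_prs.py -- heuristic: does this branch look AI-heavy?
# Patterns and base branch are assumptions; adjust to what your tooling leaves behind.
import re
import subprocess
import sys

SIGNATURES = [
    r"co-authored-by:.*(copilot|claude|cursor|gemini)",  # assumed trailer patterns
]

def branch_commit_messages(base: str = "origin/main") -> str:
    # Commit messages on this branch that are not on the base branch.
    result = subprocess.run(
        ["git", "log", f"{base}..HEAD", "--format=%B"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def main() -> int:
    text = branch_commit_messages()
    hits = sum(len(re.findall(p, text, re.IGNORECASE)) for p in SIGNATURES)
    print("ai-suggested:high" if hits else "ai-suggested:low")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```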
Problem: Owners stop pulling their numbers after two months
Symptoms: The monthly review turns into one person reading the dashboard aloud while the owners stay quiet.
Solution: escalate at the third miss. The pattern usually breaks after one enforcement. Soft enforcement does not work; the calendar entry requires a written consequence.
If that does not work: move ownership of the row to whoever is willing to advocate for it. An unowned tool is a tool that can be canceled.
Problem: The cost numbers swing wildly week to week
Symptoms: Weekly spend doubles, then halves, then doubles again.
Solution: switch to a rolling 4-week mean. Vendor pricing changes, model upgrades, and surge usage make any single week noisy. A rolling window smooths noise and preserves real trends.
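A minimal sketch of that rolling window, with placeholder weekly totals standing in for your own series:

```python
# rolling_spend.py -- 4-week rolling mean over weekly spend totals.
from collections import deque

def rolling_mean(values, window=4):
    buf, out = deque(maxlen=window), []
    for v in values:
        buf.append(v)
        out.append(sum(buf) / len(buf))
    return out

weekly_spend = [3200.0, 6100.0, 2900.0, 6400.0, 3100.0]  # placeholder numbers
print([round(x, 2) for x in rolling_mean(weekly_spend)])
```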
If that does not work: pin model versions in the tool config so silent model upgrades stay out of the cost comparison.
Next Steps
To learn more:
- Read the FinOps Foundation’s framework for cloud financial management. The same patterns apply to AI tool spend, with shorter feedback loops.
- Read the post on this blog about measuring software ROI, which frames the durable-output question for non-AI work.
To extend this project:
- Add a vendor adapter to the tool; its README’s Development section walks through the `VendorAdapter` pattern.
- Add an LLM gateway with semantic caching in front of the chattiest tool. A 30-60% token reduction for duplicate questions is a reasonable target.
- Add `ai-suggested` PR labels via a pre-commit hook that scans for AI-authored hunks, and recompute `Cost per durable PR` split by AI-heavy versus AI-light PRs.
- Add this interview question to the staff and principal engineer hiring loop: “How would you measure whether an AI coding tool is paying for itself on your team?” Candidates who think in cost-per-outcome are rare and worth the seat.
References
- AI Software Tool Usage Report, the Python tool used in this tutorial. Its README covers vendor credentials, the run command, and CSV column reference.
- Measuring software ROI, a related post on this blog that frames the durable-output question for non-AI work.
- FinOps Framework, the FinOps Foundation’s reference model for cloud financial management. The principles port cleanly to AI tool spend.
- GitHub CLI: pr list, the command used in Step 3 to count merged PRs over a date window.
- uv install guide, the package manager used in this tutorial.
