## Introduction
I write extensively on [software fundamentals][fundamentals-series], covering architecture, security, reliability, testing, performance, accessibility, and process. The articles are lengthy, detailed, and checklist-rich.
I kept running into the same problem: knowing the fundamentals and applying them in code review are two different things. I'd review a PR, catch a missing timeout or an inefficient query, and completely forget to check accessibility or circuit breakers.
Human memory is unreliable. Checklists help, but reviewing 200+ items manually isn't practical.
I built skills that encode fundamentals into repeatable, automated fitness reviews. This article explains what they are, why they work, and how they fit together.
> Type: **Explanation** (understanding-oriented).
> Primary audience: **intermediate** developers who want to understand the design behind fitness review skills.
## What Are Fitness Review Skills?
A fitness review skill instructs an AI code assistant (like Claude Code or Cursor) on how to evaluate a codebase for a particular quality, outlining a workflow, scoring rubric, and output format.
Think of it as giving a detailed checklist to a meticulous reviewer who never forgets an item or gets tired.
Each skill assigns a 1-10 score to each dimension it evaluates, backed by `file:line` evidence from the code. Without evidence, a score is just an opinion; with a reference, it becomes actionable.
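For example, a single finding in a report might look like this (a hypothetical entry; the exact headings vary by skill):

```markdown
### Coupling: 4/10

- **Finding:** Circular dependency between the orders and payments modules.
- **Evidence:** `orders/service.py:15` imports the payments client, which imports back at `payments/client.py:8`.
- **Action:** Extract the shared types into a module both packages can depend on.
```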
The skills cover nine domains aligned with articles in the [fundamentals series][fundamentals-series]:

* Three focus on code structure: **architecture** (coupling, cohesion, layering, naming, API design), **security** (input validation, auth, data protection, cryptography), and **performance** (efficiency, caching, scalability).
* Two focus on correctness: **algorithms** (algorithm choice, data structure selection, concurrency safety, edge cases) and **data** (schema design, migration safety, data integrity, query correctness).
* Three address operational quality: **reliability** (observability, timeouts, CI/CD, incident readiness), **testing** (test pyramid, quality, coverage), and **accessibility** (semantic HTML, keyboard navigation, color contrast, screen readers).
* One pertains to the team: **process** (documentation, workflow, dependency management, organization).
Two skills connect everything: a **full review** orchestrator that runs all nine domain reviews in parallel and merges them into a unified report, and a **JIT test generator** that creates [catching tests][jit-test-gen] for changed code.
## Why Fitness Scores Instead of Pass/Fail
Early reviews yielded binary results: "This is secure" or "This has vulnerabilities." That framing turned out to be unhelpful for three reasons.
First, every codebase has issues. A pass/fail gate that always fails trains people to ignore it. A 1-10 score shows where you stand and what to improve next.
Second, different projects need different standards. A prototype might accept a 4/10 on process maturity; a payment system cannot accept a 4/10 on security. Scores let teams set their own thresholds per domain.
Third, scores track progress over time: monthly reviews show whether architecture is improving or degrading. A trend line is more useful than a snapshot.
The scoring rubric defines what good (8-10) and bad (1-3) look like for each dimension, which prevents score inflation and keeps scores comparable across reviews.
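As an illustration, a rubric for one dimension might anchor its scores like this (hypothetical wording; each skill defines its own anchors):

```markdown
## Coupling

- **8-10:** Modules depend only on stable interfaces; no circular imports; changes stay local.
- **4-7:** Some cross-module reach-ins; an occasional circular dependency; changes ripple to a few neighbors.
- **1-3:** Circular dependencies are widespread; most changes require edits across many modules.
```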
## How the Skills Work Together
Each domain skill follows the same four-step pattern (a small code sketch follows the list):
1. **Map the scope.** Find entry points, modules, configs, and test files related to that domain.
2. **Analyze using the checklist.** Walk through patterns from the fundamentals articles.
3. **Score each dimension.** Assign 1-10 scores with evidence from `file:line`.
4. **Produce the report.** Write findings, scores, and prioritized actions to a markdown file.
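To make steps 3 and 4 concrete, here is a minimal sketch of what a finding carries and how findings could become a report. This is illustrative Python, not the skill's actual internals; the skills themselves are prompts that drive an AI assistant, and the `Finding` structure here is an assumption.

```python
from dataclasses import dataclass

@dataclass
class Finding:
    dimension: str   # e.g. "coupling" or "timeout hygiene"
    score: int       # 1-10, anchored by the rubric
    evidence: str    # mandatory file:line reference
    action: str      # concrete, prioritized next step

def render_report(domain: str, findings: list[Finding]) -> str:
    """Step 4: produce a markdown report for one domain, worst scores first."""
    lines = [f"# {domain.title()} review"]
    for f in sorted(findings, key=lambda f: f.score):
        lines += [f"## {f.dimension}: {f.score}/10",
                  f"- Evidence: `{f.evidence}`",
                  f"- Action: {f.action}"]
    return "\n".join(lines)

print(render_report("architecture", [
    Finding("coupling", 4,
            "orders/service.py:15 and payments/client.py:8",
            "Break the circular dependency by extracting shared types."),
]))
```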
The full review orchestrator acts like an annual check-up for your codebase, sending code to nine specialists and synthesizing their findings into a report with a weighted score. Architecture and security carry the highest weights (15% each) because structural and security problems are the most expensive to fix later. The remaining domains each carry 10%.
```mermaid
graph TB
F[review-full] --> A[review-architecture]
F --> S[review-security]
F --> R[review-reliability]
F --> T[review-testing]
F --> P[review-performance]
F --> AL[review-algorithms]
F --> D[review-data]
F --> AC[review-accessibility]
F --> PR[review-process]
A --> Report[Unified Report]
S --> Report
R --> Report
T --> Report
P --> Report
AL --> Report
D --> Report
AC --> Report
PR --> Report
```
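The weighted overall score is simple arithmetic. A minimal sketch, assuming per-domain scores are already available (the real aggregation happens inside the orchestrator skill, not in Python):

```python
# Domain weights from the orchestrator: architecture and security at 15% each,
# the remaining seven domains at 10% each (sums to 100%).
WEIGHTS = {
    "architecture": 0.15, "security": 0.15,
    "reliability": 0.10, "testing": 0.10, "performance": 0.10,
    "algorithms": 0.10, "data": 0.10, "accessibility": 0.10, "process": 0.10,
}

def overall_score(domain_scores: dict[str, float]) -> float:
    """Weighted average of the nine per-domain scores (each 1-10)."""
    return sum(WEIGHTS[d] * s for d, s in domain_scores.items())

# Example: strong security, weak testing and process.
print(round(overall_score({
    "architecture": 7, "security": 9, "reliability": 6, "testing": 4,
    "performance": 7, "algorithms": 8, "data": 7, "accessibility": 5,
    "process": 4,
}), 2))  # 6.5
```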
The JIT test generator reads pending git changes, identifies untested code paths, and generates targeted tests to catch regressions. It prioritizes external API boundaries, state mutations, error handling, and complex conditionals.
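As a rough sketch of the first step only, here is one way to list pending changed files that have no obvious test counterpart. It assumes a Python project with a conventional `tests/test_<module>.py` layout; the actual skill works from the diff contents, not just file names:

```python
import subprocess
from pathlib import Path

def changed_files_without_tests(repo_root: str = ".") -> list[Path]:
    """Pending changed Python files that lack a matching tests/test_<name>.py."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", "HEAD"],
        capture_output=True, text=True, check=True, cwd=repo_root,
    )
    missing = []
    for name in diff.stdout.splitlines():
        path = Path(name)
        if path.suffix != ".py" or path.parts[0] == "tests":
            continue  # not Python, or the change is itself a test
        if not Path(repo_root, "tests", f"test_{path.stem}.py").exists():
            missing.append(path)
    return missing

print(changed_files_without_tests())
```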
## Why Evidence Matters More Than Opinions
Every review finding must include a `file:line` reference. That single requirement is the most impactful design choice in the skills.
When a review says "coupling is too high," the natural response is "where?" or "says who?" When a review says "coupling is 4/10, circular dependency between `orders/service.py:15` and `payments/client.py:8`," the response is "let me look at that."
Evidence turns a review into a conversation about specific code, and it keeps the AI honest: if the skill can't cite an example, it doesn't get to make the claim. That rule addresses the most common complaint about AI code review, which is vague, generic advice.
## The Relationship Between Skills and Articles
Each skill's checklist derives from specific fundamentals articles. For example, the architecture skill's naming dimension comes from [Fundamentals of Naming][naming], and the reliability skill's timeout hygiene aspect from [Fundamentals of Timeouts][timeouts].
This relationship is bidirectional. Articles explain the *why* behind practices, while skills encode the *what to check* into a repeatable process. Reading gives you the mental model; running the skill shows your codebase's current state against it.
Each skill's checklist lives in its `references/checklist.md` and cites its source articles, so when an item is unclear, the linked article explains the reasoning behind it.
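A checklist entry might look roughly like this (hypothetical items; the real checklists follow the structure of their source articles):

```markdown
## Timeout hygiene

- [ ] Every outbound network call sets an explicit connect and read timeout.
- [ ] Timeout values live in configuration, not hard-coded constants.

Source: Fundamentals of Timeouts
```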
**Skills repository.** The skills are open source in the [skills repository on GitHub][skills-repo]. That repository contains the installable SKILL.md definitions and checklists; this article explains the design and rationale. The repository links back here so you can move between code and explanation.
## Why Different Situations Call for Different Reviews
Not all reviews suit every project, and running all nine domains each time wastes effort.
The choice depends on context. Before a release, a full review catches cross-cutting issues that a single-domain review would miss. During active development, run the review for the domain you changed most. After an incident, a reliability review shows where the gaps are.
Some domains only apply to some projects. A backend-only codebase has no frontend to check for accessibility, and a project without a database can skip the data review. Early-stage projects get more value from process and architecture reviews than from performance optimization.
The JIT test generator doesn't produce scores at all; it writes targeted tests for recently changed code, catching regressions while the risk is highest.
## Trade-offs and Limitations
These skills aren't a replacement for human code review. They identify structural patterns and anti-patterns, but they can't judge whether a feature solves the right problem, whether the user experience works, or whether the team's architectural trade-offs were the right ones.
The scoring is opinionated. It reflects what has and hasn't worked across many projects, as captured in the fundamentals articles, and your team might disagree with specific thresholds or weights. Because the skills are plain markdown files, adjusting a rubric is easy.
AI-powered reviews share the limitations of static analysis: false positives and false negatives both happen. The security skill mitigates this by requiring a confidence threshold of 7/10 before reporting a finding, but no tool catches everything.
Performance depends on codebase size. For large monorepos, running all nine domain reviews in parallel takes time. Running domain skills on changed files is faster and more targeted.
## Common Misconceptions
**"Automated reviews replace human reviewers."** They handle checklist tasks to let humans focus on design, trade-offs, and logic correctness. Think of them as a first pass catching mechanical issues before human review.
**"A score of 10/10 means the code is perfect."** A 10/10 means the code follows good practices but doesn't indicate if the feature is right or if the architecture fits business needs.
**"Low scores are bad."** Low scores in early projects are expected; prototypes should score low on process maturity and reliability. The score indicates current status, not future requirements. As projects mature, thresholds are raised.
**"Running the full review once is enough."** Fitness degrades over time. Code scoring 8/10 on architecture six months ago may drop to 6/10 after rapid development. Regular reviews detect trends and prevent costly degradation.
## Conclusion
The fitness review skills transform fundamentals articles into a repeatable process. They score your codebase across nine quality dimensions with `file:line` evidence, generate prioritized actions, and track fitness over time.
The mental model is simple: fundamentals articles teach *why* something matters. Fitness skills check *whether* your code follows through on that understanding. Together, they close the gap between knowing and doing.
## Next Steps
* Run `/review:review-full` on your current project to see where it stands.
* Clone the [skills repository][skills-repo] and install the skills for Claude Code or Cursor.
* Read the [fundamentals series][fundamentals-series] articles behind any low-scoring dimension.
* Adjust scoring thresholds in the SKILL.md files to match your team's standards.
* Run reviews regularly (monthly or per-milestone) to track fitness trends.
[fundamentals-series]: {{< ref "fundamentals-of-fundamentals" >}}
[jit-test-gen]: {{< ref "what-is-just-in-time-catching-test-generation" >}}
[naming]: {{< ref "fundamentals-of-naming" >}}
[timeouts]: {{< ref "fundamentals-of-timeouts" >}}
[skills-repo]: https://github.com/jeffabailey/skills