Matt Pocock's Skills: 122k Stars and a TDD Method That Actually Works

Smars
Skills, Testing
10 Jun, 2026

Matt Pocock’s Skills repository sits at 222k installs and 122k stars not because it’s revolutionary. It’s popular because it fixes a specific failure mode that every AI coding agent user encounters: the code it produces doesn’t work, and you don’t know until you run it.

The repo contains agent skills distilled from Matt’s daily .claude workflow. Big applications are hard. External APIs fail, databases deadlock, race conditions appear only under load. Approaches like GSD or Spec-Kit try to solve this by owning the entire process with heavyweight phase gates — but in doing so, they take away your control. You become a project manager instead of an engineer. Matt’s approach is radical in its simplicity: small, composable skills you pick and choose. No process ceremony. No hidden complexity. If you don’t want the testing skill, you don’t install it.

Among these skills, the TDD one is the most engineered, with the most supporting documentation (including dedicated tests.md for good/bad test examples and mocking.md for mocking guidelines), and arguably the one that generates the biggest shift in code quality. You install it once, run /tdd in your agent, and the next time you ask it to build something, it writes failing tests one at a time, implements just enough to pass, then moves to the next behavior. No more green CI badges hiding an application that does absolutely nothing.

Why Agents Write Crap Tests

When you ask an AI agent to implement a feature, it usually produces something like this:

test verifiesBilling()
test verifiesShipping()
test verifiesTaxCalculation()
test appliesDiscount()
test handlesRefund()
→ [then implements all handlers]

All tests first, then all code. Matt calls this “horizontal slicing” and the diagnosis is accurate. Horizontal slicing treats RED as “write all tests” and GREEN as “write all code.” This produces the worst of both worlds: tests that are brittle (because they were never paired with implementation) and code that has no test-driven shape.

The problems cascade:

Tests written in batch tend to mock internal collaborators because mocking is easier than wiring real dependencies through the system. They reach into private methods because exercising the same logic through the public interface would mean refactoring the production code first, and agents resist doing design work “after” the tests are written. They verify call counts and argument orders through mock spy assertions because that’s what test frameworks conventionally reward: .toHaveBeenCalledTimes(1), .toHaveBeenCalledWith(expected).

But a test that cares about call counts is testing an implementation detail. It says “this function called that function once,” not “this is the observable behavior users and callers care about.” The warning sign: your test breaks when you refactor, but the behavior hasn’t changed. If you rename an internal function and tests fail, those tests were testing implementation, not behavior.

Vertical Slices: The Corrective

The TDD skill inverts the approach. Every feature gets sliced vertically:

# Horizontal (BAD)
RED:  test1()  test2()  test3()  test4()  ← write all tests
GREEN: codeA()  codeB()  codeC()          ← write all implementation
REFACTOR: nothing because tests aren't reliable
# Result: lots of tests, nothing that works end-to-end

# Vertical (GOOD)
RED:   test1()
GREEN: → codeA()
REFACTOR: → clean up
RED:   test2()
GREEN: → codeB()
REFACTOR: → clean up
# Result: each red-green-refactor cycle produces a working increment

Vertical slicing forces incremental feedback. Each test exercises a real, working capability. You never reach a state where tests pass but nothing actually works. When the agent writes test("user can checkout with valid cart") and immediately implements checkout() to pass it, you have a working checkout. Then the next test adds test("discounts apply at checkout") and extends the existing function. The API grows organically, driven by actual tested behaviors.
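A minimal sketch of what two vertical cycles might leave behind, written in plain TypeScript so it runs without a test framework. The Item, Cart, and Receipt types, the discountPercent field, and the cent-based prices are all assumptions made for illustration; the skill does not prescribe them.

```typescript
// Hypothetical domain types; none of these names come from the skill itself.
type Item = { name: string; priceCents: number };
type Cart = { items: Item[]; discountPercent?: number };
type Receipt = { status: "confirmed" | "rejected"; totalCents: number };

// Cycle 1 (RED -> GREEN): "user can checkout with valid cart" produced the first version.
// Cycle 2 (RED -> GREEN): "discounts apply at checkout" extended the same function.
function checkout(cart: Cart): Receipt {
  const subtotal = cart.items.reduce((sum, item) => sum + item.priceCents, 0);
  const discount = Math.round((subtotal * (cart.discountPercent ?? 0)) / 100);
  return { status: "confirmed", totalCents: subtotal - discount };
}

// Behaviour from cycle 1: a valid cart is confirmed at full price (2000 cents).
const plain = checkout({ items: [{ name: "book", priceCents: 2000 }] });

// Behaviour from cycle 2: a 10% discount brings the total to 1800 cents.
const discounted = checkout({
  items: [{ name: "book", priceCents: 2000 }],
  discountPercent: 10,
});
```

In Jest or Vitest, each of the two behaviours would live in its own test() block, written and made green before the next one is started. Note that nothing here handles an empty cart yet: no test has asked for it.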

The Three Rules

The skill’s philosophy rests on three interconnected rules.

Rule 1: Tests verify behavior through public interfaces, not implementation details. Code can change entirely. Tests shouldn’t. A good test reads like a specification written in a domain language: “user can checkout with valid cart” tells you exactly what capability exists, without mentioning any internal function name or interface. A bad test reads like an implementation walkthrough: “checkout calls paymentService.process” or “createUser saves a row to the users table.” The first tells you what the system does. The second tells you how — and the how is the part most likely to change.

Rule 2: Integration tests over unit tests with mocks. Test through real code paths using public APIs. When testing checkout, create a real cart object, add a real product, call checkout(), and assert on the result. Don’t mock paymentService and assert that paymentService.process() was called with certain arguments. The integration test tells you the system works. The mocked test tells you two internal functions talked to each other.

Mock only at system boundaries: external payment APIs (Stripe, PayPal), email services, storage providers, databases when explicitly needed, the file system, time, and randomness. Never mock your own classes, your own modules, or internal collaborators. If a module calls another module in your codebase, test the caller through its public API and leave the callee to be covered by its own tests. When you reach a boundary you can’t control (a network call, a database query, a file write), design an interface for that boundary and inject or mock it.

The diagnostic metric is brutally simple: rename an internal function or move a private method to a different file. If your test breaks, it was testing implementation.
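One way the boundary rule could look in practice, as a hedged sketch: the payment gateway is the only thing replaced, while the internal cartTotal helper runs for real. PaymentGateway, cartTotal, and CheckoutResult are hypothetical names, and the calls are synchronous only to keep the example short; a real payment client would be asynchronous.

```typescript
// Boundary interface: the only piece the test replaces. Hypothetical shape.
interface PaymentGateway {
  charge(amountCents: number): { ok: boolean };
}

type CheckoutResult = { status: "confirmed" | "declined"; chargedCents: number };

// Internal collaborator: real code, never mocked.
function cartTotal(pricesCents: number[]): number {
  return pricesCents.reduce((sum, price) => sum + price, 0);
}

// Public API under test. The gateway is passed in so tests can supply a fake.
function checkout(pricesCents: number[], gateway: PaymentGateway): CheckoutResult {
  const total = cartTotal(pricesCents);
  const payment = gateway.charge(total);
  return payment.ok
    ? { status: "confirmed", chargedCents: total }
    : { status: "declined", chargedCents: 0 };
}

// Fakes standing in for the external payment API.
const approvingGateway: PaymentGateway = { charge: () => ({ ok: true }) };
const decliningGateway: PaymentGateway = { charge: () => ({ ok: false }) };

// Assertions target the observable result, not how the gateway was called.
const confirmed = checkout([1500, 500], approvingGateway);
const declined = checkout([1500, 500], decliningGateway);
```

Renaming or inlining cartTotal would not affect either check, which is the property the diagnostic above is looking for.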

Rule 3: Planning before code. The skill includes a planning phase where you explicitly confirm interface changes, prioritize behaviors to test (starting with the riskiest or most uncertain), and design for testability before writing anything. This isn’t documentation ceremony. In practice it looks like: “Here’s my plan to add PDF invoice generation. New public API would be generateInvoice(order) -> PDF. The behaviors to test are: happy path with single item, multi-item orders with discounts, empty cart returns error. The module I need to touch is invoiceService.ts which calls stripeService at a system boundary.” This takes thirty seconds. It catches three kinds of misalignment: wrong interface shape, wrong behavior priority, wrong module target. The worst case is wasting thirty seconds planning the wrong thing. The best case is avoiding an hour of agent-generated code you decide to delete.

Good Tests vs Bad Tests — Concrete Contrast

The skill’s tests.md provides concrete examples with clear before/after comparisons. Here’s the full pattern:

Good test — integration style, public API only, one logical assertion:

test("user can checkout with valid cart", async () => {
  const cart = createCart();
  cart.add(product);
  const result = await checkout(cart, paymentMethod);
  expect(result.status).toBe("confirmed");
});

This tests observable behavior. Uses public API only. Survives internal refactors. Describes what, not how. One logical assertion per test.

Bad test — mocks internal collaborator:

test("checkout calls paymentService.process", async () => {
  // jest.spyOn replaces the method with a mock function that still calls through
  const processSpy = jest.spyOn(paymentService, "process");
  await checkout(cart, payment);
  expect(processSpy).toHaveBeenCalledWith(cart.total);
});

Red flags: mocks an internal collaborator, tests an internal function call, the test name describes how not what, asserts on how the collaborator was called rather than on the result.

Bad test — bypasses interface to verify external state:

test("createUser saves to database", async () => {
  await createUser({ name: "Alice" });
  const row = await db.query("SELECT * FROM users WHERE name = ?", ["Alice"]);
  expect(row).toBeDefined();
});

// Good alternative: verifies through interface instead of direct DB query
test("createUser makes user retrievable", async () => {
  const user = await createUser({ name: "Alice" });
  const retrieved = await getUser(user.id);
  expect(retrieved.name).toBe("Alice");
});

This pattern appears everywhere in agent-generated code. The agent tries to verify database writes by querying the database directly. The right approach is to query through the public getUser() API. Same information. Different intent: one tests implementation, the other tests behavior.
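A runnable sketch of the interface-first check, assuming a hypothetical createUserStore factory whose storage is a private in-memory object standing in for the database. The check never touches that storage directly, only createUser and getUser.

```typescript
// Hypothetical user module; the in-memory object stands in for a real database.
type User = { id: number; name: string };

function createUserStore() {
  const rows: { [id: number]: User } = {}; // private storage detail, not exposed
  let nextId = 1;

  return {
    createUser(input: { name: string }): User {
      const user: User = { id: nextId, name: input.name };
      nextId += 1;
      rows[user.id] = user;
      return user;
    },
    getUser(id: number): User | undefined {
      return rows[id];
    },
  };
}

// "createUser makes user retrievable", expressed through the public API only.
const store = createUserStore();
const created = store.createUser({ name: "Alice" });
const retrieved = store.getUser(created.id);
```

Swapping the object for a Map, a SQL table, or a remote service would leave this check untouched, whereas a check that queried the storage directly would need rewriting.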

Designing for Mockability: Two Patterns

At system boundaries, the skill recommends two patterns that make testing easier without introducing complexity:

1. Dependency injection. Pass external dependencies as parameters rather than creating them inside the function:

// Easy to mock in tests
function processPayment(order, paymentClient) {
  return paymentClient.charge(order.total);
}

// Hard to mock: internal construction hides the dependency
function processPayment(order) {
  const client = new StripeClient(process.env.STRIPE_KEY);
  return client.charge(order.total);
}

The dependency injection version requires one extra parameter. In a test, you pass a mock. In production, you pass new StripeClient(process.env.STRIPE_KEY). This is not boilerplate; it is what lets you test the function in isolation.

2. SDK-style interfaces over generic fetchers. Create specific typed functions for each external operation instead of one generic fetch(uri, options) that requires conditional logic to mock:

// GOOD: Each function independently mockable with one shape
const api = {
  getUser: (id: string) => fetch(`/users/${id}`),
  getOrders: (userId: string) => fetch(`/users/${userId}/orders`),
  createOrder: (data: CreateOrderInput) => fetch('/orders', { method: 'POST', body: JSON.stringify(data) }),
};

// BAD: Mocking requires conditional logic inside the mock implementation
const api = {
  fetch: (endpoint: string, options: RequestInit) => fetch(endpoint, options),
};

The SDK approach gives three benefits: each mock returns one specific shape (no conditional logic in test setup), you can see exactly which endpoints a test exercises, and each endpoint has its own type signature for type safety.
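To make the difference in test setup concrete, here is a hedged sketch that combines both patterns: the API object is injected, and it is SDK-style. ShopApi, lifetimeSpend, and the User and Order shapes are hypothetical, and return values are synchronous here only to keep the sketch short; real fetch-based functions would return promises.

```typescript
// Hypothetical shapes for illustration only.
type User = { id: string; name: string };
type Order = { id: string; totalCents: number };

// SDK-style boundary: one typed function per external operation.
interface ShopApi {
  getUser(id: string): User;
  getOrders(userId: string): Order[];
}

// Code under test receives the boundary as a parameter (dependency injection).
function lifetimeSpend(api: ShopApi, userId: string): string {
  const user = api.getUser(userId);
  const total = api.getOrders(userId).reduce((sum, order) => sum + order.totalCents, 0);
  return user.name + " spent " + total + " cents";
}

// SDK-style fake: each function returns exactly one shape, no branching needed.
const fakeApi: ShopApi = {
  getUser: (id) => ({ id, name: "Alice" }),
  getOrders: () => [
    { id: "o1", totalCents: 1200 },
    { id: "o2", totalCents: 800 },
  ],
};

// Generic-fetcher fake, for comparison: the mock itself needs routing logic
// and can only promise a union of every possible response shape.
function fakeGenericFetch(endpoint: string): User | Order[] {
  if (endpoint.indexOf("/orders") !== -1) {
    return [{ id: "o1", totalCents: 1200 }];
  }
  return { id: "u1", name: "Alice" };
}

const summary = lifetimeSpend(fakeApi, "u1"); // "Alice spent 2000 cents"
```

The generic fake grows another branch for every endpoint a test touches, and its union return type pushes shape checks back onto the caller.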

The Red-Green-Refactor Discipline

The classic three-phase TDD cycle has one critical rule that the skill emphasizes: refactor only after all tests pass. The GREEN phase gets the test green. The REFACTOR phase cleans up duplication, extracts shared logic, deepens modules, and applies SOLID principles — but only with the test suite actively babysitting you. This isn’t pedantic TDD purity. It’s about avoiding the scenario where you refactor code and something silently breaks because you removed a guard clause or changed a return type.

The skill organizes refactoring guidelines around three practical concerns:

Duplication extraction. When two tests exercise nearly identical code paths with different parameters, extract the shared setup into a helper function. Don’t wait for the third occurrence.

Module deepening. If a function handles three different responsibilities (validates input, formats output, persists to storage), split it. Deep modules expose simple interfaces with rich implementations — one public function that does one thing well, with internal complexity behind a clean boundary.

SOLID principles as post-hoc cleanup. The skill treats SOLID as refactoring guidance applied after the tests pass, not as pre-implementation architecture. You don’t design for single responsibility before writing the test. You write the test, pass it, see the function growing responsibilities, then extract.
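A small sketch of where such a refactor might end up, assuming a hypothetical signUp function that had grown validation and formatting logic inline. The helpers were extracted only after the behaviour checks were green, and the public interface stayed a single function. All names here are illustrative.

```typescript
// Hypothetical signup module. Only signUp is meant to be public.
type SignUpResult =
  | { ok: true; displayName: string }
  | { ok: false; error: string };

// Extracted during REFACTOR: validation used to live inline in signUp.
function validateEmail(email: string): string | null {
  return email.indexOf("@") > 0 ? null : "invalid email";
}

// Extracted during REFACTOR: formatting used to live inline in signUp.
function formatDisplayName(name: string): string {
  const trimmed = name.trim();
  return trimmed.charAt(0).toUpperCase() + trimmed.slice(1);
}

// Deep module: one simple entry point, internal detail behind it.
function signUp(input: { name: string; email: string }): SignUpResult {
  const error = validateEmail(input.email);
  if (error !== null) {
    return { ok: false, error };
  }
  return { ok: true, displayName: formatDisplayName(input.name) };
}

// The same behaviour checks pass before and after the extraction.
const accepted = signUp({ name: " alice", email: "alice@example.com" });
const rejected = signUp({ name: "bob", email: "not-an-email" });
```

Because the checks only go through signUp, the helpers can be renamed, merged, or moved to another file without any test changing.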

Real World Reasoning: When the Skill Shines and When It Doesn’t

The skill is excellent for features and bugfixes where behavior is well-defined and observable. Here are the concrete scenarios where the vertical-slice TDD approach generates the most value:

Domain logic: pricing rules, discount calculations, permission checks, data transformation pipelines. These are precisely the things that are easy to spec incorrectly and expensive to fix after deployment.
API endpoints and request handlers: each endpoint becomes a testable unit with clear input/output contracts.
Data validation flows: rule engines, schema validators, form processing chains. Each validation rule gets tested independently.
Bugfixes: instead of patching haphazardly, write the test that reproduces the bug (red), fix the code to make it pass (green), then refactor with confidence.
Refactoring existing code: if a module is messy but has tests, the skill’s refactoring guidelines let you safely improve its structure.
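The bugfix scenario can be sketched as follows. The bug itself is invented for illustration: a hypothetical applyDiscount function that returned a negative total when the discount exceeded 100%. The regression check is written first, fails against the old body, and passes once the clamp is added.

```typescript
// Hypothetical bug: a discount above 100% produced a negative total.
// Old body (RED):  return subtotalCents - Math.round((subtotalCents * percent) / 100);
function applyDiscount(subtotalCents: number, percent: number): number {
  // Fix (GREEN): keep the percentage within 0..100 before applying it.
  const clamped = Math.min(Math.max(percent, 0), 100);
  return subtotalCents - Math.round((subtotalCents * clamped) / 100);
}

// Regression check that reproduces the reported bug: total must never go below zero.
const overDiscounted = applyDiscount(1000, 150); // 0 after the fix, -500 before

// Existing behaviour still holds after the fix.
const normal = applyDiscount(1000, 25); // 750
```

The regression check stays in the suite afterwards, so a later refactor that removes the clamp fails immediately.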

It’s less useful when:

Exploratory prototyping: when you’re sketching ideas and don’t know the interface yet, writing tests before code slows you down more than it helps.
UI styling and CSS: visually verified changes don’t benefit from automated tests.
Configuration-only changes: if the change is editing a JSON config or a schema with no runtime behavior, tests add overhead without value.
When your test framework isn’t set up: the skill assumes you have a working test runner with at least basic configuration. If you’re building from absolute scratch, the installation and configuration overhead can outweigh the benefit for very small projects.

How to Install and Run

One-line install using npx skills:

npx skills add https://github.com/mattpocock/skills --skill tdd

Install the full collection:

npx skills@latest add mattpocock/skills

After installation, run /setup-matt-pocock-skills in your agent to configure issue tracker integration and triage labels. Then use /tdd to activate the skill for a session.

The skill automatically brings in its supporting documentation: tests.md (good and bad test examples) and mocking.md (when to mock and when not to). You don’t need to reference these manually — the skill loads them.

Why TDD Matters More for AI Agents Than for Humans

Human-driven test-driven development has existed since the 1990s. Kent Beck popularized it with Extreme Programming. Martin Fowler wrote the canonical TDD definition article. Thousands of engineers have written red-green-refactor cycles across millions of projects. The method itself is not new.

What’s new is that AI coding agents have a strong natural bias against the parts of TDD that actually produce value. Agents want to show completeness. When asked to implement a feature, they produce all tests first, then all code — the exact horizontal slicing pattern the skill warns against. They tend to mock internal collaborators because mocking is easier: it’s one line (jest.mock(something)) versus wiring real dependencies through function parameters. They produce tests that verify call counts and argument orders because that’s the path of least resistance and what test framework examples conventionally show.

The skill’s constraints work precisely because they override these default behaviors. Vertical slicing forces the agent to stop after one test, implement it, verify it passes, then move on. No more generating fifty tests and hoping some of them are correct. Boundary-only mocking prevents the mock-everything anti-pattern that turns every test into a mock-spaghetti nightmare. The planning phase catches misalignment before the agent burns tokens generating code you don’t want.

These aren’t subtle improvements. They’re structural guardrails against the path-of-least-resistance tendencies of LLMs. The agent wants to be helpful by being comprehensive. TDD wants to be minimal by being incremental. The skill makes the agent do what it’s naturally reluctant to do: write less code, fewer tests, less of everything, until it has to write more.

The Takeaway

TDD in the AI agent context isn’t about test coverage percentages or green CI badges. It’s about a feedback loop that keeps the agent aligned with what the code should actually do. Vertical slices ensure each test translates to working functionality. Integration tests ensure tests survive refactors. Planning ensures you’re building the right thing before you write anything.

The Matt Pocock TDD skill isn’t a silver bullet. You still need to verify the generated code makes sense. You still need to read what the agent produces. But the skill removes the single biggest source of confidently wrong code: a test suite that passes but doesn’t test what you think it tests.

If you’re using an AI coding agent daily, install the TDD skill and run it on your next feature. The difference between horizontal and vertical slicing isn’t theoretical — you feel it when your test suite doesn’t break after renaming something internal. And that feeling is the definition of code you can trust.

GitHub: https://github.com/mattpocock/skills | Skills: https://www.skills.sh/mattpocock/skills/tdd

