
How to integrate Claude Code into your CI/CD pipeline

April 11, 2026 10 min read

When you write code by hand, you know what you wrote. You can defend every line in a review. You know which decisions were deliberate and which were shortcuts. AI-generated code does not come with that context. It looks confident, compiles cleanly, and passes basic tests. Then three weeks later someone discovers it was quietly ignoring error cases, using a deprecated API, or solving the wrong problem.

The fix is not to distrust AI-generated code. The fix is to treat it the same way you treat any code that arrives without full context: run it through quality gates before it merges.

This post covers six things: why AI code needs CI gates, how to score your CLAUDE.md as part of every pull request, how to set up a GitHub Actions workflow that catches common AI code issues, how to add pre-commit hooks, how to test AI-written code, and how to monitor config drift before it degrades your whole team's output.

Why AI-generated code needs quality gates

AI coding assistants make a specific kind of mistake. They are very good at producing plausible code and not great at knowing when a pattern does not fit your specific context. If your CLAUDE.md says "use the repository pattern" but does not explain how your repositories are structured, Claude will implement a repository that looks right but does not integrate with the rest of your system.

These mistakes are harder to catch in review than obvious bugs. The code is idiomatic. It uses correct syntax. It even has tests. But it bypasses your error handling layer, imports the wrong logger, or skips the validation middleware that every other route goes through.

Three things go wrong most often with AI-generated code in team codebases:

  • Convention drift: AI picks its own style when yours is underspecified. Over time, a codebase built with AI assistance accumulates subtle inconsistencies that stay invisible until you try to refactor.
  • Missing test coverage: AI writes tests when asked, but may not know which scenarios matter most for your domain. Without a gate, important edge cases stay untested.
  • Config staleness: Your CLAUDE.md was accurate six months ago. Since then, you changed frameworks, restructured directories, or updated your patterns. Claude is still following the old rules.

Automated gates catch all three. They do not require code reviewers to memorize every convention. They run the same checks on every commit, every time.
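A convention gate does not have to be elaborate. Here is a minimal sketch of one for the "wrong logger" case mentioned above. The rule itself is hypothetical: it assumes your team wraps winston in a shared logger and wants direct imports flagged. FORBIDDEN_IMPORTS and find_violations are illustrative names, not part of any tool.

```python
import re

# Hypothetical rule set: module specifier -> message shown to the developer.
FORBIDDEN_IMPORTS = {
    "winston": "import the shared logger wrapper instead of winston",
}

# Matches the quoted module name in `from '...'`, `import '...'` and `require('...')`.
SPECIFIER_RE = re.compile(
    r"""(?:\bfrom\s+|\bimport\s+|\brequire\(\s*)['"]([^'"]+)['"]"""
)


def find_violations(source: str, forbidden: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (line number, module, message) for every forbidden import in source."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in SPECIFIER_RE.finditer(line):
            module = match.group(1)
            if module in forbidden:
                violations.append((lineno, module, forbidden[module]))
    return violations
```

If you already run ESLint, its no-restricted-imports rule covers the same ground with less code. A script like this is mainly useful for conventions your linter cannot express. It is line-based, so it will also flag matches inside comments.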

Scoring your CLAUDE.md in CI with ContextKit

The most overlooked CI check for AI-assisted codebases is scoring the config file that drives your AI assistant. If your CLAUDE.md has a score below 5 out of 10, Claude Code is guessing at conventions, file structure, and testing patterns. Every AI-generated PR is working from an incomplete map.

The ContextKit CLI scores your CLAUDE.md from the terminal. It returns exit code 1 when the score falls below your threshold, which integrates cleanly into any CI system.

npx contextkit score --min 5

A score below 5 means your CLAUDE.md is missing at least two of the five core categories: structure, architecture, conventions, testing setup, and guardrails. Any of those gaps leads to AI-generated code that drifts from your standards.

You can also check what the score is without failing the build, useful for informational reporting:

npx contextkit score

This outputs a full breakdown by category, specific suggestions for what to add, and a numeric score. Add the --min flag when you want to enforce a floor.

You can also analyze your config in the browser at nova-labs.dev/contextkit/analyze - paste the file, get the score and suggestions instantly.
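Because the gate is just an exit code, you can combine it with other checks outside of GitHub Actions too. The sketch below runs a set of commands and reports which ones failed, instead of stopping at the first failure. The only ContextKit behavior it relies on is the non-zero exit code described above; run_gates, GATES, and the gate names are illustrative.

```python
import subprocess

# Example gate set built from commands used in this post.
# The lint and test commands assume matching scripts exist in package.json.
GATES = {
    "claude-md-score": ["npx", "contextkit", "score", "--min", "5"],
    "lint": ["npm", "run", "lint"],
    "tests": ["npm", "test"],
}


def run_gates(gates: dict[str, list[str]]) -> list[str]:
    """Run every gate and return the names of the ones that failed.

    A gate fails when its command exits non-zero or cannot be started.
    """
    failed = []
    for name, command in gates.items():
        try:
            result = subprocess.run(command, check=False)
        except OSError:
            # Command not found or not executable: treat as a failed gate.
            failed.append(name)
            continue
        if result.returncode != 0:
            failed.append(name)
    return failed
```

Calling run_gates(GATES) from a small script and exiting with status 1 when the returned list is non-empty gives you the same blocking behavior in any CI system that respects exit codes.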

GitHub Actions workflow

Here is a complete workflow that runs CLAUDE.md scoring, linting, and tests on every pull request. Add this file to your repo at .github/workflows/ai-quality.yml:

name: AI Code Quality

on:
  pull_request:
    branches: [main, develop]
  push:
    branches: [main]

jobs:
  config-quality:
    name: Check CLAUDE.md quality
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Score CLAUDE.md
        run: npx contextkit score --min 5
        # Fails if score is below 5 (exit code 1)
        # Remove --min 5 to report score without blocking

  code-quality:
    name: Lint and test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: '20'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint -- --max-warnings 0
        # Assumes the lint script wraps ESLint; the flag makes any warning fail the step
        # AI code often passes lint with warnings; treat warnings as errors

      - name: Run type check
        run: npm run type-check

      - name: Run tests
        run: npm test -- --coverage

A few things worth explaining in this workflow:

The config-quality job runs independently from the code-quality job. If your CLAUDE.md score drops below 5, the check fails regardless of whether the code itself passes tests, and the PR is blocked as long as the check is marked as required in your branch protection rules. This creates the right pressure: fixing a degraded config file is a prerequisite for merging, not an optional cleanup item.

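The linter step should run with --max-warnings 0 (or the equivalent for your linter). With ESLint, either add the flag to the lint script in package.json or pass it through as npm run lint -- --max-warnings 0. AI-generated code often passes lint with minor warnings because the AI knows the rules but applies them inconsistently. Treating warnings as errors forces those inconsistencies to be resolved.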

The test step collects coverage with --coverage but does not enforce a minimum by itself; set the threshold in your test runner's configuration so the step fails when coverage drops. Where to set the bar is a judgment call. 80% line coverage is a reasonable starting point for AI-assisted codebases. The AI writes tests when it writes code, so coverage tends to be higher than in purely hand-written codebases, but the tests often miss the edge cases that actually matter. A coverage gate ensures that, at minimum, the happy path is tested.
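If your test runner cannot enforce the floor itself, a short script can. This sketch assumes the coverage run writes an Istanbul-style json-summary report. The coverage/coverage-summary.json path and the total.<metric>.pct layout are assumptions about your setup (Jest and nyc can produce this report with the json-summary reporter), and coverage_failures is an illustrative name.

```python
import json
from pathlib import Path


def load_summary(path: str = "coverage/coverage-summary.json") -> dict:
    """Read the coverage summary; the default path is an assumption."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def coverage_failures(summary: dict, minimums: dict[str, float]) -> list[str]:
    """Compare total coverage percentages against per-metric minimums.

    Returns one human-readable message per metric that is missing or too low.
    """
    total = summary.get("total", {})
    failures = []
    for metric, minimum in minimums.items():
        pct = total.get(metric, {}).get("pct")
        if not isinstance(pct, (int, float)):
            # Missing or non-numeric figure: report it rather than pass silently.
            failures.append(f"{metric}: no coverage figure reported")
        elif pct < minimum:
            failures.append(f"{metric}: {pct}% is below the {minimum}% minimum")
    return failures
```

A CI step would call coverage_failures(load_summary(), {"lines": 80}) after the test run and exit non-zero if the list is not empty. If you use Jest, its coverageThreshold option enforces the same floor inside the test runner, which is usually the simpler route.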

Pre-commit hooks with Claude Code

CI catches problems after the fact. Pre-commit hooks catch them before a commit ever happens. For AI-assisted development, where a single session can produce dozens of files, catching issues locally is faster and less disruptive.

Install pre-commit and create a .pre-commit-config.yaml file in your repo root:

repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.5.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-merge-conflict
      - id: check-json
      - id: check-yaml

  - repo: local
    hooks:
      - id: contextkit-score
        name: Score CLAUDE.md
        language: system
        entry: npx contextkit score --min 5
        files: CLAUDE\.md$
        pass_filenames: false
        # Only runs when CLAUDE.md is modified

      - id: eslint
        name: ESLint
        language: system
        entry: npx eslint --max-warnings 0
        files: \.(js|ts|jsx|tsx)$
        types: [file]

      - id: test-coverage-check
        name: Check test files exist
        language: system
        entry: python3 scripts/check_test_coverage.py
        files: src/.*\.(js|ts)$
        pass_filenames: true

The contextkit-score hook only runs when CLAUDE.md is modified. This means it does not slow down every commit, only the ones that change the config file. When someone edits CLAUDE.md and the score drops below 5, the commit is blocked until they fix it.
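To see which commits trigger a hook, you can try its files pattern against a few paths. This sketch assumes pre-commit applies files as a Python regular expression search against each staged path, which is how its documentation describes the filter. hook_would_run is an illustrative helper, not a pre-commit API.

```python
import re

# The two `files` patterns from the config above.
CLAUDE_MD = re.compile(r"CLAUDE\.md$")
SOURCE_FILES = re.compile(r"\.(js|ts|jsx|tsx)$")


def hook_would_run(pattern: re.Pattern, staged_paths: list[str]) -> bool:
    """Return True if at least one staged path matches the hook's pattern."""
    return any(pattern.search(path) for path in staged_paths)
```

Because the pattern is a search and not a full match, packages/api/CLAUDE.md triggers the scoring hook as well as the root file. So would a file named NOT_CLAUDE.md. If that matters in your repo, a stricter pattern such as (^|/)CLAUDE\.md$ limits the hook to files with exactly that name.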

The test coverage check hook is a custom script that verifies a matching test file exists for each staged source file. It looks for the test under tests/, mirroring the file's path under src/. Here is a minimal version of that script:

#!/usr/bin/env python3
# scripts/check_test_coverage.py
import sys
import os

missing = []
for filepath in sys.argv[1:]:
    # Skip test files themselves
    if '.test.' in filepath or '.spec.' in filepath:
        continue
    # Only check source files in src/
    if not filepath.startswith('src/'):
        continue

    # Derive expected test file path
    base = filepath.replace('src/', 'tests/', 1)
    name, ext = os.path.splitext(base)
    test_path = f"{name}.test{ext}"

    if not os.path.exists(test_path):
        missing.append(f"  Missing: {test_path} (for {filepath})")

if missing:
    print("Source files are missing test files:")
    for m in missing:
        print(m)
    sys.exit(1)

This is blunt but effective. AI-generated code often comes with tests for the obvious paths. The script just makes sure a test file exists at all. You can extend it with actual coverage checks once you have the baseline in place.
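One small extension is to accept more than one test layout. The sketch below is a variation on the script above, not a replacement for it: it also accepts .spec files and tests colocated next to the source file. Both conventions are assumptions about your repo, and candidate_test_paths and has_test are illustrative names.

```python
import os


def candidate_test_paths(filepath: str) -> list[str]:
    """List the places a test for filepath may live, in order of preference."""
    name, ext = os.path.splitext(filepath)
    # Same mapping as the original script: src/... -> tests/...
    mirrored = name.replace("src/", "tests/", 1)
    return [
        f"{mirrored}.test{ext}",
        f"{mirrored}.spec{ext}",
        f"{name}.test{ext}",  # colocated test
        f"{name}.spec{ext}",  # colocated spec
    ]


def has_test(filepath: str, exists=os.path.exists) -> bool:
    """Return True if any candidate test file exists.

    The exists argument is injectable so the function can be checked
    without touching the file system.
    """
    return any(exists(path) for path in candidate_test_paths(filepath))
```

In the hook script, replacing the single os.path.exists(test_path) check with has_test(filepath) is enough to support all four layouts.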

Automated testing strategies for AI-written code

The testing challenge with AI-generated code is that it comes with tests already. The AI wrote them. The issue is not missing tests; it is tests that validate the wrong things.

AI tends to test the happy path thoroughly and the edge cases inconsistently. It writes tests based on what it knows about the function, not based on what could actually go wrong in your system. This produces high coverage numbers that give false confidence.

A few strategies that work better than relying on AI-generated tests alone:

Review tests before reviewing code

In code review, look at the test file first. What scenarios does it cover? What scenarios are missing? This is faster than reading the implementation and tells you just as much about whether the AI understood the requirements.

Add a required test categories checklist to your CLAUDE.md

Tell Claude explicitly what to test. A generic instruction like "write tests" produces generic tests. Specific instructions produce better coverage:

# Testing Requirements
- Every function that handles user input must test the invalid input case
- Every async function must test the rejection/error path
- Every function that calls an external service must test the failure case
- Never mock internal modules; use real implementations with test data
- Happy path + at least 2 edge cases minimum per function

Use mutation testing on AI-generated code specifically

Mutation testing tools (Stryker for JS/TS, mutmut for Python) introduce small changes to your code and check whether tests catch them. AI-generated tests often fail mutation testing spectacularly: they test that a function runs without errors, not that it produces the right output. A 10-minute mutation test run on a new AI-generated module will tell you more than a coverage report.
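To make the idea concrete, here is a hand-rolled illustration of what a mutation tool automates. It is not how Stryker or mutmut are invoked; the function, the mutant, and both tests are made up for the example.

```python
def apply_discount(price: float, percent: float) -> float:
    """Original implementation."""
    return price - price * percent / 100


def apply_discount_mutant(price: float, percent: float) -> float:
    """A typical mutant: the subtraction has been flipped to an addition."""
    return price + price * percent / 100


def weak_test(fn) -> bool:
    """Only checks that the call succeeds and returns a number."""
    return isinstance(fn(100.0, 10.0), float)


def strong_test(fn) -> bool:
    """Checks the actual value, so a flipped operator is detected."""
    return fn(100.0, 10.0) == 90.0


def survives(mutant, test) -> bool:
    """A mutant survives when the test still passes against it."""
    return test(mutant)
```

Both tests pass against the original function and both count toward line coverage. Only strong_test fails against the mutant. A surviving mutant is the signal that a test exercises the code without checking its result.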

Monitoring config drift in team repos

In a solo project, CLAUDE.md drift is annoying. In a team repo, it becomes a consistency problem. Developer A updates a directory structure, developer B adds a new test framework for a specific module, developer C changes the deployment process. Nobody updates CLAUDE.md. Within a month, the config file describes a project that no longer exists.

The result: every team member using Claude Code is working from a different mental model of the codebase, and all of them are wrong.

Three practices that prevent this:

Add CLAUDE.md review to your PR template

Add a checkbox to your pull request template: "Does this change require updating CLAUDE.md?" This is not automated, but it creates a moment of deliberate consideration. Add it to .github/pull_request_template.md:

## Changes
[Describe what this PR does]

## Checklist
- [ ] Tests added or updated
- [ ] Documentation updated if needed
- [ ] CLAUDE.md updated if this changes file structure, conventions, or tooling

Run a weekly config score check

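Add a scheduled workflow that runs npx contextkit score weekly. This catches drift that individual PRs miss because the overall config degrades gradually. The workflow below covers the scoring half: the scheduled run fails when the score is below 5, and the failure shows up in the Actions tab. Posting the result to your team Slack or creating a GitHub issue needs an additional step.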

name: Weekly Config Audit

on:
  schedule:
    - cron: '0 9 * * 1'  # Every Monday at 09:00 UTC

jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Score CLAUDE.md
        run: npx contextkit score --min 5
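
For the Slack side of the weekly audit, a small script run as a follow-up step is one option. This is a sketch under assumptions: it posts to a Slack incoming webhook, which accepts a JSON body with a text field, and the webhook URL would come from a repository secret you create yourself. The function names and the message wording are illustrative.

```python
import json
import urllib.request


def audit_message(repo: str, passed: bool, minimum: int) -> str:
    """Build the text posted to the channel."""
    status = "meets" if passed else "is below"
    return f"Weekly CLAUDE.md audit for {repo}: score {status} the minimum of {minimum}."


def build_slack_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Prepare the POST request for a Slack incoming webhook."""
    payload = json.dumps({"text": message}).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def notify(webhook_url: str, message: str) -> int:
    """Send the message and return the HTTP status code."""
    request = build_slack_request(webhook_url, message)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

In the workflow, the step that calls notify needs to run even when the scoring step fails. Otherwise the message is never sent on the weeks when it matters.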

Pin CLAUDE.md reviews to architecture decisions

When you make a significant architectural decision, update CLAUDE.md in the same commit. This keeps the config file aligned with actual decisions rather than lagging behind them. Treat CLAUDE.md as living documentation that changes when the system changes, not after.
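One cheap drift signal you can compute yourself is whether the paths mentioned in CLAUDE.md still exist. The sketch below is a heuristic, not a ContextKit feature. It assumes paths are written in backticks and treats any backticked token containing a slash or ending in a file extension as a path, so expect some false positives (version numbers, for example). referenced_paths and stale_paths are illustrative names.

```python
import re
from pathlib import Path

# Backtick-quoted tokens without whitespace, e.g. `src/api/` or `package.json`.
TOKEN_RE = re.compile(r"`([^`\s]+)`")


def referenced_paths(config_text: str) -> list[str]:
    """Extract tokens from the config text that look like repository paths."""
    paths = []
    for token in TOKEN_RE.findall(config_text):
        if token.startswith(("http://", "https://")):
            continue
        if "/" in token or re.search(r"\.\w+$", token):
            paths.append(token)
    return paths


def stale_paths(config_text: str, repo_root: str) -> list[str]:
    """Return referenced paths that no longer exist under repo_root."""
    root = Path(repo_root)
    # Strip leading slashes so every path is resolved relative to the repo root.
    return [p for p in referenced_paths(config_text) if not (root / p.lstrip("/")).exists()]
```

Running stale_paths(Path("CLAUDE.md").read_text(), ".") in the weekly audit turns "the config describes a project that no longer exists" into a concrete list of directories to fix.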

What this looks like in practice

A team that has implemented these gates sees a different kind of failure mode. Before: AI-generated code passes review, gets merged, causes subtle bugs in production three weeks later, no one can trace why. After: AI-generated code fails the pre-commit hook because the test file is missing, or the PR is blocked because a recent refactor dropped the CLAUDE.md score to 4, or the weekly audit flags that the config no longer mentions the new authentication module.

The failures are earlier, more obvious, and cheaper to fix. That is the actual value of these gates: not preventing AI from writing code, but catching the mismatch between what the AI was told and what the codebase actually needs.

None of this requires a complex setup. The GitHub Actions workflow above takes about 30 minutes to add to an existing project. The pre-commit hooks take another 15 minutes. The CLAUDE.md scoring check is one line.

Start with the CLAUDE.md score gate. If your config is already solid, the check will pass and add zero friction. If it surfaces a score below 5, fixing it will immediately improve every Claude Code session on your team.

Score your config now at nova-labs.dev/contextkit/analyze, or run it directly in your repo with the ContextKit CLI: npx contextkit score.
