API Contract Testing + LLM Evals: The Safety Net for Product Changes
Introduction
Product changes fail in predictable ways:
- endpoints break compatibility,
- pagination or filtering semantics shift silently,
- error codes change and break clients,
- and AI-assisted refactors introduce behavior drift that unit tests don’t catch.
To avoid these failures, you need a safety net that operates at the boundaries:
- API contract testing for deterministic protocol correctness,
- and LLM evals for probabilistic quality and policy correctness (when AI is involved).
This article shows how to combine both into a CI strategy that keeps your product stable as engineering accelerates.
Section 1: What “API Contracts” Actually Include
An API contract is more than the request/response schema.
Your contract includes:
- response fields and types,
- pagination semantics (limits, cursors, ordering),
- filtering behavior and defaults,
- error code mapping,
- idempotency expectations,
- and rate limit headers/behavior.
Contract tests should validate the full set of invariants that clients depend on.
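One way to make those invariants testable is to write the contract down as data. The sketch below assumes a hypothetical `GET /items` endpoint; all field names, limits, and headers are illustrative, not from any real API:

```python
# A minimal sketch: contract invariants expressed as data for a hypothetical
# GET /items endpoint (names and values are illustrative assumptions).
CONTRACT = {
    "response_fields": {"id": int, "name": str, "created_at": str},
    "pagination": {"default_limit": 20, "max_limit": 100, "ordering": "created_at desc"},
    "error_format": {"code": str, "message": str},
    "rate_limit_headers": ["X-RateLimit-Limit", "X-RateLimit-Remaining"],
}

def violates_field_types(item: dict) -> list[str]:
    """Return names of response fields that are missing or wrongly typed."""
    return [
        field for field, expected in CONTRACT["response_fields"].items()
        if not isinstance(item.get(field), expected)
    ]

print(violates_field_types({"id": 1, "name": "widget", "created_at": "2024-01-01"}))  # []
print(violates_field_types({"id": "1", "name": "widget"}))  # ['id', 'created_at']
```

Keeping the contract as data means schema checks, docs, and tests can all read from the same source of truth instead of drifting apart.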
Section 2: Contract Testing for Deterministic Boundaries
Implement contract tests that validate:
- schema correctness (JSON structure),
- correct status codes for known scenarios,
- deterministic error payload formats,
- pagination and filtering rules,
- and backwards compatibility for “old” client request shapes.
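A deterministic error-payload check from that list can be sketched with plain assertions. In a real suite this would run against a test server via pytest and an HTTP client; here the responses are stubbed and the status codes and payload shapes are illustrative assumptions:

```python
# A minimal contract-test sketch: assert the deterministic error-payload
# shape clients depend on. Stubbed responses; scenarios are illustrative.

def check_error_contract(status: int, body: dict) -> None:
    """Fail loudly if an error response breaks the documented contract."""
    assert status in {400, 401, 403, 404, 409, 422, 429}, f"unexpected status {status}"
    assert set(body) >= {"code", "message"}, "error payload must carry code + message"
    assert isinstance(body["code"], str) and isinstance(body["message"], str)

# Simulated responses for two known scenarios.
check_error_contract(404, {"code": "not_found", "message": "item 42 does not exist"})
check_error_contract(422, {"code": "invalid_filter", "message": "unknown field 'color'"})
print("error contract holds")
```

Because the checks are assertions, any contract break fails the CI job immediately rather than surfacing later as a confused client.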
For product engineering, contract tests should run in CI whenever:
- an endpoint is modified,
- shared libraries change,
- or schema versions are bumped.
Section 3: Where LLM Evals Fit (Probabilistic Quality)
If AI features are part of the API (AI summaries, triage outputs, agent decisions), contract tests alone don’t ensure quality.
LLM evals validate:
- output validity (schema, JSON parse, required fields),
- quality metrics (does the output satisfy the acceptance rubric?),
- safety/policy compliance,
- and regression across prompt versions or model routing changes.
You can treat evals as a “contract for output quality.”
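The output-validity layer of that contract can be sketched as a pre-scoring gate: parse the raw model response as JSON and confirm the required fields exist before any quality rubric runs. The field names below are illustrative assumptions:

```python
import json

# A sketch of the "contract for output quality" validity gate: the raw model
# response must parse as JSON and carry required fields before scoring.
# Field names are illustrative assumptions, not from a real product schema.
REQUIRED_FIELDS = {"summary", "severity", "tags"}

def validate_llm_output(raw: str) -> tuple[bool, str]:
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        return False, f"missing fields: {sorted(missing)}"
    return True, "ok"

print(validate_llm_output('{"summary": "db outage", "severity": "high", "tags": ["infra"]}'))
print(validate_llm_output('{"summary": "db outage"}'))
```

Validity failures like these are deterministic, which is why (as Section 6 argues) they can block merges outright, unlike fuzzier quality metrics.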
Section 4: A Combined Test Matrix (Lean, Effective)
For most teams, an effective test matrix looks like:
- Contract tests (protocol correctness)
- LLM output validation (schema + policy checks)
- Golden set evals (quality regression)
- Slice-based evals (detect minority regressions)
Slice-based evals matter because a change might improve overall average quality while degrading a critical user segment.
Section 5: Example Golden Eval Design
Start with a small dataset:
- real production-like inputs,
- the expected rubric scores (or reference outputs when feasible),
- and metadata to group eval cases into slices.
Then run:
- evals for each prompt version,
- evals for each model routing change,
- and comparisons vs baseline.
Block merges when key metrics regress beyond thresholds.
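The baseline comparison and threshold gate can be sketched as a small function. The metric names and the 0.02 threshold here are illustrative assumptions; real teams would pick thresholds per metric:

```python
# A sketch of a regression gate: compare candidate eval metrics against a
# stored baseline. Metric names and the 0.02 threshold are illustrative.
THRESHOLD = 0.02  # block the merge if a key metric drops by more than this

def regressions(baseline: dict[str, float], candidate: dict[str, float]) -> dict[str, float]:
    """Return each metric whose drop vs. baseline exceeds the threshold."""
    return {
        m: baseline[m] - candidate.get(m, 0.0)
        for m in baseline
        if baseline[m] - candidate.get(m, 0.0) > THRESHOLD
    }

base = {"faithfulness": 0.91, "rubric_score": 0.84}
cand = {"faithfulness": 0.92, "rubric_score": 0.78}
bad = regressions(base, cand)
print(bad)  # rubric_score regressed beyond the threshold
print("block merge" if bad else "ok to merge")
```

Note that a metric missing from the candidate run counts as a full regression here, which is usually the safe default for a CI gate.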
Section 6: Practical CI Failure Policies
To make this work in real product teams, define merge policies:
- Contract test failures: always block.
- LLM schema failures: block.
- Quality regression above threshold: block.
- Minor quality drift: require explicit review and rollout plan.
Then document the response actions for each failure class so teams know exactly what to do when a check fails.
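The policy list above maps naturally onto a single decision function, so the merge outcome is computed the same way everywhere. This is a sketch; the category flags would come from the actual CI check results:

```python
from enum import Enum

# A sketch mapping the failure categories above to merge decisions.
# The boolean inputs mirror the policy list; how they are computed
# (thresholds, schema checks) lives in the individual CI jobs.
class Decision(Enum):
    BLOCK = "block"
    REVIEW = "require review + rollout plan"
    PASS = "pass"

def merge_decision(contract_failed: bool, schema_failed: bool,
                   quality_regressed: bool, minor_drift: bool) -> Decision:
    if contract_failed or schema_failed or quality_regressed:
        return Decision.BLOCK
    if minor_drift:
        return Decision.REVIEW
    return Decision.PASS

print(merge_decision(False, False, False, True).value)   # require review + rollout plan
print(merge_decision(True, False, False, False).value)   # block
```

Encoding the policy in one place keeps "block" vs "review" decisions consistent across repos instead of being re-litigated in each PR.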
Conclusion
API contract testing keeps product integrations stable.
LLM evals keep AI features correct, safe, and consistent.
When you combine both, you get a safety net that supports faster product iteration without silently breaking clients or degrading AI quality.