Offline + Online Eval: The Hybrid Testing Strategy for Production LLM Systems
Introduction
Traditional software testing assumes deterministic behavior: given input X, expect output Y. LLM systems are probabilistic: given input X, expect an output in the neighborhood of Y, where "neighborhood" is hard to define.
48% of organizations running agents in production do not run offline evaluations. 63% skip online monitoring. The result: prompt changes ship without regression detection, model upgrades break production silently, and quality degrades over weeks before anyone notices.
The industry standard for 2026 is hybrid evaluation: offline golden datasets in CI/CD, combined with online monitoring of live traffic. Neither alone is sufficient.
Section 1: Offline Evals (Golden Datasets)
Offline evals run against a fixed dataset of inputs with expected outputs or quality criteria. They run in CI/CD on every change.
Building a golden dataset
Start with 20–50 cases covering:
- Happy path: typical queries your system handles well,
- Edge cases: ambiguous inputs, missing context, multi-step requests,
- Adversarial cases: prompt injection attempts, out-of-scope requests,
- Regression cases: bugs you have fixed in production (never delete these).
For each case, define:
```json
{
  "input": "What is the refund policy for orders over $500?",
  "expected_contains": ["30-day", "refund"],
  "expected_not_contains": ["I don't know"],
  "max_latency_ms": 3000,
  "max_cost_usd": 0.05
}
```
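A case in this shape can be checked mechanically. The following is a minimal sketch, assuming the pipeline returns the response text together with a measured latency and cost; `check_case` and `CaseResult` are hypothetical names, and case-insensitive matching is an assumption rather than something the case format requires.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    passed: bool
    failures: list[str] = field(default_factory=list)


def check_case(case: dict, output: str, latency_ms: float, cost_usd: float) -> CaseResult:
    """Score one pipeline response against one golden case."""
    failures = []
    text = output.lower()  # assumption: phrase matching ignores case

    for phrase in case.get("expected_contains", []):
        if phrase.lower() not in text:
            failures.append(f"missing required phrase: {phrase!r}")
    for phrase in case.get("expected_not_contains", []):
        if phrase.lower() in text:
            failures.append(f"found forbidden phrase: {phrase!r}")

    # Budgets are optional; a missing key means "no limit".
    if latency_ms > case.get("max_latency_ms", float("inf")):
        failures.append(f"latency {latency_ms:.0f} ms over budget")
    if cost_usd > case.get("max_cost_usd", float("inf")):
        failures.append(f"cost ${cost_usd:.4f} over budget")

    return CaseResult(passed=not failures, failures=failures)


case = {
    "input": "What is the refund policy for orders over $500?",
    "expected_contains": ["30-day", "refund"],
    "expected_not_contains": ["I don't know"],
    "max_latency_ms": 3000,
    "max_cost_usd": 0.05,
}
result = check_case(case, "Orders over $500 have a 30-day refund window.", 1200, 0.01)
```

Returning the list of failures, not just a boolean, makes a red CI run easier to triage.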
Evaluation criteria
Not every eval needs exact match. Use the right criterion per case:
| Criterion | Use when |
|---|---|
| Exact match | Classification, routing, structured output |
| Contains / not-contains | Open-ended responses with key facts |
| LLM-as-judge | Complex quality assessment (use a stronger model to grade) |
| Semantic similarity | Paraphrased correct answers (e.g., embedding cosine similarity > 0.9; tune the threshold per embedding model) |
| Tool call verification | Agent called the right tool with the right arguments |
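Three of the criteria in the table reduce to small pure functions. The sketch below assumes embeddings are already computed elsewhere and passed in as plain lists of floats, and that a tool call is recorded as a dict with `name` and `args` keys; both shapes, and all function names, are illustrative assumptions. LLM-as-judge is left out because it depends on a model call.

```python
import math


def exact_match(expected: str, actual: str) -> bool:
    """Exact match after trimming whitespace and ignoring case (an assumption)."""
    return expected.strip().lower() == actual.strip().lower()


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantically_similar(emb_expected: list[float], emb_actual: list[float],
                         threshold: float = 0.9) -> bool:
    """Apply the table's cosine > 0.9 rule to two precomputed embeddings."""
    return cosine_similarity(emb_expected, emb_actual) > threshold


def tool_call_matches(expected: dict, actual: dict) -> bool:
    """True when the tool name matches and every expected argument is present
    with the same value. Extra arguments in the actual call are tolerated."""
    if expected["name"] != actual.get("name"):
        return False
    actual_args = actual.get("args", {})
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected.get("args", {}).items()
    )
```

Whether extra arguments should be tolerated is a design choice; a stricter check would compare the two `args` dicts for equality.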
CI/CD integration
Run offline evals on:
- every prompt change,
- every model version change,
- every tool schema change,
- every retrieval index update.
Block the merge if the pass rate drops below a threshold (typically 85–95%, depending on how mature the system is).
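The gate itself can be a few lines. This is a sketch under the assumption that the eval runner yields one boolean per golden case; the 90% default is an arbitrary point inside the range quoted above, and treating an empty run as a failure is a deliberate assumption so that a broken runner cannot pass silently.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; an empty run counts as 0.0 (assumption)."""
    if not results:
        return 0.0
    return sum(results) / len(results)


def gate(results: list[bool], threshold: float = 0.90) -> int:
    """Return a process exit code: 0 lets the merge proceed, 1 blocks it."""
    rate = pass_rate(results)
    print(f"pass rate: {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1
```

A CI entry point would end with `sys.exit(gate(results))`, since most CI systems fail a job on a non-zero exit code.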
Section 2: Online Evals (Live Monitoring)
Offline evals cannot cover the infinite input space. Online evals sample live traffic and assess quality continuously.
What to monitor
- Response quality score: LLM-as-judge or human review on sampled responses,
- Task completion rate: whether the agent achieved the user's goal,
- Tool call accuracy: whether the agent called the right tools,
- Latency and cost: p50/p95 per request, cost per task,
- Error rate: failures, timeouts, guardrail triggers,
- User feedback: thumbs up/down, escalation to human.
Sampling strategy
Evaluating every request is usually too expensive and too slow. Sample:
- 5–10% of traffic for automated quality scoring,
- 100% of failures and escalations for root cause analysis,
- 100% of high-cost requests (agent loops, large context).
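One way to express these rules is a single decision function. The sketch below hashes the request ID so the same request always gets the same decision, which is an implementation choice rather than something the rules above require; the function names, the 5% default, and the `high_cost_usd` cutoff are all hypothetical.

```python
import hashlib


def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically place a request in or out of the sample by hashing its ID."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def should_evaluate(request_id: str, failed: bool, escalated: bool, cost_usd: float,
                    *, sample_rate: float = 0.05, high_cost_usd: float = 0.50) -> bool:
    """Decide whether a live request is sent to quality scoring."""
    if failed or escalated:
        return True  # 100% of failures and escalations
    if cost_usd >= high_cost_usd:
        return True  # 100% of high-cost requests (cutoff is an assumption)
    return in_sample(request_id, sample_rate)  # a fixed fraction of everything else
```

Hash-based sampling keeps decisions reproducible across retries and replicas, at the cost of never evaluating requests whose IDs fall outside the bucket unless the rate changes.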
Alerting thresholds
| Metric | Warning | Critical |
|---|---|---|
| Quality score drop | > 5% from baseline | > 15% from baseline |
| Task completion rate | < 85% | < 70% |
| Cost per task increase | > 20% from baseline | > 50% from baseline |
| Error rate | > 2% | > 5% |
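The table translates directly into a classifier. This sketch assumes metrics arrive as fractions (0.85 for 85%) and reads "from baseline" as a relative change; the dict keys and function names are hypothetical.

```python
def level(value: float, warning: float, critical: float,
          *, higher_is_worse: bool = True) -> str:
    """Map a metric onto 'ok', 'warning', or 'critical' using strict comparisons."""
    if not higher_is_worse:
        # Flip signs so the same '>' comparisons handle lower-is-worse metrics.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def classify(current: dict, baseline: dict) -> dict:
    """Apply the alerting thresholds from the table to one snapshot of metrics."""
    quality_drop = (baseline["quality"] - current["quality"]) / baseline["quality"]
    cost_increase = (
        (current["cost_per_task"] - baseline["cost_per_task"]) / baseline["cost_per_task"]
    )
    return {
        "quality": level(quality_drop, 0.05, 0.15),
        "task_completion": level(current["task_completion"], 0.85, 0.70,
                                 higher_is_worse=False),
        "cost_per_task": level(cost_increase, 0.20, 0.50),
        "error_rate": level(current["error_rate"], 0.02, 0.05),
    }


alerts = classify(
    {"quality": 0.70, "task_completion": 0.65, "cost_per_task": 0.13, "error_rate": 0.01},
    {"quality": 0.80, "cost_per_task": 0.10},
)
```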
Section 3: The Hybrid Strategy
Offline and online evals serve different purposes:
| Dimension | Offline | Online |
|---|---|---|
| Coverage | Fixed, known cases | Infinite, real-world inputs |
| Speed | Fast (CI/CD) | Continuous |
| Catches | Regressions on known cases | Drift, new failure modes |
| Cost | Low (batch runs) | Moderate (sampling) |
| Blocks deploys | Yes | No (alerts only) |
The workflow
- Develop: engineer changes prompt/tool/model,
- Offline eval: CI runs golden dataset → pass/fail gate,
- Deploy: canary release to 5% of traffic,
- Online eval: monitor quality/cost/latency on canary for 24–48 hours,
- Promote or rollback: based on online metrics vs baseline,
- Feed back: production failures become new offline eval cases.
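The promote-or-rollback step can be sketched as a comparison of canary metrics against the baseline. The limits below reuse the warning thresholds from the alerting table for quality and cost; the latency limit, the metric keys, and the `decide` function are assumptions made for illustration.

```python
def decide(canary: dict, baseline: dict, *,
           max_quality_drop: float = 0.05,
           max_cost_increase: float = 0.20,
           max_latency_increase: float = 0.20) -> tuple[str, list[str]]:
    """Return ('promote' | 'rollback', reasons) from relative changes vs baseline."""
    reasons = []

    quality_drop = (baseline["quality"] - canary["quality"]) / baseline["quality"]
    if quality_drop > max_quality_drop:
        reasons.append(f"quality dropped {quality_drop:.1%}")

    cost_increase = (
        (canary["cost_per_task"] - baseline["cost_per_task"]) / baseline["cost_per_task"]
    )
    if cost_increase > max_cost_increase:
        reasons.append(f"cost per task rose {cost_increase:.1%}")

    latency_increase = (
        (canary["p95_latency_ms"] - baseline["p95_latency_ms"]) / baseline["p95_latency_ms"]
    )
    if latency_increase > max_latency_increase:
        reasons.append(f"p95 latency rose {latency_increase:.1%}")

    # Any single breach is enough to roll back.
    return ("rollback" if reasons else "promote", reasons)


baseline = {"quality": 0.83, "cost_per_task": 0.020, "p95_latency_ms": 2000}
healthy = {"quality": 0.82, "cost_per_task": 0.021, "p95_latency_ms": 2100}
degraded = {"quality": 0.70, "cost_per_task": 0.021, "p95_latency_ms": 2100}
```

With these numbers, `decide(healthy, baseline)` promotes and `decide(degraded, baseline)` rolls back with a quality reason attached.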
Section 4: Implementing Eval Infrastructure
Minimum viable eval stack
- Golden dataset: JSON file in your repo (version-controlled),
- Eval runner: script that calls your LLM pipeline and scores results,
- CI integration: GitHub Action that runs evals on PR,
- Online sampling: middleware that logs 5% of requests with quality scores,
- Dashboard: weekly review of offline pass rate + online quality trends.
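The first two pieces of this stack, the JSON dataset and the eval runner, fit in one short script. The sketch below takes the pipeline as a plain callable so it runs without a model; `fake_pipeline`, `run_evals`, and the inline `GOLDEN_JSON` string (standing in for a file in the repo) are hypothetical.

```python
import json
import time
from typing import Callable


def run_evals(cases: list[dict], pipeline: Callable[[str], str]) -> list[dict]:
    """Call the pipeline once per golden case and record pass/fail plus latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        try:
            output, error = pipeline(case["input"]), None
        except Exception as exc:  # a crash fails the case, not the whole run
            output, error = "", repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000

        text = output.lower()
        passed = (
            error is None
            and all(p.lower() in text for p in case.get("expected_contains", []))
            and not any(p.lower() in text for p in case.get("expected_not_contains", []))
            and latency_ms <= case.get("max_latency_ms", float("inf"))
        )
        results.append({"input": case["input"], "passed": passed,
                        "latency_ms": latency_ms, "error": error})
    return results


def fake_pipeline(prompt: str) -> str:
    """Stand-in for the real LLM pipeline so the sketch runs offline."""
    return "Refunds are accepted within a 30-day window."


# In a real repo this would be read from a version-controlled JSON file.
GOLDEN_JSON = """
[
  {"input": "What is the refund policy?", "expected_contains": ["30-day", "refund"]},
  {"input": "I want to talk to a person.", "expected_contains": ["escalate"]}
]
"""

results = run_evals(json.loads(GOLDEN_JSON), fake_pipeline)
```

Injecting the pipeline as a callable keeps the runner testable and lets the same code score different prompt or model variants.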
Mature eval stack
- Dedicated eval platform (LangSmith, Braintrust, or custom),
- Automated LLM-as-judge with calibrated scoring,
- A/B testing framework for prompt/model variants,
- Automatic golden dataset expansion from production failures.
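The last item, growing the golden dataset from production failures, might look like the sketch below. The failure log schema (`request_id`, `input`, `bad_output_snippet`) and both function names are hypothetical, and the draft case is deliberately left incomplete so that a reviewer supplies the expected content.

```python
def failure_to_case(failure: dict) -> dict:
    """Turn one logged production failure into a draft regression case."""
    return {
        "input": failure["input"],
        "expected_contains": [],  # left empty for a human reviewer to fill in
        "expected_not_contains": [failure["bad_output_snippet"]],
        "tags": ["regression"],
        "source": failure["request_id"],  # trace back to the original incident
    }


def add_case(dataset: list[dict], case: dict) -> bool:
    """Append the case unless the dataset already has a case with the same input."""
    if any(existing["input"] == case["input"] for existing in dataset):
        return False
    dataset.append(case)
    return True


dataset = [{"input": "What is the refund policy?", "expected_contains": ["refund"]}]
failure = {
    "request_id": "req-123",
    "input": "Can I return an opened item?",
    "bad_output_snippet": "I don't know",
}
added = add_case(dataset, failure_to_case(failure))
added_again = add_case(dataset, failure_to_case(failure))
```

De-duplicating on the exact input string is the simplest possible rule; near-duplicate detection would need a similarity measure.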
Section 5: Common Mistakes
- Evaluating only happy paths: your golden dataset must include failures,
- Exact match on open-ended responses: use semantic similarity or LLM-as-judge,
- No regression cases: every production bug should become a permanent eval case,
- Evaluating in production only: offline evals are your deploy gate,
- Ignoring cost in evals: a correct answer that costs 10x is a regression.
Conclusion
Hybrid eval is the testing strategy production LLM systems need. Offline evals catch known regressions before deploy. Online evals catch unknown failures in the wild. Together, they close the quality loop.
Start with 20 golden cases and a CI script. Add online sampling next week. Expand the dataset every time production surprises you.