Offline + Online Eval: The Hybrid Testing Strategy for Production LLM Systems

Introduction

Traditional software testing assumes deterministic behavior: given input X, expect output Y. LLM systems are probabilistic: given input X, expect an output in the neighborhood of Y, and that "neighborhood" is hard to define.

48% of organizations running agents in production do not run offline evaluations. 63% skip online monitoring. The result: prompt changes ship without regression detection, model upgrades break production silently, and quality degrades over weeks before anyone notices.

The industry standard for 2026 is hybrid evaluation: offline golden datasets in CI/CD, combined with online monitoring of live traffic. Neither alone is sufficient.

Section 1: Offline Evals (Golden Datasets)

Offline evals run against a fixed dataset of inputs with expected outputs or quality criteria. They run in CI/CD on every change.

Building a golden dataset

Start with 20–50 cases covering:

Happy path: typical queries your system handles well,
Edge cases: ambiguous inputs, missing context, multi-step requests,
Adversarial cases: prompt injection attempts, out-of-scope requests,
Regression cases: bugs you have fixed in production (never delete these).

For each case, define:

{
  "input": "What is the refund policy for orders over $500?",
  "expected_contains": ["30-day", "refund"],
  "expected_not_contains": ["I don't know"],
  "max_latency_ms": 3000,
  "max_cost_usd": 0.05
}
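A case in this shape can be scored with a few lines of code. The following is a minimal sketch, not a definitive implementation: `check_case` and `CaseResult` are hypothetical names, case-insensitive substring matching is an assumption, and the caller is assumed to have measured latency and cost for the response.

```python
from dataclasses import dataclass, field


@dataclass
class CaseResult:
    passed: bool
    failures: list = field(default_factory=list)


def check_case(case, output, latency_ms, cost_usd):
    """Score one response against one golden case and collect every violated rule."""
    failures = []
    text = output.lower()  # assumption: matching is case-insensitive

    for needle in case.get("expected_contains", []):
        if needle.lower() not in text:
            failures.append(f"missing expected text: {needle!r}")

    for needle in case.get("expected_not_contains", []):
        if needle.lower() in text:
            failures.append(f"found forbidden text: {needle!r}")

    # Budgets are optional; a case without them is only checked on content.
    if "max_latency_ms" in case and latency_ms > case["max_latency_ms"]:
        failures.append(f"latency {latency_ms:.0f} ms over budget")
    if "max_cost_usd" in case and cost_usd > case["max_cost_usd"]:
        failures.append(f"cost ${cost_usd:.4f} over budget")

    return CaseResult(passed=not failures, failures=failures)
```

Returning the full list of failures, rather than stopping at the first one, makes a failed CI run easier to diagnose.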

Evaluation criteria

Not every eval needs an exact match. Use the right criterion per case:

Criterion	Use when
Exact match	Classification, routing, structured output
Contains / not-contains	Open-ended responses with key facts
LLM-as-judge	Complex quality assessment (use a stronger model to grade)
Semantic similarity	Paraphrased correct answers (embedding cosine similarity > 0.9)
Tool call verification	Agent called the right tool with the right arguments
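Two of these criteria are easy to get subtly wrong, so here is a rough sketch of each. It assumes the embeddings have already been computed by whatever embedding model you use (none is called here), and that a tool call is recorded as a dict with hypothetical `name` and `args` keys. The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantically_similar(output_embedding, reference_embedding, threshold=0.9):
    """Pass when the two embeddings are closer than the threshold from the table."""
    return cosine_similarity(output_embedding, reference_embedding) > threshold


def tool_call_matches(actual, expected):
    """Pass when the tool name matches and every expected argument is present.

    Extra arguments in `actual` are tolerated; that leniency is an assumption.
    """
    if actual.get("name") != expected["name"]:
        return False
    actual_args = actual.get("args", {})
    return all(
        key in actual_args and actual_args[key] == value
        for key, value in expected.get("args", {}).items()
    )
```

The 0.9 threshold depends on the embedding model, so it is worth calibrating against a handful of known-good and known-bad paraphrases before trusting it as a gate.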

CI/CD integration

Run offline evals on:

every prompt change,
every model version change,
every tool schema change,
every retrieval index update.

Block the merge if the pass rate drops below a threshold (typically 85–95%, depending on maturity).
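The gate itself can be very small. This sketch assumes the eval runner produces one boolean per case; `gate` is a hypothetical name and the 0.90 default is simply a value inside the range quoted above.

```python
def pass_rate(results):
    """Fraction of cases that passed; an empty run counts as 0.0."""
    return sum(1 for r in results if r) / len(results) if results else 0.0


def gate(results, threshold=0.90):
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    rate = pass_rate(results)
    ok = rate >= threshold
    status = "PASS" if ok else "FAIL"
    print(f"{status}: pass rate {rate:.1%} (threshold {threshold:.0%})")
    return 0 if ok else 1
```

A CI step would end with `sys.exit(gate(results))`, so that a non-zero exit status fails the job. Treating an empty result list as a failure guards against a misconfigured run that silently evaluates nothing.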

Section 2: Online Evals (Live Monitoring)

Offline evals cannot cover the infinite input space. Online evals sample live traffic and assess quality continuously.

What to monitor

Response quality score: LLM-as-judge or human review on sampled responses,
Task completion rate: whether the agent achieved the user's goal,
Tool call accuracy: whether the agent called the right tools,
Latency and cost: p50/p95 per request, cost per task,
Error rate: failures, timeouts, guardrail triggers,
User feedback: thumbs up/down, escalation to human.

Sampling strategy

Evaluating every request is too expensive and too slow. Instead, sample:

5–10% of traffic for automated quality scoring,
100% of failures and escalations for root cause analysis,
100% of high-cost requests (agent loops, large context).
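One way to implement these rules is a single decision function called per request. The sketch below is illustrative: `should_evaluate` is a hypothetical name, the `high_cost_usd` cutoff is an assumed value, and hashing the request ID (instead of calling a random number generator) is a design choice that makes the decision reproducible.

```python
import hashlib


def should_evaluate(request_id, failed, escalated, cost_usd,
                    sample_rate=0.05, high_cost_usd=0.50):
    """Decide whether a request is sent to online evaluation."""
    # Failures and escalations are always kept for root cause analysis.
    if failed or escalated:
        return True
    # High-cost requests (agent loops, large context) are always kept.
    if cost_usd >= high_cost_usd:
        return True
    # Everything else: map the request ID to a stable value in [0, 1)
    # and keep it if that value falls under the sample rate.
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the bucket depends only on the request ID, replaying the same traffic selects the same requests, which helps when comparing two judge prompts on an identical sample.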

Alerting thresholds

Metric	Warning	Critical
Quality score drop	> 5% from baseline	> 15% from baseline
Task completion rate	< 85%	< 70%
Cost per task increase	> 20% from baseline	> 50% from baseline
Error rate	> 2%	> 5%
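The table translates directly into code. In this sketch the metric names (`quality`, `task_completion`, `cost_per_task`, `error_rate`) are hypothetical, and "from baseline" is interpreted as a relative change, which is an assumption about the table's intent.

```python
def severity(value, warning, critical):
    """Grade a 'higher is worse' value against two thresholds."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def evaluate_alerts(current, baseline):
    """Return an alert level per metric using the thresholds from the table."""
    quality_drop = (baseline["quality"] - current["quality"]) / baseline["quality"]
    cost_increase = (
        (current["cost_per_task"] - baseline["cost_per_task"])
        / baseline["cost_per_task"]
    )
    # Completion rate is 'lower is worse': a rate below 85% is the same
    # as a shortfall above 15%, and below 70% is a shortfall above 30%.
    shortfall = 1.0 - current["task_completion"]
    return {
        "quality": severity(quality_drop, 0.05, 0.15),
        "task_completion": severity(shortfall, 0.15, 0.30),
        "cost": severity(cost_increase, 0.20, 0.50),
        "error_rate": severity(current["error_rate"], 0.02, 0.05),
    }
```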

Section 3: The Hybrid Strategy

Offline and online evals serve different purposes:

Dimension	Offline	Online
Coverage	Fixed, known cases	Infinite, real-world inputs
Speed	Fast (CI/CD)	Continuous
Catches	Regressions on known cases	Drift, new failure modes
Cost	Low (batch runs)	Moderate (sampling)
Blocks deploys	Yes	No (alerts only)

The workflow

Develop: engineer changes prompt/tool/model,
Offline eval: CI runs golden dataset → pass/fail gate,
Deploy: canary release to 5% of traffic,
Online eval: monitor quality/cost/latency on canary for 24–48 hours,
Promote or rollback: based on online metrics vs baseline,
Feed back: production failures become new offline eval cases.
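The promote-or-rollback step can be expressed as a comparison of canary metrics against the baseline. This is a sketch under stated assumptions: the metric keys are hypothetical, the quality and cost tolerances reuse the warning thresholds from the alerting table, and the latency tolerance is an invented placeholder.

```python
def canary_decision(canary, baseline,
                    max_quality_drop=0.05,
                    max_cost_increase=0.20,
                    max_latency_increase=0.20):
    """Return ("promote", []) or ("rollback", [names of the failing metrics])."""
    reasons = []
    if canary["quality"] < baseline["quality"] * (1 - max_quality_drop):
        reasons.append("quality")
    if canary["cost_per_task"] > baseline["cost_per_task"] * (1 + max_cost_increase):
        reasons.append("cost")
    if canary["p95_latency_ms"] > baseline["p95_latency_ms"] * (1 + max_latency_increase):
        reasons.append("latency")
    return ("rollback", reasons) if reasons else ("promote", [])
```

With only 5% of traffic on the canary, sample sizes can be small, so a real implementation would likely also require a minimum number of scored requests before trusting either outcome.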

Section 4: Implementing Eval Infrastructure

Minimum viable eval stack

Golden dataset: JSON file in your repo (version-controlled),
Eval runner: script that calls your LLM pipeline and scores results,
CI integration: GitHub Action that runs evals on PR,
Online sampling: middleware that logs 5% of requests with quality scores,
Dashboard: weekly review of offline pass rate + online quality trends.
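The eval runner in this stack can start as one function. The sketch below assumes your pipeline is callable as `pipeline(input_text) -> str` and that scoring is delegated to a `scorer(case, output) -> bool` callable; both signatures, and the names `load_cases` and `run_suite`, are hypothetical.

```python
import json
import time


def load_cases(path):
    """Read the golden dataset: a JSON array of case objects."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def run_suite(cases, pipeline, scorer):
    """Run every case through the pipeline and record pass/fail and latency."""
    results = []
    for case in cases:
        start = time.perf_counter()
        try:
            output = pipeline(case["input"])
            error = None
        except Exception as exc:
            # A crash in the pipeline fails the case instead of aborting the run.
            output, error = "", repr(exc)
        latency_ms = (time.perf_counter() - start) * 1000
        results.append({
            "input": case["input"],
            "passed": error is None and bool(scorer(case, output)),
            "latency_ms": latency_ms,
            "error": error,
        })
    return results
```

Keeping the scorer pluggable lets the same runner serve exact-match, contains, and LLM-as-judge cases without changes.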

Mature eval stack

Dedicated eval platform (LangSmith, Braintrust, or custom),
Automated LLM-as-judge with calibrated scoring,
A/B testing framework for prompt/model variants,
Automatic golden dataset expansion from production failures.
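Dataset expansion can begin semi-automatically. In this sketch a logged production failure becomes a regression case once a reviewer supplies the expected text; the `failure` record shape, the `tags` and `source_request_id` fields, and the function name are all hypothetical.

```python
def add_regression_case(dataset, failure, expected_contains, expected_not_contains=()):
    """Append a production failure to the golden dataset.

    Returns False without changing the dataset if the same input is already covered.
    """
    if any(case["input"] == failure["input"] for case in dataset):
        return False
    dataset.append({
        "input": failure["input"],
        "expected_contains": list(expected_contains),
        "expected_not_contains": list(expected_not_contains),
        "tags": ["regression"],  # marks the case as one that is never deleted
        "source_request_id": failure.get("request_id"),
    })
    return True
```

Deduplicating on the exact input string is the simplest policy; near-duplicate inputs would still be added, which may or may not be what you want.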

Section 5: Common Mistakes

Evaluating only happy paths: your golden dataset must include failures,
Exact match on open-ended responses: use semantic similarity or LLM-as-judge,
No regression cases: every production bug should become a permanent eval case,
Evaluating in production only: offline evals are your deploy gate,
Ignoring cost in evals: a correct answer that costs 10x is a regression.

Conclusion

Hybrid eval is the testing strategy production LLM systems need. Offline evals catch known regressions before deploy. Online evals catch unknown failures in the wild. Together, they close the quality loop.

Start with 20 golden cases and a CI script. Add online sampling next week. Expand the dataset every time production surprises you.