2026-05-29 5 min read Tanuj Garg

Offline + Online Eval: The Hybrid Testing Strategy for Production LLM Systems


Introduction

Traditional software testing assumes deterministic behavior: given input X, expect output Y. LLM systems are probabilistic: given input X, expect an output in the neighborhood of Y, and that "neighborhood" is hard to define.

48% of organizations running agents in production do not run offline evaluations. 63% skip online monitoring. The result: prompt changes ship without regression detection, model upgrades break production silently, and quality degrades over weeks before anyone notices.

The industry standard for 2026 is hybrid evaluation: offline golden datasets in CI/CD, combined with online monitoring of live traffic. Neither alone is sufficient.


Section 1: Offline Evals (Golden Datasets)

Offline evals run against a fixed dataset of inputs with expected outputs or quality criteria. They run in CI/CD on every change.

Building a golden dataset

Start with 20–50 cases covering:

  • Happy path: typical queries your system handles well,
  • Edge cases: ambiguous inputs, missing context, multi-step requests,
  • Adversarial cases: prompt injection attempts, out-of-scope requests,
  • Regression cases: bugs you have fixed in production (never delete these).

For each case, define:

{
  "input": "What is the refund policy for orders over $500?",
  "expected_contains": ["30-day", "refund"],
  "expected_not_contains": ["I don't know"],
  "max_latency_ms": 3000,
  "max_cost_usd": 0.05
}
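A case in this shape can be checked with a few lines of code. The sketch below is one possible checker, not a reference implementation: the `EvalCase` and `check_case` names are hypothetical, and case-insensitive substring matching is an assumption you may want to tighten.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    """One golden-dataset entry (field names mirror the JSON above)."""
    input: str
    expected_contains: list[str] = field(default_factory=list)
    expected_not_contains: list[str] = field(default_factory=list)
    max_latency_ms: float = float("inf")
    max_cost_usd: float = float("inf")


def check_case(case: EvalCase, output: str, latency_ms: float, cost_usd: float) -> list[str]:
    """Return a list of failure reasons; an empty list means the case passed."""
    failures = []
    text = output.lower()  # assumption: matching is case-insensitive
    for needle in case.expected_contains:
        if needle.lower() not in text:
            failures.append(f"missing required text: {needle!r}")
    for needle in case.expected_not_contains:
        if needle.lower() in text:
            failures.append(f"contains forbidden text: {needle!r}")
    if latency_ms > case.max_latency_ms:
        failures.append(f"latency {latency_ms} ms exceeds {case.max_latency_ms} ms")
    if cost_usd > case.max_cost_usd:
        failures.append(f"cost ${cost_usd} exceeds ${case.max_cost_usd}")
    return failures
```

Returning reasons rather than a bare boolean makes a failed CI run easier to debug, because the log says which criterion broke.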

Evaluation criteria

Not every eval needs exact match. Use the right criterion per case:

| Criterion | Use when |
| --- | --- |
| Exact match | Classification, routing, structured output |
| Contains / not-contains | Open-ended responses with key facts |
| LLM-as-judge | Complex quality assessment (use a stronger model to grade) |
| Semantic similarity | Paraphrased correct answers (cosine > 0.9) |
| Tool call verification | Agent called the right tool with right args |
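Two of these criteria are easy to sketch without any external dependency. The example below assumes embeddings are computed elsewhere and passed in as plain lists of floats, and that a tool call is represented as a dict with `name` and `args` keys. Both representations and all function names are illustrative assumptions, not a particular framework's API.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def semantically_similar(output_emb: list[float], reference_emb: list[float],
                         threshold: float = 0.9) -> bool:
    """Semantic-similarity criterion: pass when cosine exceeds the threshold."""
    return cosine_similarity(output_emb, reference_emb) > threshold


def tool_call_matches(actual: dict, expected: dict) -> bool:
    """Tool-call criterion: same tool name and exactly the same arguments."""
    return (actual.get("name") == expected.get("name")
            and actual.get("args") == expected.get("args"))
```

The 0.9 threshold comes from the table above; the right value depends on the embedding model, so treat it as a starting point to calibrate against labeled examples.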

CI/CD integration

Run offline evals on:

  • every prompt change,
  • every model version change,
  • every tool schema change,
  • every retrieval index update.

Block the merge if the pass rate drops below a threshold (typically 85–95%, depending on maturity).
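The merge gate itself can be a very small function. This sketch assumes the eval runner produces one boolean per golden case; the `gate` name and the 0.90 default are illustrative, and failing on an empty result set is a deliberate choice so that a misconfigured run cannot pass silently.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of golden cases that passed (0.0 for an empty run)."""
    if not results:
        return 0.0
    return sum(1 for passed in results if passed) / len(results)


def gate(results: list[bool], threshold: float = 0.90) -> int:
    """Return a process exit code: 0 lets the merge through, 1 blocks it."""
    rate = pass_rate(results)
    print(f"pass rate {rate:.1%} (threshold {threshold:.0%})")
    return 0 if rate >= threshold else 1


# In a CI script, end with something like:
#     raise SystemExit(gate(results))
# so that a non-zero exit code fails the job.
```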


Section 2: Online Evals (Live Monitoring)

Offline evals cannot cover the infinite input space. Online evals sample live traffic and assess quality continuously.

What to monitor

  • Response quality score: LLM-as-judge or human review on sampled responses,
  • Task completion rate: whether the agent achieved the user's goal,
  • Tool call accuracy: whether the agent called the right tools,
  • Latency and cost: p50/p95 per request, cost per task,
  • Error rate: failures, timeouts, guardrail triggers,
  • User feedback: thumbs up/down, escalation to human.

Sampling strategy

Evaluating every request is too expensive and adds latency, so sample:

  • 5–10% of traffic for automated quality scoring,
  • 100% of failures and escalations for root cause analysis,
  • 100% of high-cost requests (agent loops, large context).
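The sampling rules above translate into a short decision function. This is a sketch under assumptions: the request is a dict with `failed`, `escalated`, and `cost_usd` keys, and the $0.50 cutoff for "high-cost" is a hypothetical placeholder you would set from your own cost distribution.

```python
import random


def should_evaluate(request: dict, sample_rate: float = 0.05,
                    cost_threshold_usd: float = 0.50, rng=random) -> bool:
    """Decide whether a live request is sent to the online evaluator."""
    # Always evaluate failures and escalations for root cause analysis.
    if request.get("failed") or request.get("escalated"):
        return True
    # Always evaluate high-cost requests (agent loops, large context).
    if request.get("cost_usd", 0.0) >= cost_threshold_usd:
        return True
    # Otherwise evaluate a random fraction of traffic.
    return rng.random() < sample_rate
```

Passing `rng` explicitly (for example `random.Random(42)`) keeps the sampling reproducible in tests.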

Alerting thresholds

| Metric | Warning | Critical |
| --- | --- | --- |
| Quality score drop | > 5% from baseline | > 15% from baseline |
| Task completion rate | < 85% | < 70% |
| Cost per task increase | > 20% from baseline | > 50% from baseline |
| Error rate | > 2% | > 5% |
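These thresholds can be encoded directly. The sketch below assumes metrics arrive as dicts with `quality`, `task_completion`, `cost_per_task`, and `error_rate` keys (hypothetical names) and returns a severity per metric; wiring the result to a pager or chat channel is left out.

```python
def classify(value: float, warning: float, critical: float,
             higher_is_worse: bool = True) -> str:
    """Map a metric value to 'ok', 'warning', or 'critical'."""
    if not higher_is_worse:
        # Flip signs so the same comparisons work for "lower is worse" metrics.
        value, warning, critical = -value, -warning, -critical
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def evaluate_alerts(metrics: dict, baseline: dict) -> dict:
    """Apply the thresholds from the table to current metrics."""
    quality_drop = (baseline["quality"] - metrics["quality"]) / baseline["quality"]
    cost_increase = ((metrics["cost_per_task"] - baseline["cost_per_task"])
                     / baseline["cost_per_task"])
    return {
        "quality": classify(quality_drop, 0.05, 0.15),
        "task_completion": classify(metrics["task_completion"], 0.85, 0.70,
                                    higher_is_worse=False),
        "cost": classify(cost_increase, 0.20, 0.50),
        "error_rate": classify(metrics["error_rate"], 0.02, 0.05),
    }
```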

Section 3: The Hybrid Strategy

Offline and online evals serve different purposes:

| Dimension | Offline | Online |
| --- | --- | --- |
| Coverage | Fixed, known cases | Infinite, real-world inputs |
| Speed | Fast (CI/CD) | Continuous |
| Catches | Regressions on known cases | Drift, new failure modes |
| Cost | Low (batch runs) | Moderate (sampling) |
| Blocks deploys | Yes | No (alerts only) |

The workflow

  1. Develop: engineer changes prompt/tool/model,
  2. Offline eval: CI runs golden dataset → pass/fail gate,
  3. Deploy: canary release to 5% of traffic,
  4. Online eval: monitor quality/cost/latency on canary for 24–48 hours,
  5. Promote or rollback: based on online metrics vs baseline,
  6. Feed back: production failures become new offline eval cases.
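Step 5 is the one most often left to gut feeling, so it helps to write the rule down. The sketch below compares canary metrics against the baseline and returns a decision with reasons. The metric keys and the tolerance defaults are assumptions for illustration; choose tolerances that match your own alerting thresholds.

```python
def relative_increase(current: float, baseline: float) -> float:
    """Fractional change of current relative to baseline (0.10 means +10%)."""
    return (current - baseline) / baseline


def canary_decision(canary: dict, baseline: dict,
                    max_quality_drop: float = 0.05,
                    max_cost_increase: float = 0.20,
                    max_latency_increase: float = 0.20) -> tuple[str, list[str]]:
    """Return ('promote' or 'rollback', reasons) after the canary window."""
    reasons = []
    quality_drop = -relative_increase(canary["quality"], baseline["quality"])
    if quality_drop > max_quality_drop:
        reasons.append(f"quality dropped {quality_drop:.1%}")
    cost_up = relative_increase(canary["cost_per_task"], baseline["cost_per_task"])
    if cost_up > max_cost_increase:
        reasons.append(f"cost per task rose {cost_up:.1%}")
    latency_up = relative_increase(canary["p95_latency_ms"], baseline["p95_latency_ms"])
    if latency_up > max_latency_increase:
        reasons.append(f"p95 latency rose {latency_up:.1%}")
    return ("rollback" if reasons else "promote", reasons)
```

Any single breach triggers a rollback here; a more forgiving policy could require sustained breaches over several measurement windows.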

Section 4: Implementing Eval Infrastructure

Minimum viable eval stack

  • Golden dataset: JSON file in your repo (version-controlled),
  • Eval runner: script that calls your LLM pipeline and scores results,
  • CI integration: GitHub Action that runs evals on PR,
  • Online sampling: middleware that logs 5% of requests with quality scores,
  • Dashboard: weekly review of offline pass rate + online quality trends.

Mature eval stack

  • Dedicated eval platform (LangSmith, Braintrust, or custom),
  • Automated LLM-as-judge with calibrated scoring,
  • A/B testing framework for prompt/model variants,
  • Automatic golden dataset expansion from production failures.
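Golden dataset expansion does not need a platform to get started. As a rough sketch, assuming the dataset is a JSON array of cases in a file in your repo, a helper like the hypothetical `add_regression_case` below appends a production failure as a new case and skips inputs that are already covered.

```python
import json
from pathlib import Path


def add_regression_case(dataset_path: str, failure: dict) -> bool:
    """Append a production failure to the golden dataset.

    Returns True if a case was added, False if the input already exists.
    """
    path = Path(dataset_path)
    cases = json.loads(path.read_text(encoding="utf-8")) if path.exists() else []
    # Skip duplicates so repeated incidents do not bloat the dataset.
    if any(case["input"] == failure["input"] for case in cases):
        return False
    cases.append({
        "input": failure["input"],
        "expected_contains": failure.get("expected_contains", []),
        "expected_not_contains": failure.get("expected_not_contains", []),
        "tags": ["regression"],  # marks the case as one that is never deleted
    })
    path.write_text(json.dumps(cases, indent=2), encoding="utf-8")
    return True
```

The expected values still need a human to write or confirm them; automating that step without review risks encoding the bug as the expected behavior.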

Section 5: Common Mistakes

  • Evaluating only happy paths: your golden dataset must include failures,
  • Exact match on open-ended responses: use semantic similarity or LLM-as-judge,
  • No regression cases: every production bug should become a permanent eval case,
  • Evaluating in production only: offline evals are your deploy gate,
  • Ignoring cost in evals: a correct answer that costs 10x is a regression.

Conclusion

Hybrid eval is the testing strategy production LLM systems need. Offline evals catch known regressions before deploy. Online evals catch unknown failures in the wild. Together, they close the quality loop.

Start with 20 golden cases and a CI script. Add online sampling next week. Expand the dataset every time production surprises you.
