Layer 07 — Evaluation | Orion Research

Evaluation is the layer that determines whether the rest of the stack is allowed to ship. It is also the layer most teams under-invest in, because it does not produce a working demo. The harness is not a thing you bolt on; it is the contract that makes "the system works" mean something. Without it, every model bump is a roll of the dice and every customer escalation is a debate about feelings.

What this layer covers

Test sets, co-authored with the customer's domain experts
Scoring rubrics matched to the domain
The three default scores: faithfulness, coverage, refusal correctness
LLM-as-judge, human spot-checks, calibration
Golden sets: the small, high-quality "if this regresses, stop" cases
CI integration, model-bump triggers, regression detection

The harness as contract

Every Orion AI engagement starts with a co-authored success criteria document. One page. It lists the tasks the system must handle, the inputs it must handle them on, and the scores it must clear to be considered ready to ship. The harness reports against this document, and the report is the only definition of "the system works" that we and the customer share.

If we cannot agree on a measurable definition before the engagement starts, the engagement is not ready to start. We will say so on the first call. See Principle 01.
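As a rough illustration of what "the harness reports against this document" can look like in code, the sketch below treats each line of the success criteria document as a score name plus a floor. The class name, function name, and threshold values are hypothetical; real thresholds come from the co-authored document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One line of the success criteria document: a score and its floor."""
    score_name: str
    minimum: float  # hypothetical threshold, agreed with the customer


def unmet_criteria(criteria: list[Criterion], report: dict[str, float]) -> list[str]:
    """Return the names of criteria the harness report fails to clear.

    A score missing from the report counts as unmet: an unmeasured
    criterion cannot be claimed as passing.
    """
    failures = []
    for criterion in criteria:
        value = report.get(criterion.score_name)
        if value is None or value < criterion.minimum:
            failures.append(criterion.score_name)
    return failures


# Illustrative values only.
criteria = [
    Criterion("faithfulness", 0.95),
    Criterion("coverage", 0.90),
    Criterion("refusal_correctness", 0.90),
]
report = {"faithfulness": 0.97, "coverage": 0.88}
print(unmet_criteria(criteria, report))  # ['coverage', 'refusal_correctness']
```

Treating a missing score as a failure is a design choice in this sketch: it keeps "we did not measure it" from reading as "it passed".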

The three default scores

Every engagement gets the same three scores by default, plus domain-specific scores as the customer's success criteria require.

Faithfulness

Does the answer follow from the retrieved context, or did the model fill in plausible-sounding text from training? Scored case-by-case by comparing the answer against the chunks retrieved. A high faithfulness score means the model is using the corpus; a low score means it is hallucinating fluently.

Coverage

Did the retrieval layer surface the chunks that contain the answer? Scored against a labelled set where the expected source chunks are annotated. Low coverage means a retrieval problem; the fix is upstream of the model.
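One plausible way to compute coverage, assuming each labelled case carries the IDs of its annotated source chunks and the IDs retrieval actually returned: score each case as the fraction of annotated chunks that were surfaced, then average. The field names "expected" and "retrieved" are assumptions for this sketch, and a stricter all-or-nothing variant is equally defensible.

```python
def coverage_for_case(expected_chunk_ids: set[str], retrieved_chunk_ids: list[str]) -> float:
    """Fraction of the annotated source chunks that retrieval surfaced."""
    if not expected_chunk_ids:
        raise ValueError("coverage needs at least one annotated source chunk")
    found = expected_chunk_ids & set(retrieved_chunk_ids)
    return len(found) / len(expected_chunk_ids)


def coverage_score(cases: list[dict]) -> float:
    """Mean per-case coverage over the labelled set."""
    if not cases:
        raise ValueError("coverage needs at least one labelled case")
    per_case = [
        coverage_for_case(set(case["expected"]), case["retrieved"])
        for case in cases
    ]
    return sum(per_case) / len(per_case)


labelled = [
    {"expected": ["a", "b"], "retrieved": ["a", "x"]},  # one of two found
    {"expected": ["c"], "retrieved": ["c", "d"]},       # fully covered
]
print(coverage_score(labelled))  # 0.75
```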

Refusal correctness

When the answer is not in the corpus, does the system say so, or does it invent a confident-sounding answer? Scored on a deliberately constructed subset where the correct answer is "the corpus does not contain this." Refusal is a first-class behaviour, not a fallback: a system that scores 100% on the other two while refusing correctly only 40% of the time is dangerous in production.
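A minimal sketch of the refusal score, assuming every result has already been labelled with whether the question was in the corpus and whether the system refused. How a refusal is detected (a judge verdict, a structured output flag) is left open here, and the field names are hypothetical.

```python
def refusal_correctness(results: list[dict]) -> float:
    """Share of out-of-corpus cases where the system correctly refused.

    Each result is a dict with two boolean fields (names assumed):
      "in_corpus": whether the answer exists in the corpus
      "refused":   whether the system said the corpus lacks the answer
    Only out-of-corpus cases are scored.
    """
    out_of_corpus = [r for r in results if not r["in_corpus"]]
    if not out_of_corpus:
        raise ValueError("no out-of-corpus cases: refusal cannot be scored")
    correct = sum(1 for r in out_of_corpus if r["refused"])
    return correct / len(out_of_corpus)


results = (
    [{"in_corpus": True, "refused": False}] * 10   # ignored by this score
    + [{"in_corpus": False, "refused": True}] * 2
    + [{"in_corpus": False, "refused": False}] * 3
)
print(refusal_correctness(results))  # 0.4
```

Raising when the out-of-corpus subset is empty, rather than returning a perfect score, keeps a test set with no refusal cases from passing silently.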

Default reference architecture

Test set authoring

Co-authored with the customer's domain experts. Real questions from real users, mixed with deliberately constructed edge cases: out-of-corpus queries (for refusal), ambiguous queries, and queries that depend on permission scoping. The full test set grows to 500-2000 cases on a typical engagement. The golden set within it is smaller and human-reviewed.

Synthetic test sets — generated by an LLM from the corpus — miss the cases that actually matter. Useful as a supplement; never as the primary set.
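The authoring rules above can be checked mechanically. The sketch below assumes each case is tagged with a kind and a synthetic flag (both hypothetical fields), and reads "never as the primary set" as "synthetic cases must not be the majority", which is one interpretation rather than a stated rule.

```python
from collections import Counter
from dataclasses import dataclass

# Kinds of case named in the authoring guidance; the tag strings are assumed.
REQUIRED_KINDS = {"real_user", "out_of_corpus", "ambiguous", "permission_scoped"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    question: str
    kind: str
    synthetic: bool = False  # True if generated by an LLM from the corpus


def audit_test_set(cases: list[EvalCase]) -> list[str]:
    """Return human-readable problems with the test set; empty means none found."""
    problems = []
    kinds = Counter(case.kind for case in cases)
    for kind in sorted(REQUIRED_KINDS - set(kinds)):
        problems.append(f"no cases of kind {kind!r}")
    synthetic_count = sum(case.synthetic for case in cases)
    if cases and synthetic_count * 2 > len(cases):
        problems.append("synthetic cases outnumber authored cases")
    return problems


cases = [
    EvalCase("c1", "What is the notice period?", "real_user"),
    EvalCase("c2", "Who won the 1998 final?", "out_of_corpus"),
    EvalCase("c3", "What about the policy?", "ambiguous"),
]
print(audit_test_set(cases))  # ["no cases of kind 'permission_scoped'"]
```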

Scoring

Most scoring uses LLM-as-judge with a strict rubric. The judge is given the input, the system's output, the retrieved chunks (for faithfulness), and the rubric. It returns a score plus a short justification. Cheap, fast, scales to the full test set.
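The judge call described above can be sketched as follows. The prompt wording, the 1-5 scale, and the JSON reply format are all assumptions made for illustration; call_judge stands in for whatever client reaches the judge model and is passed in, so the sketch does not depend on any particular provider API.

```python
import json
from typing import Callable

# Hypothetical prompt; the real rubric and wording are engagement-specific.
JUDGE_TEMPLATE = """You are grading an answer against a rubric.

Rubric:
{rubric}

Question:
{question}

Retrieved chunks:
{chunks}

Answer to grade:
{answer}

Reply with JSON only: {{"score": <integer 1-5>, "justification": "<one sentence>"}}"""


def judge_case(call_judge: Callable[[str], str], rubric: str, question: str,
               answer: str, chunks: list[str]) -> dict:
    """Build the judge prompt, call the judge, and validate its verdict."""
    prompt = JUDGE_TEMPLATE.format(
        rubric=rubric,
        question=question,
        answer=answer,
        chunks="\n---\n".join(chunks),
    )
    raw = call_judge(prompt)
    verdict = json.loads(raw)  # raises ValueError if the judge did not return JSON
    score = verdict["score"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"judge returned an invalid score: {score!r}")
    return {"score": score, "justification": str(verdict["justification"])}


# Stand-in judge for a dry run; a real one would call the judge model.
def fake_judge(prompt: str) -> str:
    return '{"score": 4, "justification": "Supported by the first chunk."}'


result = judge_case(fake_judge, "Answer must follow from the chunks.",
                    "What is the notice period?", "30 days.",
                    ["Notice period is 30 days."])
print(result["score"])  # 4
```

Rejecting malformed verdicts loudly, rather than defaulting to a score, means a judge that drifts off-format shows up as a harness error instead of as quietly skewed results.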

Two disciplines around it. First, the judge has to be a different model family from the system under test whenever feasible — otherwise the judge inherits the same blind spots. Second, we validate the judge's calibration against a human-scored sample of ~100 cases at every harness update. If the judge disagrees with humans more than 10% of the time, the rubric is wrong; we fix the rubric.
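The calibration check reduces to a disagreement rate over the human-scored sample. The sketch below assumes judge and human verdicts have been reduced to pass/fail per case, which is a simplification; with graded scores, "disagreement" would need its own definition.

```python
def judge_disagreement_rate(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of sampled cases where the judge and the human verdict differ."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("judge and human labels must cover the same cases")
    if not judge_labels:
        raise ValueError("calibration needs a non-empty sample")
    disagreements = sum(j != h for j, h in zip(judge_labels, human_labels))
    return disagreements / len(judge_labels)


def rubric_needs_fixing(judge_labels: list[bool], human_labels: list[bool],
                        max_disagreement: float = 0.10) -> bool:
    """True when the judge disagrees with humans more often than the threshold."""
    return judge_disagreement_rate(judge_labels, human_labels) > max_disagreement


human = [True] * 100
judge = [False] * 11 + [True] * 89   # 11 disagreements out of 100
print(rubric_needs_fixing(judge, human))  # True
```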

Golden set

The 50-200 cases reviewed by a human, kept separately from the bulk test set. If the golden set regresses, the deploy stops — no exceptions, no overrides. This is what catches the cases LLM-as-judge missed.
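A sketch of the golden set gate, assuming per-case pass/fail results keyed by case ID for the last accepted baseline and for the current run. The names are hypothetical. The function deliberately takes no force or override argument.

```python
class GoldenSetRegression(Exception):
    """Raised when a golden case that used to pass no longer does."""


def enforce_golden_set(baseline: dict[str, bool], current: dict[str, bool]) -> None:
    """Raise if any golden case that passed at baseline now fails or is missing.

    There is intentionally no override parameter: the only way past this
    gate is to fix the regression.
    """
    regressed = sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not current.get(case_id, False)
    )
    if regressed:
        raise GoldenSetRegression(
            f"{len(regressed)} golden case(s) regressed: {', '.join(regressed)}"
        )


baseline = {"g1": True, "g2": True}
enforce_golden_set(baseline, {"g1": True, "g2": True})  # passes silently
```

A case missing from the current run is treated as a regression, so deleting a failing golden case does not unblock the deploy.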

Trigger schedule

The harness reruns on:

Every PR: CI gate, must pass before merge.
Every nightly run: full set against the current production config, with regression detection against the rolling baseline.
Every model version bump: full set against the new model, with a report to the customer before any production deploy.
Every parser, chunker, or embedding model change: same as a model bump.

The harness that does not auto-trigger is the harness that drifts into obsolescence. If the customer's team has to remember to run it, it is broken.
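Regression detection against a rolling baseline, as used in the nightly run, can be sketched as below. The window length and tolerance are illustrative assumptions, not recommended values, and the mean of recent runs is only one possible choice of baseline.

```python
from statistics import mean


def regressed_against_baseline(history: list[float], tonight: float,
                               window: int = 7, tolerance: float = 0.02) -> bool:
    """True if tonight's score falls below the rolling baseline by more than tolerance.

    history holds previous nightly scores, oldest first. The baseline is the
    mean of the most recent `window` scores.
    """
    if not history:
        return False  # first run: nothing to compare against yet
    baseline = mean(history[-window:])
    return tonight < baseline - tolerance


nightly_faithfulness = [0.90] * 7
print(regressed_against_baseline(nightly_faithfulness, 0.85))  # True
print(regressed_against_baseline(nightly_faithfulness, 0.89))  # False
```

The tolerance absorbs run-to-run noise from the judge; set too wide, it lets slow drift through, which is why the golden set gate runs alongside it.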

Build vs. buy at this layer

Default: build. This is the most important layer to own outright. The test set is the contract. The rubric encodes what "the system is working" means in the customer's domain. The harness re-runs on every change. The customer must own all of these — without them, the customer cannot defend the system to their own auditors.

Things to buy:

Test runners — pytest, vitest. Ordinary infrastructure.
Scoring frameworks — Promptfoo, Ragas. Useful as scaffolding; the rubrics and test cases are ours.
LLM-as-judge primitives — the judge model itself, via Bedrock or direct.

The test cases, the rubrics, the success criteria document, the judge prompts — all built, all owned by the customer at handoff.

The five mistakes we see

1. "We will write the test set after launch"

The test set is the success criteria. Writing it after launch means there were no success criteria at launch, which means the decision to launch was a judgement call. Sometimes that is fine; usually it is the start of an argument three months later about whether the system "should" be doing what it is doing. Write the test set first.

2. LLM-as-judge with no calibration

The judge scores everything; nobody checks the judge. The judge inherits the blind spots of the model under test, and the system silently passes its own evals. Always calibrate against a human-scored sample. Always.

3. Same model family for judge and system

Claude scoring Claude. The blind spots overlap. Use a different family when feasible; rotate periodically when it is not.

4. No refusal test

The test set only contains questions whose answers are in the corpus. Nothing measures what the system does when they are not: faithfulness looks high on in-corpus questions, there is no signal on out-of-corpus ones, and confabulation goes undetected. Always include the out-of-corpus subset.

5. Golden set with overrides

The golden set regresses, the engineer overrides the block "just for this deploy", and three sprints later the golden set's signal is dead. The whole point of the golden set is that it has no overrides.

How it connects to the other layers

Evaluation grades every other layer. A low coverage score points at Layer 04; a low faithfulness score points at Layer 03 or the prompt structure in Layer 05; a low refusal correctness score is usually a prompt structure problem; a low tool-call success rate points at Layer 06. The harness is where signals from every other layer surface in one place.
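The mapping from low scores to suspect layers can be written down directly. In the sketch below the layer assignments follow the paragraph above; the score keys, the function name, and the floor values are hypothetical.

```python
# Which layers to inspect first when a score is low, per the mapping above.
SUSPECT_LAYERS = {
    "coverage": ["Layer 04"],
    "faithfulness": ["Layer 03", "Layer 05 (prompt structure)"],
    "refusal_correctness": ["Layer 05 (prompt structure)"],
    "tool_call_success": ["Layer 06"],
}


def layers_to_inspect(scores: dict[str, float],
                      floors: dict[str, float]) -> dict[str, list[str]]:
    """Map each score that is below its floor to the layers it points at."""
    return {
        name: SUSPECT_LAYERS[name]
        for name, value in scores.items()
        if name in SUSPECT_LAYERS and name in floors and value < floors[name]
    }


# Illustrative floors only.
floors = {"coverage": 0.90, "faithfulness": 0.95, "refusal_correctness": 0.90}
scores = {"coverage": 0.82, "faithfulness": 0.97, "refusal_correctness": 0.60}
print(layers_to_inspect(scores, floors))
# {'coverage': ['Layer 04'], 'refusal_correctness': ['Layer 05 (prompt structure)']}
```

This is a starting point for triage, not a diagnosis: the source paragraph says a low refusal score is "usually" a prompt structure problem, so the mapping names where to look first.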

Without the harness, the observability layer only tells you the system has slowed or gotten expensive — not that it has gotten less correct.