Concepts & standards — the vocabulary we use

The shared vocabulary we use on engagements. Definitions are opinionated — the goal is precision, not consensus. Where the field has multiple uses of a term, we say what we mean by it. Where we have a strong position on how to apply it in production, we say that too.

Organized roughly by the layer of the Quantum Leap stack the concept belongs to, then alphabetically within each. Skim or search.

Data & retrieval

Chunking

The act of splitting a source document into retrievable units. The units are typically embedded individually and indexed alongside metadata (document ID, section anchor, byte range, corpus version). Naive chunking is fixed-token-window. Production chunking is layout-aware — by clause, by line item, by section — so a chunk is a semantically meaningful unit, not an arbitrary slice. See Layer 04 — Retrieval.
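
As a rough illustration of the difference, here is a minimal sketch of chunking by section rather than by token window. It assumes blank lines mark section boundaries, which stands in for a real layout parser; the names `Chunk` and `chunk_by_section` are hypothetical, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    document_id: str
    section_anchor: str
    byte_start: int      # inclusive offset into the UTF-8 encoded source
    byte_end: int        # exclusive offset
    corpus_version: str
    text: str


def chunk_by_section(document_id: str, source: str, corpus_version: str) -> list[Chunk]:
    """Split on blank lines so each chunk is one section (assumed layout rule)."""
    chunks = []
    offset = 0  # byte offset of the current section within the encoded source
    for index, section in enumerate(source.split("\n\n")):
        size = len(section.encode("utf-8"))
        if section.strip():  # skip empty sections but keep offsets accurate
            chunks.append(
                Chunk(document_id, f"section-{index}", offset, offset + size,
                      corpus_version, section)
            )
        offset += size + 2  # account for the two-byte "\n\n" separator
    return chunks
```

Each chunk carries the metadata listed above, so the byte range can be used to recover the exact source span later.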

Corpus

The body of source documents the AI system is grounded against. Distinct from "the training data" — a corpus is searched, not trained on. Every chunk in an index belongs to a corpus, and the corpus has a version (so a re-ingest after a chunker upgrade is traceable).

Embedding

A fixed-length numerical vector representation of a chunk of text (or image, audio, etc.), produced by an embedding model. The distance between two embeddings approximates semantic similarity of the underlying content. Embeddings are not portable between models — a Cohere embedding cannot be compared to a Titan embedding. Every embedding in production should carry the model ID + version that produced it.
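
One way to enforce the portability rule is to carry the model ID and version on the embedding itself and refuse to compare across models. The sketch below uses cosine similarity as the distance measure, which is an assumption (other measures are common); the `Embedding` type is hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Embedding:
    vector: tuple[float, ...]
    model_id: str
    model_version: str


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity, refusing to compare vectors from different models."""
    if (a.model_id, a.model_version) != (b.model_id, b.model_version):
        raise ValueError("embeddings from different models are not comparable")
    if len(a.vector) != len(b.vector):
        raise ValueError("embeddings have different dimensions")
    dot = sum(x * y for x, y in zip(a.vector, b.vector))
    norm_a = math.sqrt(sum(x * x for x in a.vector))
    norm_b = math.sqrt(sum(y * y for y in b.vector))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```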

Hybrid retrieval

Combining two or more retrieval methods (typically dense vector similarity + lexical / BM25) at query time, then fusing the rankings into one combined result. It typically outperforms pure-vector retrieval on corpora containing named entities, numbers, or proper nouns, which describes essentially every business corpus. RRF (Reciprocal Rank Fusion, k=60) is our default fusion method, and is roughly ten lines of code.

Lineage

The provable chain from a source document to a specific answer: document version → parser version → chunker version → embedding model version → index corpus version → retrieved chunks → answer. A production AI system without lineage cannot be audited. The lineage manifest is part of every deliverable.
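
A lineage manifest can be as simple as one record per answer that pins every version in the chain. The sketch below is one possible shape, not the format of any actual deliverable; the field names are assumptions derived from the chain described above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class LineageManifest:
    document_version: str
    parser_version: str
    chunker_version: str
    embedding_model_version: str
    corpus_version: str
    retrieved_chunk_ids: tuple[str, ...]
    answer_id: str

    def to_json(self) -> str:
        # Sorted keys keep the serialized manifest stable for diffing and hashing.
        return json.dumps(asdict(self), sort_keys=True)
```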

RAG (retrieval-augmented generation)

A pattern where the model receives, at inference time, context retrieved from an index — usually because the model alone does not know the answer (private data, recent data, large corpus). RAG is not a model. It is not a framework. It is a pattern: retrieve relevant context, ground the generation on it, cite. In production, RAG almost always means hybrid retrieval + grounding instructions + citation logging + a refusal pattern for when retrieval misses.

RRF (Reciprocal Rank Fusion)

A fusion method for combining multiple ranked retrieval results. Each chunk receives a score of 1 / (k + rank) from each retrieval method, summed across methods. k is a constant — 60 is canonical. Top-N by combined score wins. Simple, robust, very effective at combining dense + lexical signals.
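
The scoring rule above translates almost directly into code. This is a minimal sketch, assuming each retrieval method returns chunk IDs ordered best-first with ranks starting at 1; the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60, top_n: int = 10
) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs: score = sum of 1 / (k + rank) per method."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest combined score first; ties broken by chunk ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return fused[:top_n]


dense = ["a", "b", "c"]      # hypothetical dense-vector ranking
lexical = ["b", "d", "a"]    # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense, lexical])
```

A chunk that appears in both lists ("a" and "b" here) accumulates two terms and so outranks a chunk that only one method found.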

Models & generation

Fine-tuning

Adapting a foundation model to a specific domain by continued training on a curated dataset. Useful narrowly; over-recommended broadly. The most common mistake is to fine-tune before the retrieval layer is correct, then conclude the model is bad at the domain. Fine-tuning a model that gets the wrong context just memorises the wrong answer. Evaluate retrieval first, always.

Foundation model

A large model trained on broad data, used as the basis for many tasks without per-task training. Examples include Claude, GPT, Gemini, Llama, and Mistral. In production at Orion, "foundation model" means a model we call via API or via Bedrock; we are not in the business of training foundation models from scratch.

Hallucination

A model output that is fluent but not grounded in the source material — invented citations, made-up entities, plausible numbers that do not exist. Hallucination cannot be eliminated; it can be constrained. The mitigations are: retrieval grounding, citation-required generation, refusal-on-uncertainty, and an evaluation harness scoring faithfulness. We will not promise zero hallucination on any engagement. Anyone who does is lying.

LLM-as-judge

Using one model to score the output of another against a rubric. Useful for evaluating generative outputs at scale, when human scoring is too expensive to run continuously. Risk: the judge may share biases with the model under test, especially when both come from the same model family. Mitigations: explicit rubric, human spot-checks to validate the judge's calibration, and rotating judge model families when feasible.

Per-task model selection

Choosing the model per task based on capability, latency, and cost — not standardising on one model everywhere. A Claude Sonnet for reasoning, a Claude Haiku for classification, a Titan embedding for retrieval. The thin abstraction in the orchestration layer makes swapping cheap.

Orchestration & agents

Agent

An AI system that takes multi-step, tool-using actions toward a goal, with the model in the loop deciding which tool to call next. Distinct from a single-turn pipeline (retrieve → ground → answer). Most production AI is not agentic — most workflows are single-turn pipelines. We build agents when the workflow genuinely requires multi-step reasoning with tool use, not because the term is fashionable.

MCP (Model Context Protocol)

An open protocol for exposing tools, resources, and prompts to models in a model-agnostic way. Useful when we want a tool to be portable across engagements — a customer-data MCP server can be consumed by Claude on Bedrock, Claude direct, or any MCP-compatible client. We use MCP for tools we want to share across customers; native tool-use for engagement-specific tools.

Tool

A capability exposed to the model, often side-effecting: read a record, write a record, call an API, query a database. In production, every tool has a named owner, an audit log, and a clear boundary between reads that can run autonomously and side effects that need a human in the loop. Tools are how AI systems touch the business, and where most accountability concerns live.

Tool-use boundary

The explicit line between what the AI system is allowed to call autonomously and what requires a human review step. The boundary moves over time as confidence is earned, but it never moves without an evaluation harness signal. Production AI systems affecting real money or real records always have a human review boundary somewhere. See Guardrails.
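
In code, the boundary can be an explicit policy lookup that runs before any tool call is dispatched. The sketch below is illustrative only: the `ToolPolicy` registry, the `Decision` values, and the deny-by-default rule for unregistered tools are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    RUN = "run"
    NEEDS_HUMAN_REVIEW = "needs_human_review"
    DENY = "deny"


@dataclass(frozen=True)
class ToolPolicy:
    name: str
    owner: str          # the named owner accountable for this tool
    autonomous: bool    # True only for tools cleared to run without review


def decide(tool_name: str, registry: dict[str, ToolPolicy]) -> Decision:
    """Route a requested tool call: run it, queue it for review, or deny it."""
    policy = registry.get(tool_name)
    if policy is None:
        return Decision.DENY  # unregistered tools are never callable
    return Decision.RUN if policy.autonomous else Decision.NEEDS_HUMAN_REVIEW
```

Moving the boundary then becomes a reviewed change to one `autonomous` flag, which is easy to tie to an evaluation result.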

Evaluation

Coverage

One of the three default scores in an Orion evaluation harness. Measures: did the retrieval layer return the chunks that contain the answer? Distinct from faithfulness (did the model use them correctly). Low coverage means a retrieval problem; low faithfulness on high coverage means a generation problem.
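
A minimal sketch of a coverage score, assuming each test case lists the IDs of the chunks known to contain the answer and that a case counts as covered only when all of them were retrieved. That strictness is an assumption; a looser variant could accept any one of them.

```python
def coverage(cases: list[tuple[set[str], list[str]]]) -> float:
    """Fraction of cases where every answer-bearing chunk was retrieved.

    Each case is (gold_chunk_ids, retrieved_chunk_ids).
    """
    if not cases:
        raise ValueError("coverage is undefined for an empty test set")
    covered = sum(1 for gold, retrieved in cases if gold <= set(retrieved))
    return covered / len(cases)
```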

Evaluation harness

The collection of test sets, scoring rubrics, and runners that grade an AI system's output. The harness is the contract: the only definition of "the system is working" that we and the customer share. Re-runs on every model bump, parser change, chunker change, embedding model change. Owned by the customer, delivered as code. See Principle 01.

Faithfulness

One of the three default scores. Measures: does the answer follow from the retrieved context, or did the model fill in plausible-sounding text from training? Faithfulness is the score that catches hallucination in evaluation.

Golden set

A small, human-curated, high-quality test set used as the canonical "if this regresses, stop the deploy" benchmark. Smaller than the full test set; reviewed manually. Usually 50-200 cases co-authored with the customer's domain experts.

Refusal correctness

One of the three default scores. Measures: when the answer is not in the corpus, does the system say so — instead of inventing one? A system that scores perfectly on faithfulness and coverage but refuses correctly only 40% of the time is dangerous in production. Refusal is a first-class behaviour, not a fallback.
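
Refusal correctness can be scored over just the test cases whose answer is known to be absent from the corpus. The sketch below assumes each case is labeled with two booleans; how a refusal is detected in the system's output is left out and would need its own rule.

```python
def refusal_correctness(cases: list[tuple[bool, bool]]) -> float:
    """Fraction of unanswerable cases where the system refused.

    Each case is (answer_in_corpus, system_refused).
    """
    unanswerable = [refused for in_corpus, refused in cases if not in_corpus]
    if not unanswerable:
        raise ValueError("test set contains no unanswerable cases")
    return sum(unanswerable) / len(unanswerable)
```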

Regression detection

Running the harness on every change and flagging score drops above a threshold. The threshold is engagement-specific — for high-stakes domains it is 0%, for low-stakes it can be a few percentage points. Without regression detection, every model bump is a roll of the dice.
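
The check itself is small. This sketch assumes scores are fractions between 0 and 1 keyed by metric name, and that the threshold is expressed in the same units; the function name is hypothetical.

```python
def find_regressions(
    baseline: dict[str, float], current: dict[str, float], threshold: float = 0.0
) -> dict[str, float]:
    """Return the metrics whose score dropped by more than `threshold`.

    A threshold of 0.0 flags any drop at all, matching the high-stakes setting.
    """
    regressions = {}
    for metric, old_score in baseline.items():
        drop = old_score - current[metric]  # KeyError if a metric went missing
        if drop > threshold:
            regressions[metric] = drop
    return regressions
```

A deploy gate can then fail whenever the returned dictionary is non-empty.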

Governance & observability

Audit trail

The retained record of every model invocation: prompt, model ID + version, retrieved chunks, tool calls, final answer, user identity. Stored in a customer-managed KMS-encrypted store, retained per the compliance regime. Without an audit trail, no non-trivial incident involving the AI system can be properly reviewed.

Observability triad (AI version)

Despite the name, there are four AI-specific metrics every production system should surface alongside the usual latency/error metrics: token spend, refusal rate, retrieval hit rate, tool-call success rate. Together these catch ~80% of production AI incidents earlier than the generic infrastructure metrics would.
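
The metrics listed above can be derived from per-invocation records. The sketch below assumes a hypothetical `Invocation` record with the fields shown; in practice these would come from the audit trail or the tracing system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invocation:
    tokens: int
    refused: bool
    retrieval_hit: bool        # retrieval returned at least one usable chunk
    tool_calls: int
    tool_calls_succeeded: int


def summarize(invocations: list[Invocation]) -> dict:
    """Aggregate the AI-specific metrics over a window of invocations."""
    total = len(invocations)
    if total == 0:
        raise ValueError("no invocations in the window")
    tool_calls = sum(i.tool_calls for i in invocations)
    succeeded = sum(i.tool_calls_succeeded for i in invocations)
    return {
        "token_spend": sum(i.tokens for i in invocations),
        "refusal_rate": sum(i.refused for i in invocations) / total,
        "retrieval_hit_rate": sum(i.retrieval_hit for i in invocations) / total,
        # None when no tools were called, so the rate is not misread as 0%.
        "tool_call_success_rate": succeeded / tool_calls if tool_calls else None,
    }
```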

Prompt injection

An attack where adversarial text is smuggled into the model's context — often via a document the model retrieves — to override the system prompt or trigger unauthorized tool calls. Defense is layered: input sanitization, structured system prompts that treat retrieved content as data only, and tool-call whitelisting. The single most important rule: no tool call from text extracted from a retrieved document. Document content is data. Never an instruction source.

Refusal-on-uncertainty

A pattern where the model is explicitly instructed to say "I do not know" or "the corpus does not contain this" when retrieval quality or its own confidence is below a threshold. Better than inventing a plausible-sounding answer. Reduces hallucination visible to users; surfaces retrieval coverage gaps for the operate team.
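
The retrieval-quality half of this pattern can also be enforced in the pipeline, before the model is called at all. The sketch below is one such gate, assuming retrieval returns similarity scores where higher is better; the 0.5 threshold and the function names are placeholders, and the model's own confidence is not covered here.

```python
from typing import Callable

REFUSAL_MESSAGE = "The corpus does not contain this."


def answer_or_refuse(
    retrieval_scores: list[float],
    generate: Callable[[], str],
    min_score: float = 0.5,
) -> str:
    """Refuse when retrieval found nothing, or nothing scored well enough."""
    if not retrieval_scores or max(retrieval_scores) < min_score:
        return REFUSAL_MESSAGE
    return generate()  # only spend a model call when grounding is plausible
```

Logging each refusal together with the query gives the operate team the coverage-gap signal described above.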

Engagement-shape concepts

Build-vs-buy

The decision, at each layer of the stack, of whether to build the capability ourselves or buy it from a vendor. Orion's default test is: is this layer a differentiator for the business or a commodity the business needs to function? Build differentiators, buy commodities. See the full framework essay and the tooling catalog.

Graduate or kill

The two valid exits from a Quantum Labs spike: graduate the work to a longer build engagement (or hand it off for the customer's team to take to production), or kill the project with honest reasoning. The third — "renew the spike off momentum" — is the one we will not take.

Handoff

The end-state of every Orion engagement: the customer's team can operate, redeploy, debug, and extend the system without us. Tested by whether the team can tear down and redeploy the stack to a new account in an afternoon. If they cannot, it is not a handoff.

Spike

A two-week, fixed-price engagement to prove or disprove that a proposed AI build is tractable. Ends with a working end-to-end pipeline tested against an agreed success bar, plus a written recommendation on whether to graduate, hand off, or kill. The only entry point into a Quantum Labs engagement.

Vertical AI

AI engineering applied to a specific domain — extracting structure from a customer's contracts, building an agent that uses the customer's internal tools, evaluating a model on the customer's workflow. Distinct from horizontal AI (frontier model research, general-purpose tooling). Quantum Labs builds vertical AI; we use horizontal AI as a substrate.

Missing a term you expected? Send a note — the glossary grows as the engagements do, and we are happy to add definitions that customers ask for in writing.
