Data is the least glamorous layer in any AI engagement and the one most demos cheat on. The demo points at a curated set of clean documents; production hands you a SharePoint folder, a vendor PDF API, and a scanned-paper archive nobody owns. The work to make those usable — parsing, chunking, lineage, permissions — is most of what the engagement turns out to be.
What this layer covers
- Where the source of truth lives, and how we re-fetch it
- Parsing source documents into a structured representation
- Tracking lineage from raw bytes to indexed chunks
- Modelling row-level permissions so the retrieval layer can pre-filter
- Defining "fresh" — daily batch, streaming, once-and-done
- Storing raw + parsed + chunked forms with content-addressable hashes
The four questions that shape the design
1. Where is the source of truth?
SharePoint, S3, an EDR or DMS, a vendor API, a paper archive behind a flatbed scanner. The answer determines whether the ingest pipeline is a webhook-driven event source, a scheduled batch crawl, or a one-time backfill. We need write-once URIs we can re-fetch — not "the file Bob emailed us." If we cannot repeatedly resolve the same document to the same bytes, lineage is impossible and the index becomes unreproducible.
2. How are permissions modelled?
Most business document corpora have row-level access — analyst A can see documents 1-200, analyst B can see 50-300, the auditor can see everything. The retrieval layer must respect this. The permission model has to be carried as metadata on every chunk in the index, so retrieval can pre-filter before any model endpoint sees content the user is not allowed to see. See Layer 04 — Retrieval for why post-retrieval filtering is a compliance leak.
On day one, we settle: which identity attribute (group membership, role, department, security clearance) governs access, and how it's resolved from the orchestration request. If the customer cannot answer this question precisely, the engagement stops until they can — building anything else first is wasted work.
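A minimal sketch of the pre-filter idea, assuming group membership is the governing identity attribute. The `Chunk` type, `resolve_tags`, `prefilter`, and the shape of the request dict are all hypothetical names for illustration, not part of any particular retrieval product. The point it shows is that the tag comparison happens before any chunk is scored or returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    permission_tags: frozenset  # tags written onto the chunk at ingest time


def resolve_tags(request: dict) -> frozenset:
    """Resolve the governing identity attribute from the orchestration request.

    Assumption: the request carries a "groups" list. A real system would
    resolve this from the identity provider, not trust the caller.
    """
    return frozenset(request.get("groups", []))


def prefilter(chunks, user_tags):
    """Keep only chunks that share at least one tag with the caller."""
    return [c for c in chunks if c.permission_tags & user_tags]


chunks = [
    Chunk("c1", "Q3 audit findings", frozenset({"auditors"})),
    Chunk("c2", "Vendor contract summary", frozenset({"analysts", "auditors"})),
    Chunk("c3", "Board minutes", frozenset({"executives"})),
]

request = {"user": "analyst-a", "groups": ["analysts"]}
visible = prefilter(chunks, resolve_tags(request))
print([c.chunk_id for c in visible])
```

In a vector store the same check would normally be expressed as a metadata filter on the query itself, so that chunks the caller cannot see are never candidates in the first place.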
3. What does "fresh" mean?
Daily ingest, streaming, once-and-done? The right answer drives everything from the pipeline shape to the eval cadence:
- Once-and-done — a one-time historical corpus. Backfill in batch, then the pipeline is mostly maintenance. Most "process the last 10 years of contracts" engagements live here.
- Daily / scheduled — most production cases. New documents land overnight; the pipeline catches up by morning. Lambda + Step Functions, no need for streaming infrastructure.
- Streaming — when "fresh" means seconds, not hours. Rare. Usually means Kinesis or MSK, with all the operational cost that implies. We rarely recommend this and almost never as a starting point.
4. What gets kept alongside the raw bytes?
Always: the parsed form, the chunk manifest, and the lineage record (parser version, chunker version, embedding model version). Together they let us re-derive the index from raw bytes after any upstream change, without going back to the source.
Default reference architecture
Storage
S3 in the data account, KMS-encrypted with a customer-managed key. Versioned bucket. Object Lock on regulated workloads. Two prefix layouts side by side:
- raw/{corpus}/{content-hash}.{ext} — the original bytes, untouched.
- parsed/{corpus}/{content-hash}.{parser-version}.json — the parsed representation, addressable by both content hash and parser version.
Same content hash addresses both: parse the raw, store the parsed, re-derive cheaply if either side changes. When the parser upgrades, we generate a new {parser-version} suffix and leave the old one in place for any in-flight investigations.
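The two prefix layouts can be sketched as a pair of key-building functions. This assumes SHA-256 as the content hash, which the text does not specify. The function names are hypothetical.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hex digest of the raw bytes. SHA-256 is an assumption, not a requirement."""
    return hashlib.sha256(data).hexdigest()


def raw_key(corpus: str, data: bytes, ext: str) -> str:
    """Object key for the untouched original bytes."""
    return f"raw/{corpus}/{content_hash(data)}.{ext}"


def parsed_key(corpus: str, data: bytes, parser_version: str) -> str:
    """Object key for the parsed form, scoped by parser version."""
    return f"parsed/{corpus}/{content_hash(data)}.{parser_version}.json"


document = b"example contract bytes"
print(raw_key("contracts", document, "pdf"))
print(parsed_key("contracts", document, "v1"))
print(parsed_key("contracts", document, "v2"))  # old v1 object stays in place
```

Because the parser version is part of the key, a parser upgrade writes a new object next to the old one rather than overwriting it.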
Ingest pipeline
- Source emits an event (S3 event, webhook, scheduled crawl) → SQS queue in the data account.
- Ingest Lambda dedupes by content hash, writes the raw bytes to raw/, emits a parse event.
- Parser Lambda (or ECS task for large or slow parsers) reads the raw, parses, writes to parsed/.
- Chunker Lambda reads the parsed form, applies the corpus-specific chunking strategy, writes the chunk manifest (DynamoDB for small corpora; S3 for large) including document ID, section anchor, byte range, corpus version.
- Permission Lambda tags each chunk with the identity-tag metadata derived from the source system.
- Embedder Lambda batches chunks, calls Bedrock embeddings, writes vectors into the index alongside their metadata.
Every step logs the corpus version it was operating on, so a re-ingest after a chunker or parser upgrade is a clean state transition, not a reconciliation puzzle.
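The first step of the pipeline above, dedupe by content hash and then emit a parse event, can be sketched in memory. This is an illustration under stated assumptions, not a Lambda handler: the dict stands in for the raw/ prefix in S3, the list stands in for the downstream queue, and `IngestStep` and its fields are hypothetical names.

```python
import hashlib


class IngestStep:
    def __init__(self, corpus: str, corpus_version: str):
        self.corpus = corpus
        self.corpus_version = corpus_version
        self.raw_store = {}     # stand-in for the raw/ prefix in S3
        self.parse_events = []  # stand-in for the queue feeding the parser

    def handle(self, source_uri: str, data: bytes, ext: str) -> bool:
        """Store new bytes and emit a parse event; skip bytes already seen.

        Returns True if the document was new, False if it was a duplicate.
        """
        digest = hashlib.sha256(data).hexdigest()
        key = f"raw/{self.corpus}/{digest}.{ext}"
        if key in self.raw_store:
            return False  # same bytes already ingested: no write, no event
        self.raw_store[key] = data
        self.parse_events.append({
            "raw_key": key,
            "source_uri": source_uri,
            "corpus_version": self.corpus_version,  # logged on every step
        })
        return True


step = IngestStep("contracts", "2024-06")
first = step.handle("s3://example-bucket/a.pdf", b"same bytes", "pdf")
# The source re-emits the event after a metadata-only change: bytes are identical.
second = step.handle("s3://example-bucket/a.pdf", b"same bytes", "pdf")
third = step.handle("s3://example-bucket/b.pdf", b"other bytes", "pdf")
print(first, second, third, len(step.parse_events))
```

Carrying the corpus version in the event is what lets later steps log which version they were operating on.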
Parsing
The parser choice is corpus-specific:
- Scanned / paper-origin documents — AWS Textract. Strong table and form extraction, region-available, audit-friendly.
- Born-digital PDFs / structured documents — Unstructured.io or LlamaParse. Layout-aware, preserves section headings, list structure, table cell relationships.
- Office documents (.docx, .xlsx, .pptx) — native libraries (python-docx, openpyxl) or LibreOffice headless. Avoid converting to PDF first; you lose structure.
- HTML / Markdown — direct, with a layout-aware extractor (Trafilatura, Readability) for stripping chrome.
Mistral OCR and Azure Document Intelligence are alternatives we use when the customer already pays for them or when a specific benchmark favours them on the domain. The choice is settled by running both on a representative sample with a small accuracy rubric — not by reading the marketing page.
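One way to run that comparison is a small scoring harness over hand-checked reference text. The sketch below uses character-level similarity from Python's difflib as the rubric, which is an assumption. A real rubric would likely also score tables and headings separately. The parser outputs here are made-up strings, not results from any actual parser.

```python
from difflib import SequenceMatcher
from statistics import mean


def similarity(expected: str, actual: str) -> float:
    """Character-level similarity in [0, 1] between reference and parser output."""
    return SequenceMatcher(None, expected, actual).ratio()


def score_parser(reference: dict, outputs: dict) -> float:
    """Mean similarity across the sample; a missing output scores as empty text."""
    return mean(
        similarity(reference[doc_id], outputs.get(doc_id, ""))
        for doc_id in reference
    )


# Hand-transcribed ground truth for a representative sample (illustrative).
reference = {
    "doc-1": "Total due: 1,200 EUR",
    "doc-2": "Clause 4.2 Termination",
}
# Hypothetical outputs from two candidate parsers on the same sample.
parser_a = {"doc-1": "Total due: 1,200 EUR", "doc-2": "Clause 4.2 Termination"}
parser_b = {"doc-1": "Total due: 1.200 EUR", "doc-2": "Clause 42 Terminaton"}

scores = {
    "parser_a": score_parser(reference, parser_a),
    "parser_b": score_parser(reference, parser_b),
}
best = max(scores, key=scores.get)
print(scores, best)
```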
Lineage manifest
Every chunk in the index carries:
- Document ID + content hash + source URI
- Parser ID + parser version
- Chunker ID + chunker version
- Embedding model ID + version
- Permission tag set + the policy version that produced it
- Ingest timestamp
The lineage manifest is what lets us answer "where did this answer come from", "can the audit team prove the customer's data never left their account", and "if we upgrade the parser, what is the scope of the re-ingest". Without it, none of those questions are answerable.
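The fields above map naturally onto a per-chunk record. The sketch below is one possible shape, with hypothetical field names and placeholder values, plus a small query that answers the parser-upgrade question by listing the documents whose chunks were produced by an older parser version.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineageRecord:
    document_id: str
    content_hash: str
    source_uri: str
    parser_id: str
    parser_version: str
    chunker_id: str
    chunker_version: str
    embedding_model_id: str
    embedding_model_version: str
    permission_tags: tuple
    policy_version: str
    ingested_at: str  # ISO 8601 timestamp


def reingest_scope(records, parser_id: str, new_version: str) -> list:
    """Document IDs with chunks produced by parser_id at any other version."""
    return sorted({
        r.document_id
        for r in records
        if r.parser_id == parser_id and r.parser_version != new_version
    })


def make_record(document_id: str, parser_version: str) -> LineageRecord:
    """Build a record with placeholder values; only the parser version varies."""
    return LineageRecord(
        document_id=document_id,
        content_hash="<sha256 of raw bytes>",
        source_uri=f"s3://example-bucket/{document_id}.pdf",
        parser_id="example-parser",
        parser_version=parser_version,
        chunker_id="example-chunker",
        chunker_version="0.3.0",
        embedding_model_id="example-embedding-model",
        embedding_model_version="1",
        permission_tags=("analysts",),
        policy_version="policy-7",
        ingested_at="2024-06-01T02:00:00Z",
    )


records = [
    make_record("doc-1", "1.4.0"),
    make_record("doc-2", "1.5.0"),
    make_record("doc-3", "1.4.0"),
]
scope = reingest_scope(records, "example-parser", "1.5.0")
print(scope)
```

The same query pattern works for chunker or embedding model upgrades by filtering on the corresponding ID and version fields.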
Build vs. buy at this layer
Default: build the pipeline, buy the substrate. The storage (S3, Postgres, OpenSearch) is bought — nobody is building object storage from scratch. The ingestion, parsing orchestration, chunking, lineage, and permission-tagging are built, because they encode the customer's specific data shape and access rules.
The premature-buying anti-pattern: "data platform for AI" vendors that promise to handle ingestion + chunking + retrieval generically. The genericness is the problem. The wrapper changes over time; the integration to the customer's permission system breaks each time it does. By contrast, a small custom pipeline on S3 + SQS + Lambda stays stable for years.
The five mistakes we see
1. Parsing on the way in, then discarding the raw bytes
Saves storage; costs you any future re-parse. Cheaper to keep both: raw object storage in S3 usually costs far less than the parsing, embedding, and inference work it lets you redo.
2. Permissions retrofitted
Pipeline built without permissions in mind, then the customer asks for row-level access two months in. The result is usually a post-retrieval filter — which is a compliance leak. Always bake the identity-tag schema in at ingest.
3. No content hash, no dedup
The same document re-ingested fifty times because the source system re-emits events on every metadata change. Content addressing fixes this in one line and saves real money on every layer downstream.
4. Streaming-first when batch is enough
The shape that costs the most operational overhead for the least benefit. Most "we need real-time data" requirements turn out, on inspection, to be "we need this morning's data by 9 AM." Scheduled batch is fine for the latter. Reserve streaming for cases that legitimately need it.
5. Lineage as a wiki page
The parser version + chunker version + embedding model version tracked in a doc somewhere instead of as metadata on every chunk. The doc goes stale; the metadata stays correct. Six months in, the wiki is wrong and the only honest answer to "which parser version produced this answer" is "I have to go look."
How it connects to the other layers
Layer 02 feeds Layer 04: the chunks and their permission tags ARE the input to the index. It depends on Layer 01: the KMS keys, the cross-account roles, the private endpoints. It is the substrate that lineage is enforced on, and the data the audit trail ties answers back to.
Related: the infrastructure layer reference architecture, the retrieval layer reference architecture, the tooling catalog, and the concepts & standards glossary.