
Sovereign, GDPR-compliant RAG: the complete 2026 guide

The RagNight team · 16 min read · May 29, 2026

Designing a sovereign, GDPR-compliant RAG in 2026: storage location, controlled inference flows, multi-tenant isolation, and traceability — with pgvector, European open models, hybrid search, and reranking.

The widespread adoption of conversational assistants in the enterprise has surfaced a strategic question few organizations anticipated: where do your knowledge assets actually live, and who can access them? Adopting RAG (Retrieval-Augmented Generation) without thinking about data sovereignty means handing your company's information assets to a black box whose location, retention, and secondary uses you do not control.

This guide details, step by step, how to design a sovereign, GDPR-compliant RAG architecture without sacrificing answer quality or team velocity. It speaks as much to the CTO arbitrating architecture as to the DPO who will have to defend processing choices before a regulator, and to the Head of AI who must ship a genuinely useful assistant. You will not find a magic promise here: sovereignty has an engineering cost, but that cost is now entirely manageable with mature open-source building blocks.

Why sovereignty is no longer optional

For a long time, digital sovereignty was seen as the concern of an overly cautious IT department or lawyer. Three shifts have changed the picture.

First, value has moved to proprietary data. Foundation models have become widely available commodities: between Llama 4, Mistral Large 2, Qwen 3, and DeepSeek-V3 on the open side, and the Claude, GPT/o-series, and Gemini 2.x families on the closed-API side, raw generation capability no longer differentiates anyone. What sets a company apart is the internal corpus it feeds into those models. Contracts, procedures, support history, product documentation, R&D notes — these assets are now a competitive advantage you do not want to leak, nor to see used to train a vendor's model and, indirectly, your competitors'.

Second, the regulatory landscape has tightened. GDPR has required documenting the journey of every piece of personal data since 2018, but it is the EU AI Act that reshuffled the deck. In force since 1 August 2024, it rolls out obligations in stages: a ban on unacceptable practices and an AI-literacy requirement from February 2025, obligations for general-purpose AI (GPAI) models in August 2025, then the bulk of high-risk system obligations in August 2026, with some pushed to 2027. A RAG assistant touching recruitment, credit scoring, healthcare, or public services may fall into the "high-risk" category and trigger documentation, human-oversight, and risk-management duties. Anticipating this timeline is not over-caution: it avoids having to urgently re-architect a system already in production.

Third, incidents have multiplied. Prompt leaks containing customer data, conversations reused for training, undeclared non-EU subprocessors, transfers to the United States weakened each time an adequacy framework is challenged — every quarter brings reminders that blind outsourcing has a cost. When an employee pastes a confidential contract into a consumer assistant, the data has already left your perimeter, and no after-the-fact contract clause will bring it back.

Sovereignty does not mean bringing everything back in-house. It means knowing, at any moment, where each piece of data sits and under which jurisdiction it is processed — and being able to prove it on demand.

The four pillars of a sovereign RAG architecture

A sovereign architecture rests on four complementary pillars. None is sufficient on its own: perfectly localized storage protects nothing if prompts then leave in clear text for a provider bound by no contractual or jurisdictional limits, and the best self-hosted model is worthless if one tenant's query can surface another tenant's documents.

1. Storage location and control

The first reflex is to verify where your documents, metadata, and vectors are physically stored. A vector database hosted in a European region, operated by an entity subject to EU law, is the minimum baseline. PostgreSQL with the pgvector extension, deployed on infrastructure you control, offers this guarantee while avoiding lock-in to a closed vector service: your embeddings live in the same database as your business data, under the same backup, encryption, and access regime.

On the hosting side, the European ecosystem has matured: OVHcloud and Scaleway offer compute and dedicated GPUs in EU regions, Mistral AI runs its Plateforme from Europe, and Aleph Alpha specifically targets sovereign use cases in Germany. Depending on sensitivity, your options range from managed sovereign cloud to dedicated GPUs, or full on-prem for the most critical data.

For indexing, pgvector offers two index families. HNSW (Hierarchical Navigable Small World) gives an excellent recall/latency trade-off and is tuned via three key parameters: m (maximum connections per node, 16 by default) and ef_construction (build effort, typically 64–200), both fixed when the index is created, and hnsw.ef_search (query effort, 40 by default, adjustable per session or per transaction with SET). The IVFFlat alternative builds faster and uses less memory, but its clusters are computed at build time, so it should be created only once the table holds a representative volume of data. To control storage at scale, the halfvec type (16-bit floats instead of 32-bit) nearly halves the vector footprint with often negligible recall loss; binary quantization pushes the logic further for very large corpora.

-- pgvector must be installed; halfvec requires pgvector 0.7.0 or later
CREATE EXTENSION IF NOT EXISTS vector;

-- Chunk table with embedding and tenant scope
CREATE TABLE document_chunks (
  id              bigserial PRIMARY KEY,
  organization_id bigint NOT NULL,
  document_id     bigint NOT NULL,
  content         text   NOT NULL,
  embedding       halfvec(1536) NOT NULL,   -- 16-bit to reduce storage
  metadata        jsonb  NOT NULL DEFAULT '{}'
);

-- HNSW index (cosine); tune hnsw.ef_search at query time for recall/latency
CREATE INDEX ON document_chunks
  USING hnsw (embedding halfvec_cosine_ops)
  WITH (m = 16, ef_construction = 100);

-- B-tree index on the tenant column to support scoped queries (see pillar 3)
CREATE INDEX ON document_chunks (organization_id);
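
To put a number on the halfvec saving, here is a minimal back-of-the-envelope sketch. It assumes the per-value sizes given in the pgvector README (4 bytes per dimension for vector, 2 for halfvec, 1 bit for bit, each plus 8 bytes of overhead) and ignores index and row overhead, so treat the results as a lower bound. The function names are illustrative.

```python
def bytes_per_vector(dims: int, kind: str) -> int:
    """Approximate size of one stored pgvector value, excluding row and index overhead."""
    header = 8  # per-value overhead, as documented in the pgvector README
    if kind == "vector":     # 32-bit floats
        return 4 * dims + header
    if kind == "halfvec":    # 16-bit floats
        return 2 * dims + header
    if kind == "bit":        # binary quantization, 1 bit per dimension
        return (dims + 7) // 8 + header
    raise ValueError(f"unknown kind: {kind}")


def corpus_gib(n_chunks: int, dims: int, kind: str) -> float:
    """Raw vector storage for a corpus, in GiB."""
    return n_chunks * bytes_per_vector(dims, kind) / 2**30


if __name__ == "__main__":
    for kind in ("vector", "halfvec", "bit"):
        print(kind, round(corpus_gib(10_000_000, 1536, kind), 2), "GiB")
```

Under these assumptions, 10 million chunks at 1536 dimensions need roughly 57 GiB as vector and roughly 29 GiB as halfvec, before any index is built.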

2. Control over inference flows

Sovereign storage is useless if every query ships the full context to a non-EU model provider without safeguards. Three strategies coexist, and most mature architectures combine them.

  • EU-hosted models: use providers offering European endpoints with contractual non-retention and non-training commitments. Mistral's Plateforme and the OVHcloud or Scaleway offerings around open models fall into this category. You gain operational simplicity while keeping the data under European law.
  • Self-hosted models: deploy open models (Llama 4, the Mistral family, Qwen 3, DeepSeek for reasoning, Gemma 3 for smaller footprints) on your own GPU infrastructure for the most sensitive cases. This is the most MLOps-demanding option, but the only one guaranteeing that not a byte of a prompt leaves your perimeter. A "Small"-class model is often enough to synthesize already-retrieved passages.
  • Hybrid approach: route queries by sensitivity level. Critical data stays on internal infrastructure; harmless queries (public FAQ, already-online documentation) can benefit from more capable external models.

Sensitivity routing deserves to be formalized rather than improvised. A simple decision flow:

Incoming query
   │
   ├─ Contains personal data / trade secrets?
   │     ├─ YES → self-hosted model (dedicated GPU, EU)
   │     └─ NO  ↓
   │
   ├─ Confidential but non-personal internal corpus?
   │     ├─ YES → open model hosted in EU (non-retention)
   │     └─ NO  ↓
   │
   └─ Public / low-sensitivity content?
         └─ YES → capable external API (if the quality trade-off warrants it)

Sensitivity classification ideally relies on metadata carried by each chunk (confidentiality level, presence of personal data detected at ingestion), so the routing decision is deterministic and auditable rather than left to a heuristic.
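
The decision flow above can be sketched as a small deterministic function. This is an illustration under stated assumptions, not a reference implementation: the confidentiality levels ("public", "internal", "secret"), the ChunkMeta fields, and the fail-closed default are all hypothetical choices, to be replaced by whatever labels your ingestion step actually writes.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    SELF_HOSTED = "self-hosted model (dedicated GPU, EU)"
    EU_HOSTED = "open model hosted in EU (non-retention)"
    EXTERNAL_API = "capable external API"


@dataclass(frozen=True)
class ChunkMeta:
    confidentiality: str     # hypothetical levels: "public", "internal", "secret"
    has_personal_data: bool  # assumed to be set at ingestion by a detection step


def route(chunks: list[ChunkMeta]) -> Route:
    """Pick the most restrictive route required by any retrieved chunk."""
    if any(c.has_personal_data or c.confidentiality == "secret" for c in chunks):
        return Route.SELF_HOSTED
    if any(c.confidentiality == "internal" for c in chunks):
        return Route.EU_HOSTED
    if chunks and all(c.confidentiality == "public" for c in chunks):
        return Route.EXTERNAL_API
    # Unknown label or no metadata at all: fail closed.
    return Route.SELF_HOSTED
```

Because the decision depends only on stored metadata, the same inputs always give the same route, and the chosen route can be written to the audit log alongside the chunk IDs.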

3. Multi-tenant isolation

In a B2B context, isolation between customers or departments is non-negotiable. Every search query must be scoped by organization identifier, enforced at the data-access layer and not just in the UI. A cross-tenant leak is not a minor bug: it is a major security incident and potentially a personal data breach under GDPR, which must be notified to the supervisory authority within 72 hours of becoming aware of it unless it is unlikely to put individuals at risk.

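Concretely, the tenant filter must be part of the similarity query itself, never an application-side post-filter. With a post-filter, other tenants' chunks compete for the top-k slots: they reach application code before being discarded and can crowd out the tenant's own results. The scoped query looks like this: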

-- The organization scope is IN the vector query, not after it
SELECT id, content, metadata
FROM document_chunks
WHERE organization_id = $1            -- non-negotiable isolation
ORDER BY embedding <=> $2             -- cosine distance
LIMIT 20;

Depending on how strict your requirements are, you can reinforce isolation with PostgreSQL Row-Level Security, per-tenant schemas, or even separate databases for the most regulated customers. The principle stands: isolation is a property of the infrastructure, verifiable by automated tests, not a matter of developer discipline.
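
To see why the position of the filter matters, here is a toy in-memory comparison. It uses brute-force cosine distance over four hand-made vectors rather than pgvector, and the function names are illustrative; the only point is the difference between filtering before and after ranking.

```python
import math

Chunk = tuple[str, int, list[float]]  # (chunk_id, organization_id, embedding)


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k_scoped(chunks: list[Chunk], org_id: int, query: list[float], k: int) -> list[str]:
    """Tenant filter applied BEFORE ranking, like the WHERE clause in the SQL query."""
    own = [c for c in chunks if c[1] == org_id]
    ranked = sorted(own, key=lambda c: cosine_distance(c[2], query))
    return [c[0] for c in ranked[:k]]


def top_k_post_filtered(chunks: list[Chunk], org_id: int, query: list[float], k: int) -> list[str]:
    """Anti-pattern: rank every tenant's chunks, then drop the foreign ones."""
    ranked = sorted(chunks, key=lambda c: cosine_distance(c[2], query))[:k]
    return [c[0] for c in ranked if c[1] == org_id]


corpus: list[Chunk] = [
    ("a1", 1, [1.0, 0.0]),
    ("a2", 1, [0.6, 0.8]),
    ("b1", 2, [0.9, 0.1]),
    ("b2", 2, [0.8, 0.2]),
]
query = [1.0, 0.0]

if __name__ == "__main__":
    print(top_k_scoped(corpus, 1, query, 2))         # ['a1', 'a2']
    print(top_k_post_filtered(corpus, 1, query, 2))  # ['a1']
```

With the post-filter, tenant 2's chunk b1 takes one of the two slots, passes through application memory, and leaves tenant 1 with a single result. A test of this shape, run against the real data-access layer, is one way to make isolation verifiable.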

4. Traceability and auditability

Every generated answer must be traceable to its sources. This serves two goals: user trust (verifying where information comes from and spotting a hallucination) and compliance (documenting the processing). For each interaction, keep query logs, the IDs of retrieved documents and chunks, the relevance score, and the versions of both the embedding and generation models.

This logging is what turns an audit into a routine rather than a crisis. When a customer asks "what is this answer based on?" or a regulator demands the trace of a processing operation, you have the full chain. Beware, though: these logs themselves contain potentially personal data and must therefore be subject to an explicit retention policy (see the GDPR checklist below).
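
One possible shape for such a trace record is sketched below. The class and field names are hypothetical, chosen to mirror the items listed above (query, retrieved document and chunk IDs, relevance scores, model versions); the retention check is a deliberately simple date comparison.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class RetrievalTrace:
    organization_id: int
    query: str                    # may contain personal data: subject to retention
    chunk_ids: list[int]
    document_ids: list[int]
    relevance_scores: list[float]
    embedding_model: str          # model name and version, as deployed
    generation_model: str
    created_at: datetime

    def expires_at(self, retention_days: int) -> datetime:
        return self.created_at + timedelta(days=retention_days)

    def to_json(self) -> str:
        record = asdict(self)
        record["created_at"] = self.created_at.isoformat()
        return json.dumps(record, ensure_ascii=False)


def is_expired(trace: RetrievalTrace, retention_days: int, now: datetime) -> bool:
    """True once the trace has reached the end of its retention period."""
    return now >= trace.expires_at(retention_days)


if __name__ == "__main__":
    trace = RetrievalTrace(
        organization_id=1,
        query="What is the notice period in contract 42?",
        chunk_ids=[10, 11],
        document_ids=[3],
        relevance_scores=[0.82, 0.77],
        embedding_model="example-embedder-v1",   # placeholder names
        generation_model="example-llm-v1",
        created_at=datetime.now(timezone.utc),
    )
    print(trace.to_json())
```

A scheduled purge job can then delete every record for which is_expired returns true, which keeps the traceability logs inside the retention policy they are themselves subject to.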

The document processing pipeline, step by step

A RAG system's quality is largely decided upstream of generation, in how documents are ingested and prepared. Here are the five essential steps.

  1. Extraction — Adapt extraction to the format: PDF (often the trickiest, between columns, tables, and scanned PDFs needing OCR), DOCX, web pages, business-tool exports. Sloppy extraction propagates noise through the whole pipeline.
  2. Cleaning — Remove headers, footers, repetitive legal notices, and layout artifacts. This is also the stage to detect and tag personal data, feeding sensitivity routing and any pseudonymization.
  3. Semantic chunking — Segment text along logical boundaries (sections, paragraphs) rather than a mechanical character count. Moderate overlap (roughly 10–20% of the chunk size) preserves context across boundaries. More advanced techniques sharpen the result further: late chunking (embedding the whole document before splitting, so each chunk inherits global context) and contextual retrieval (prefixing each chunk with a short LLM-generated context) markedly reduce meaning breaks.
  4. Vectorization — Compute embeddings, ideally asynchronously via a job queue so large volumes do not block ingestion. The embedding model matters: OpenAI text-embedding-3-large (up to 3072 dimensions), voyage-3, Cohere embed v3, or the open, multilingual bge-m3 and Jina embeddings v3. The open models are especially relevant in a sovereign context, where you want the embedding step under your control too.
  5. Indexing — Store vectors with an HNSW similarity-search index, keeping metadata (source, date, permissions, sensitivity level) attached to each chunk. It is this metadata that makes filtering, isolation, and targeted erasure possible.

Chunking is the most underestimated step. Poor chunking sabotages relevance regardless of embedding-model quality — you never recover downstream from information split in the wrong place.
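
As an illustration of chunking along logical boundaries with moderate overlap, here is a minimal sketch that packs whole paragraphs into chunks and carries the trailing paragraphs of one chunk into the next. It assumes paragraphs are separated by blank lines; the 1200-character budget is an arbitrary placeholder and the 15% overlap simply sits in the 10–20% range mentioned above. It does not implement late chunking or contextual retrieval.

```python
def chunk_paragraphs(text: str, max_chars: int = 1200, overlap_ratio: float = 0.15) -> list[str]:
    """Pack whole paragraphs into chunks, repeating trailing paragraphs as overlap."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []

    for para in paragraphs:
        # Length of the chunk if this paragraph were added (2 chars per separator).
        candidate_len = sum(len(p) for p in current) + len(para) + 2 * len(current)
        if current and candidate_len > max_chars:
            chunks.append("\n\n".join(current))
            # Carry over as many trailing paragraphs as fit in the overlap budget.
            budget = int(max_chars * overlap_ratio)
            carried: list[str] = []
            for prev in reversed(current):
                if len(prev) > budget:
                    break
                carried.insert(0, prev)
                budget -= len(prev)
            current = carried
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

A paragraph longer than max_chars is kept whole as its own oversized chunk rather than being cut mid-sentence; whether that is acceptable depends on your embedding model's input limit.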

Beyond pure vector search: hybrid and reranking

Purely dense search (vector against vector) has a blind spot: it handles exact matches poorly — a reference number, a niche acronym, a rare proper noun. Hybrid search combines a lexical channel (BM25 / keywords) and a dense channel, then fuses the two rankings, most often via RRF (Reciprocal Rank Fusion). You get the best of both worlds: keyword robustness and vector semantic understanding.
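
Reciprocal Rank Fusion itself fits in a few lines: each document scores the sum of 1 / (k + rank) over the rankings in which it appears. The sketch below assumes each channel returns an ordered list of chunk IDs and uses k = 60, a commonly used default; the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked ID lists; a higher fused score means a better rank."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    lexical = ["c7", "c2", "c9"]   # e.g. BM25 order
    dense = ["c2", "c4", "c7"]     # e.g. vector-similarity order
    print(reciprocal_rank_fusion([lexical, dense]))  # ['c2', 'c7', 'c4', 'c9']
```

Because RRF only looks at ranks, it needs no calibration between BM25 scores and cosine distances, which is a large part of why it is a popular fusion choice.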

A second layer often makes the difference: reranking. A cross-encoder (hosted options such as Cohere Rerank 3 or Voyage rerank, or open-weight models such as bge-reranker-v2-m3 and Jina reranker) finely re-scores the top 20–50 candidates before keeping only 3–5 for the context. The added latency is usually modest (typically tens to a few hundred milliseconds, depending on model size and hardware) for an often dramatic precision gain, and a self-hosted open reranker remains fully compatible with a sovereign architecture. Multi-vector, late-interaction approaches open a complementary path: ColBERT for text, and ColPali for visual documents (complex PDFs, diagrams).

GDPR compliance: the operational checklist

Beyond principles, here are the concrete points to validate before any production rollout.

  • Legal basis — Identify the legal basis (legitimate interest, consent, contractual performance) for each category of ingested data. Document it in your records of processing; it is the first question of any audit.
  • Minimization — Ingest only genuinely useful documents. A bloated corpus increases the risk surface without proportional relevance gains — and often degrades retrieval quality by drowning the good passages.
  • Right to erasure — Ensure that a document removed from the corpus also has its vectors and derived copies deleted. Deletion must cascade to the embeddings: a chunk deleted in the database but still present in the vector index or a cache is still live data. Test this cascade as you would a migration, because it is exactly the point a regulator will check.
  • Personal data in prompts — Detect and, if needed, pseudonymize personal data before sending it to an external model. The sensitivity routing described above is the natural complement to this measure.
  • Subprocessors — Maintain a register of subprocessors and their locations, and ensure every non-EU transfer is framed (standard contractual clauses, safeguard assessment). Each model or embedding provider is a subprocessor in its own right.
  • Retention — Define and enforce retention policies on conversation and query logs, including those you created for the traceability of pillar 4.
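
For the "personal data in prompts" item, here is a deliberately narrow sketch of reversible pseudonymization. It only handles e-mail addresses via a simple regular expression; real detection would need broader patterns or a named-entity model. The function names and the placeholder format are illustrative. Note that pseudonymized text remains personal data under GDPR as long as the mapping exists, so the mapping must stay inside your perimeter.

```python
import hashlib
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def pseudonymize_emails(text: str, salt: str) -> tuple[str, dict[str, str]]:
    """Replace e-mail addresses with stable placeholders; return text and reverse map."""
    mapping: dict[str, str] = {}

    def _replace(match: re.Match) -> str:
        value = match.group(0)
        digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:8]
        token = f"<EMAIL_{digest}>"
        mapping[token] = value  # kept locally, never sent to the external model
        return token

    return EMAIL_RE.sub(_replace, text), mapping


def restore(text: str, mapping: dict[str, str]) -> str:
    """Put the original values back into a model answer."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

The same address always maps to the same placeholder within a salt, so the external model can still reason about "the same person" without seeing the address itself.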

GDPR compliance is not a layer you add at the end: it is a property you design from the data schema. A clean document_id linking chunk, vector, and source makes erasure trivial; its absence makes erasure impossible.
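
The cascade principle can be exercised with a small test. The sketch below uses SQLite from the Python standard library purely so it runs anywhere; it is an assumption-laden stand-in for the PostgreSQL schema above, with a hypothetical documents table and a foreign key declared ON DELETE CASCADE. It covers the database only, not caches or external indexes.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces cascades only when enabled
    conn.executescript(
        """
        CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE document_chunks (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
            content TEXT NOT NULL,
            embedding BLOB NOT NULL
        );
        INSERT INTO documents (id, title) VALUES (1, 'contract'), (2, 'faq');
        INSERT INTO document_chunks (document_id, content, embedding) VALUES
            (1, 'clause 1', x'00'), (1, 'clause 2', x'00'), (2, 'question 1', x'00');
        """
    )
    return conn


def erase_document(conn: sqlite3.Connection, document_id: int) -> int:
    """Delete a document and return how many of its chunks remain (expected: 0)."""
    with conn:  # commits on success, rolls back on error
        conn.execute("DELETE FROM documents WHERE id = ?", (document_id,))
    (remaining,) = conn.execute(
        "SELECT COUNT(*) FROM document_chunks WHERE document_id = ?", (document_id,)
    ).fetchone()
    return remaining


if __name__ == "__main__":
    db = make_db()
    print(erase_document(db, 1))  # 0
```

In PostgreSQL the equivalent is a REFERENCES ... ON DELETE CASCADE constraint on document_chunks.document_id; because pgvector indexes live in the same database, deleting the row also removes it from the vector index.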

The knowledge audit: the too-often-skipped prerequisite

Before vectorizing anything, take the time to map your information assets. This knowledge audit answers simple but decisive questions:

  • Which documents are authoritative, and which are obsolete? A repealed-but-still-indexed procedure will produce wrong answers with total confidence.
  • Where are the gaps users will inevitably hit?
  • Which content holds sensitive data needing special handling (routing, pseudonymization, outright exclusion)?
  • Who owns each source, and how often is it updated?

A RAG system amplifies the quality — or mediocrity — of the corpus you feed it. Injecting contradictory, outdated documentation produces an assistant that confidently answers... inaccuracies. This is the most insidious trap: a bad assistant that says "I don't know" is harmless; a bad assistant that invents with confidence destroys user trust in a handful of interactions. Detecting "knowledge gaps" during use then lets you enrich the corpus where real needs surface, rather than documenting everything up front.

Measuring what truly matters

A RAG project cannot be steered on intuition. A few structuring indicators:

  • Sourced-answer rate — Share of answers backed by at least one verifiable source. Agree on a target threshold up front: below it, the assistant is hallucinating more than it informs.
  • Retrieval relevance — Do the right documents surface in the top results? Measure it on a golden set of reference questions, with metrics like context precision / context recall.
  • Perceived latency — Does response time stay compatible with fluid conversational use? Reranking, routing, and self-hosting all carry a cost to monitor end to end.
  • Detected-gap rate — Frequency of queries for which the corpus holds no satisfactory answer. This is your enrichment roadmap.

To go beyond intuition, frameworks like RAGAS automate evaluation along faithfulness (is the answer faithful to the sources?), answer relevancy, context precision, and context recall, complemented by LLM-as-judge approaches. The point is not to chase perfect scores but to have a stable dashboard: without a baseline, any pipeline "improvement" is merely an opinion.
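
Before adopting a full framework, a baseline can be computed by hand on a golden set. The sketch below uses plain set-based precision and recall at k over retrieved chunk IDs, which is simpler than (and not identical to) the LLM-judged context precision and context recall that RAGAS reports. The function names and the "sources" key are hypothetical.

```python
def precision_recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> tuple[float, float]:
    """Set-based precision@k and recall@k for one golden-set question."""
    top = retrieved[:k]
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sourced_answer_rate(answers: list[dict]) -> float:
    """Share of answers that cite at least one source."""
    if not answers:
        return 0.0
    return sum(1 for a in answers if a.get("sources")) / len(answers)


if __name__ == "__main__":
    p, r = precision_recall_at_k(["d1", "d5", "d2", "d9"], {"d1", "d2", "d3"}, k=4)
    print(p, round(r, 3))  # 0.5 0.667
    print(sourced_answer_rate([{"sources": ["d1"]}, {"sources": []}]))  # 0.5
```

Averaging these values over the golden set after each pipeline change gives the stable baseline against which an "improvement" can actually be checked.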

These metrics turn an opaque "magic" assistant into a steerable information system whose progress you can justify to management — and whose compliance you can justify to a regulator.

Conclusion: sovereignty as a lasting advantage

Building a sovereign RAG architecture takes more upfront effort than wiring directly to a generalist API. But it pays off fast: it turns an imposed dependency into a controlled asset. You keep control of your knowledge, you document your compliance without panic at every audit, and you build lasting trust with customers and regulators alike.

The good news is that the technical building blocks are mature today. PostgreSQL and pgvector with an HNSW index for storage, capable open models (Llama 4, Mistral, Qwen 3, DeepSeek) for sensitive inference, credible European hosts, hybrid search with reranking for quality, proven ingestion pipelines: everything exists to build a RAG that is both performant and sovereign. All that remains is to assemble it methodically — starting by mapping what you actually know.
