GDPR and sovereignty: how your data stays in Europe with RagNight
AI agents managed by major US hyperscalers raise compliance issues that many CIOs prefer to ignore. Let's set things straight.
The GDPR compliance guide for your RAG projects: legal basis, minimization, personal data in embeddings and prompts, the right to erasure cascading into vectors, subprocessors, and the interplay with the AI Act.
Wiring a RAG assistant to a company's internal documentation almost always means handling personal data: names in reports, customer details in tickets, identities in contracts, signatures at the bottom of emails. GDPR does not stop at your AI project's door — on the contrary, generative AI amplifies the stakes, because it duplicates, vectorizes, and sometimes exfiltrates that data to external models. The good news: a well-built RAG is also a compliant RAG. Compliance is not a legal layer bolted on afterward — it is an architectural property decided at ingestion time, not at the moment the regulator comes knocking.
This guide walks, point by point, through GDPR compliance for a RAG project: choosing the legal basis, cascading erasure into embeddings, the DPIA, the subprocessor register, and the interplay with the EU AI Act. The goal is not to turn you into a lawyer, but to give you the engineering decisions that make a system defensible.
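First reflex: map where personal data lives in your pipeline. It is everywhere, often where you don't expect it: in the source documents, in the chunks cut from them, in the embeddings, in the prompts sent to the model, and in the logs and other persistent traces.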
Take a concrete case. A company indexes its annual review notes so that an assistant can help managers prepare appraisals. Each document contains the employee's name, assessments, sometimes health information (a sick leave mentioned), compensation elements. On ingestion, that document becomes thirty-odd chunks, each vectorized into a 1536- or 3072-dimensional embedding, stored in an index. A manager's question ("What were Jean Dupont's areas for improvement last year?") becomes a prompt that contains the name and pulls back the most sensitive passages of the file. On a single query you have processing of potentially sensitive data, disclosure to a model, and a persistent trace. If that model is a non-EU API, the data has traveled.
A classic error: believing vectorization "anonymizes" data. An embedding is a transformation of content, not a deletion. Research has shown that a substantial portion of the original text can be reconstructed from its vectors alone (embedding inversion attacks). If it derives from personal data, the embedding remains subject to GDPR.
Until this mapping is done, everything else is flying blind. It is also what will feed your record of processing activities and, where relevant, your DPIA.
No processing of personal data is lawful without a legal basis. That is Article 6 of the GDPR, and it admits no exception: no legal basis, no processing. For an enterprise RAG, the relevant bases are in practice three or four.
Legitimate interest is the basis most often chosen for an internal productivity tool. But it is not a blank check: it requires a balancing test in three steps, which must be documented in writing.
Concretely: an assistant querying the technical document base and internal procedures passes the test easily. An assistant that vacuums up every employee's personal emails to "know everything" fails the balancing — an employee's reasonable expectation is not that their exchanges be vectorized and exposed to colleagues through a chatbot.
Consent is tempting but treacherous in an employment context. The subordination relationship weakens its validity: an employee is not in a position to refuse freely when the employer asks. The EDPB and national authorities are consistent on this. For an internal tool, prefer legitimate interest (for a convenience tool) or performance of the employment contract / legal obligation (for processing genuinely necessary to the working relationship). Reserve consent for cases where it is truly free and revocable without consequence.
An assistant processing data strictly necessary to provide a contractual service (for example answering customers from their own files) can rely on contract performance. Legal obligation covers rarer cases (retention of accounting documents, for instance).
Identify the legal basis for each category of ingested data, not for "the RAG project" as a block. Technical documentation often falls under legitimate interest; customer files, under contract performance; sensitive HR data may require a separate basis, or be excluded from the corpus altogether. This is the foundation: everything else follows.
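The per-category approach can be enforced at ingestion time. The sketch below is a minimal illustration, not a legal determination: the category names, the `BASIS_BY_CATEGORY` mapping, and the `check_ingestion` function are all hypothetical, and the mapping simply mirrors the examples given above. A category with no documented basis stays out of the corpus.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LegalBasis(Enum):
    LEGITIMATE_INTEREST = "legitimate_interest"
    CONTRACT = "contract_performance"
    LEGAL_OBLIGATION = "legal_obligation"
    CONSENT = "consent"


# Hypothetical category names; the mapping mirrors the examples in the text.
# None means "no basis documented yet": the category stays out of the corpus.
BASIS_BY_CATEGORY: dict[str, Optional[LegalBasis]] = {
    "technical_documentation": LegalBasis.LEGITIMATE_INTEREST,
    "customer_files": LegalBasis.CONTRACT,
    "accounting_records": LegalBasis.LEGAL_OBLIGATION,
    "sensitive_hr": None,
}


@dataclass(frozen=True)
class IngestionDecision:
    allowed: bool
    basis: Optional[LegalBasis]
    reason: str


def check_ingestion(category: str) -> IngestionDecision:
    """Allow ingestion only when a legal basis is documented for the category."""
    if category not in BASIS_BY_CATEGORY:
        return IngestionDecision(False, None, f"unknown category: {category}")
    basis = BASIS_BY_CATEGORY[category]
    if basis is None:
        return IngestionDecision(False, None, f"no legal basis documented for: {category}")
    return IngestionDecision(True, basis, "ok")
```

Rejecting unknown categories by default means a new data source cannot slip into the corpus before someone has decided which basis covers it.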
The minimization principle (Article 5(1)(c)) is also good engineering. Ingest only documents genuinely useful for the intended use. A bloated corpus:
In practice, minimization breaks down into several concrete levers:
Minimization serves two masters at once: GDPR compliance and RAG quality. A tight, relevant corpus is safer and more performant. The noise you remove for compliance is the same noise that was degrading your retrieval results.
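Erasure is, technically, the most often botched requirement. When someone exercises their right to erasure (Article 17), deleting the source document is not enough. The information also lives in the chunks cut from that document, their embeddings and the vector index, response caches, conversation histories, and backups.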
The technical precondition for erasure is lineage traceability: each embedding must know which chunk it came from, and each chunk which document. A simple relational schema suffices, as long as the foreign keys are laid down from the start. With PostgreSQL and pgvector, it looks like:
CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector: provides the vector type and HNSW indexes

CREATE TABLE documents (
    id          BIGSERIAL PRIMARY KEY,
    source_ref  TEXT NOT NULL,             -- identifier on the source side (DMS, ticket…)
    subject_id  TEXT,                      -- data subject, if identifiable
    created_at  TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE document_chunks (
    id           BIGSERIAL PRIMARY KEY,
    document_id  BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
    content      TEXT NOT NULL,
    embedding    vector(1536)              -- indexed with HNSW below
);

CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops);
The ON DELETE CASCADE is the detail that changes everything: deleting a documents row automatically deletes its chunks and their embeddings in the same transaction. The deleted rows stop being returned by queries immediately; PostgreSQL physically reclaims the dead rows and their HNSW index entries at the next (auto)VACUUM. Erasing a data subject then becomes a deterministic query:
-- Erase all documents tied to a data subject
DELETE FROM documents WHERE subject_id = $1;
-- chunks + embeddings cascaded; index entries reclaimed by VACUUM
If you store your vectors in a separate engine rather than in PostgreSQL, you lose the free CASCADE: you must then propagate the deletion yourself, by document identifier, and make sure the engine actually removes the vector from the index (and not merely marks it deleted pending a compaction). This is one argument for keeping vector storage inside the transactional database.
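When the vectors live in a separate engine, the propagation step can be sketched as follows. This is an illustration under stated assumptions: the `VectorStore` interface and its two methods are hypothetical (real engines expose their own delete and count calls), and `InMemoryVectorStore` is a toy stand-in used only to exercise the logic. The check after deletion confirms the store no longer reports vectors for the document; whether the engine has physically removed them from its index still has to be verified against that engine's documentation.

```python
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical interface; real engines expose their own delete and count calls."""

    def delete_by_document(self, document_id: int) -> int: ...

    def count_by_document(self, document_id: int) -> int: ...


class InMemoryVectorStore:
    """Toy stand-in used to exercise the logic, not a real engine."""

    def __init__(self) -> None:
        self._vectors: dict[int, list[list[float]]] = {}

    def add(self, document_id: int, vector: list[float]) -> None:
        self._vectors.setdefault(document_id, []).append(vector)

    def delete_by_document(self, document_id: int) -> int:
        return len(self._vectors.pop(document_id, []))

    def count_by_document(self, document_id: int) -> int:
        return len(self._vectors.get(document_id, []))


def propagate_erasure(store: VectorStore, document_ids: list[int]) -> dict[int, int]:
    """Delete vectors document by document, then verify that none remain.

    Returns the number of vectors removed per document id and raises if the
    store still reports vectors for a document after deletion.
    """
    removed: dict[int, int] = {}
    for document_id in document_ids:
        removed[document_id] = store.delete_by_document(document_id)
        remaining = store.count_by_document(document_id)
        if remaining:
            raise RuntimeError(
                f"{remaining} vector(s) still present for document {document_id}"
            )
    return removed
```

The returned counts are the kind of figure worth keeping as evidence that the deletion actually reached the vector engine.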
GDPR asks you not only to erase but to be able to prove it (accountability, Article 5(2)). Log every erasure in a dedicated journal: who requested, when, on which subject_id, how many documents/chunks deleted, and confirmation that caches and backups were purged. Mind the paradox: this proof journal must not itself reintroduce the erased data — record technical identifiers and a timestamp, not the deleted content.
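A journal entry along these lines could look like the sketch below. The function name, field names, and the salt parameter are assumptions made for illustration. One design choice goes slightly beyond the text: the subject identifier is stored as a salted hash rather than in clear, on the assumption that `subject_id` may itself be identifying (an email address, for instance).

```python
import hashlib
import json
from datetime import datetime, timezone


def erasure_journal_entry(
    requested_by: str,
    subject_id: str,
    documents_deleted: int,
    chunks_deleted: int,
    caches_purged: bool,
    backup_expiry_scheduled: bool,
    salt: str,
) -> str:
    """Build one JSON journal line holding identifiers and counts, never content."""
    # Assumption: keep a salted hash so the journal does not hold the raw identifier.
    subject_ref = hashlib.sha256((salt + subject_id).encode("utf-8")).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "requested_by": requested_by,
        "subject_ref": subject_ref,
        "documents_deleted": documents_deleted,
        "chunks_deleted": chunks_deleted,
        "caches_purged": caches_purged,
        "backup_expiry_scheduled": backup_expiry_scheduled,
    }
    return json.dumps(entry, sort_keys=True)
```

Each call yields one line that can be appended to an append-only log; none of the deleted text ever passes through this function.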
Don't forget the blind spots: backups (define an expiry/rotation policy rather than keeping backups full of erased data forever), response caches, and conversation histories where a snippet of the document may have been served to a user.
Erasure requested
─► delete the source document
─► delete the attached chunks (CASCADE)
─► delete the corresponding embeddings + index
─► purge response caches and memorized results
─► schedule expiry in backups
─► log the operation in the proof journal
Design erasure from the start. Retrofitting it onto an architecture where embeddings aren't linked to their source is painful — sometimes you must re-vectorize the whole corpus to rebuild the lineage — and exposes you to costly breaches.
Each query sends the model the user's question and the retrieved context, which may contain personal data. If the model is a non-EU API, that data leaves your perimeter at the precise moment of inference — the most frequent and most underestimated leak point. Two complementary defenses.
Before sending the context to an external model, detect personal entities (names, emails, numbers, IBANs…) and replace them with stable tokens. "Stable" is the key word: the same name must yield the same token throughout the prompt, or the model loses the thread. Example:
Raw prompt:
"Jean Dupont (jean.dupont@acme.fr) reported a bug on 03/12.
Follow-up requested by Marie Durand."
Pseudonymized prompt sent to the model:
"[PERSON_1] ([EMAIL_1]) reported a bug on 03/12.
Follow-up requested by [PERSON_2]."
On receipt, you re-substitute the tokens with the real values before displaying the answer to the authorized user. The external model never saw the identities. You keep a token → value mapping on the EU side, in session memory, never persisted at the subprocessor.
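The substitution and re-substitution can be sketched in a few lines. This is a minimal illustration, not a production detector: the `Pseudonymizer` class is hypothetical, emails are caught with a simple regular expression, and person names are passed in as a known list, where a real system would rely on a named-entity recognizer. The mapping lives only in the object, which matches the "session memory" described above.

```python
import re

# Simple email pattern with no capture groups of its own.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


class Pseudonymizer:
    """Replaces detected entities with stable tokens and can restore them."""

    def __init__(self) -> None:
        self._token_by_value: dict[str, str] = {}
        self._value_by_token: dict[str, str] = {}
        self._counters: dict[str, int] = {}

    def _token_for(self, kind: str, value: str) -> str:
        # Same value, same token: this is what keeps tokens stable.
        key = f"{kind}:{value}"
        if key not in self._token_by_value:
            self._counters[kind] = self._counters.get(kind, 0) + 1
            token = f"[{kind}_{self._counters[kind]}]"
            self._token_by_value[key] = token
            self._value_by_token[token] = value
        return self._token_by_value[key]

    def pseudonymize(self, text: str, known_names: list[str]) -> str:
        """Replace emails and known names in a single left-to-right pass."""
        alternatives = [f"(?P<EMAIL>{EMAIL_PATTERN})"]
        if known_names:
            # Longest names first so a longer name wins over a shorter overlap.
            names = sorted(known_names, key=len, reverse=True)
            alternatives.append(
                "(?P<PERSON>" + "|".join(re.escape(n) for n in names) + ")"
            )
        pattern = re.compile("|".join(alternatives))
        return pattern.sub(lambda m: self._token_for(m.lastgroup, m.group(0)), text)

    def restore(self, text: str) -> str:
        """Put the real values back into a model answer."""
        for token, value in self._value_by_token.items():
            text = text.replace(token, value)
        return text
```

Applied to the raw prompt above with the two names as known entities, `pseudonymize` produces the pseudonymized prompt shown, and `restore` recovers the original text. Reusing the same `Pseudonymizer` instance across a conversation keeps `[PERSON_1]` pointing at the same person from one turn to the next.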
This technique has limits: it does not make data anonymous under GDPR (whoever holds the mapping can still re-identify, so this is pseudonymization, which remains personal-data processing), it can fail on poorly detected entities or indirectly identifying data ("the CFO of the Lyon subsidiary"), and it sometimes degrades answer quality. But it sharply reduces exposure at the most common leak point.
Not all queries are equal. Set up routing: queries whose context contains sensitive data (health, opinions, HR data) are processed by an EU or self-hosted model (Mistral on La Plateforme EU, an open model like Llama 4 or Mistral Small 3 on OVHcloud or Scaleway); benign queries can go to a more capable external API. Routing rests on classifying the retrieved context, not just the question — it is the context that carries the risk.
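A routing decision of this kind can be sketched as below. The keyword patterns are a deliberately crude, hypothetical stand-in for a real sensitivity classifier, and the two target names are placeholders for whatever endpoints you actually operate. The point the sketch illustrates is that the decision looks at the retrieved chunks as well as the question.

```python
import re
from dataclasses import dataclass

# Hypothetical keyword heuristics standing in for a real sensitivity classifier.
SENSITIVE_PATTERNS = {
    "health": re.compile(r"\b(sick leave|diagnosis|medical)\b", re.IGNORECASE),
    "hr": re.compile(r"\b(salary|appraisal|disciplinary)\b", re.IGNORECASE),
    "opinions": re.compile(r"\b(trade union|religion|political)\b", re.IGNORECASE),
}


@dataclass(frozen=True)
class Route:
    target: str
    reasons: tuple[str, ...]


def route_query(question: str, retrieved_chunks: list[str]) -> Route:
    """Pick a model target from the retrieved context, not only the question."""
    text = "\n".join([question, *retrieved_chunks])
    reasons = tuple(
        sorted(name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text))
    )
    target = "eu_or_self_hosted" if reasons else "external_api"
    return Route(target, reasons)
```

A harmless-looking question such as "What changed last quarter?" is still routed to the EU or self-hosted target if one of the retrieved chunks mentions a sick leave. Keeping the reasons alongside the target makes each routing decision auditable.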
Every vendor processing your data on your behalf is a subprocessor under GDPR (Article 28). The scope is broader than people think.
The reflex is to think only of the LLM. Yet a RAG's full chain often includes the model provider, the vector hosting, the observability tool, and the OCR service.
Each must appear in your subprocessor register, with its location, the nature of the data processed, and the signed DPA (Data Processing Agreement). An observability tool that captures prompts in clear text and stores them in the US is a non-EU transfer many teams overlook.
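For any transfer outside the European Economic Area, you need an appropriate safeguard (Chapter V of the GDPR), typically an adequacy decision covering the destination, standard contractual clauses (SCCs), or binding corporate rules.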
Also verify non-retention and no-reuse-for-training commitments: most "enterprise" offerings from major providers commit not to train on your data and not to keep it beyond a short anti-abuse window — but consumer offerings do not. The line between the two is a classic source of leakage.
Your model provider, your vector hosting, your observability tool, your OCR: all can be subprocessors. Map the full chain, not just the LLM, and keep the register current with every new tool wired into the pipeline.
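The register itself can be kept as structured data and checked automatically. The sketch below is one possible shape, with hypothetical field names and a hypothetical `register_gaps` function; it flags the two gaps discussed above, a missing DPA and a non-EEA transfer with no documented safeguard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Subprocessor:
    name: str
    role: str  # e.g. "llm", "vector_hosting", "observability", "ocr"
    in_eea: bool
    data_processed: str
    dpa_signed: bool
    transfer_safeguard: Optional[str] = None  # expected when in_eea is False


def register_gaps(register: list[Subprocessor]) -> list[str]:
    """List the entries that lack a signed DPA or a documented transfer safeguard."""
    gaps: list[str] = []
    for sp in register:
        if not sp.dpa_signed:
            gaps.append(f"{sp.name}: no signed DPA")
        if not sp.in_eea and not sp.transfer_safeguard:
            gaps.append(f"{sp.name}: non-EEA transfer without documented safeguard")
    return gaps
```

Running such a check in CI whenever a new tool is added to the pipeline is one way to keep the register current rather than reconstructing it before an audit.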
The Data Protection Impact Assessment (DPIA — Article 35) is mandatory when processing is likely to result in a high risk to individuals' rights and freedoms.
The GDPR and EDPB guidelines set criteria. A DPIA is required notably for large-scale evaluation/scoring, large-scale processing of sensitive data, systematic monitoring, or matching/combining datasets. National authorities also publish lists of processing that mandate a DPIA. For a RAG, several situations typically trigger it:
Conversely, an assistant querying only technical documentation with no significant personal data generally does not trigger a DPIA — but document that reasoning (the "why not" is part of accountability).
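A first-pass screening can be recorded in code so that the reasoning is kept either way. The sketch below is only an aid to documentation, not a substitute for the DPO's assessment: the trigger names paraphrase the criteria cited above, and `screen_dpia` is a hypothetical helper that refuses a "no DPIA" outcome without a written rationale.

```python
from dataclasses import dataclass

# Trigger names paraphrase the criteria cited in the text.
DPIA_TRIGGERS = (
    "large_scale_evaluation_or_scoring",
    "large_scale_sensitive_data",
    "systematic_monitoring",
    "matching_or_combining_datasets",
)


@dataclass(frozen=True)
class DpiaScreening:
    required: bool
    triggers: tuple[str, ...]
    rationale: str


def screen_dpia(flags: dict[str, bool], rationale: str) -> DpiaScreening:
    """Record which triggers apply; a negative outcome must carry a rationale."""
    unknown = set(flags) - set(DPIA_TRIGGERS)
    if unknown:
        raise ValueError(f"unknown trigger(s): {sorted(unknown)}")
    triggers = tuple(t for t in DPIA_TRIGGERS if flags.get(t, False))
    if not triggers and not rationale.strip():
        raise ValueError("document why no DPIA is needed")
    return DpiaScreening(bool(triggers), triggers, rationale)
```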
A well-made DPIA describes:
The DPIA is a living document: update it when the architecture changes (new model, new data source, new subprocessor). It is also your best tool for dialogue with the DPO.
Conversation logs and query histories contain personal data (the question itself, the retrieved context, sometimes the user's identity) and must not pile up indefinitely (Article 5(1)(e)). Define and enforce concrete retention policies. Some reasonable orders of magnitude, to adapt to your context:
Implementation relies on automatic purges (scheduled jobs) rather than manual deletions you forget. Data you no longer keep is data you no longer have to protect, no longer have to erase on request, and that will not leak.
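The selection logic behind such a purge job can be sketched as follows. The retention periods, record kinds, and the `expired_records` function are hypothetical placeholders chosen for illustration; the right values depend on your context. A record kind with no defined retention period raises an error rather than being kept silently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical retention periods; adapt them to your own policy.
RETENTION = {
    "conversation_log": timedelta(days=30),
    "query_history": timedelta(days=90),
}


@dataclass(frozen=True)
class Record:
    record_id: int
    kind: str
    created_at: datetime


def expired_records(records: list[Record], now: datetime) -> list[Record]:
    """Return the records older than the retention period defined for their kind."""
    expired: list[Record] = []
    for record in records:
        if record.kind not in RETENTION:
            raise ValueError(f"no retention period defined for kind: {record.kind}")
        if now - record.created_at > RETENTION[record.kind]:
            expired.append(record)
    return expired
```

A scheduled job would call this with `datetime.now(timezone.utc)` and delete whatever it returns. Passing `now` as a parameter keeps the function testable.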
GDPR and the EU AI Act stack: they do not replace one another. The first protects personal data (legal basis, minimization, data subject rights, DPIA); the second governs the AI system itself (risk-tier classification, technical documentation, transparency, human oversight).
To situate the timeline: the AI Act entered into force on 1 August 2024; prohibited practices and AI-literacy obligations have applied since February 2025; general-purpose AI (GPAI) model obligations since August 2025; the bulk of high-risk system obligations applies in August 2026, with some in 2027. The approach is by risk tier: unacceptable, high, limited (transparency), minimal.
Most internal RAG assistants fall under the limited-risk (transparency) tier: you must then inform users they are interacting with an AI and, where relevant, mark generated content. But a RAG that feeds into a high-risk decision (screening job applications, evaluating access to an essential service) tips into the "high-risk" category and inherits far heavier obligations: risk management, data quality, technical documentation, logging, human oversight.
In practice, the two regimes feed each other. The data mapping in your DPIA feeds the AI Act's technical documentation; the AI Act's risk analysis overlaps with the DPIA's risk-to-individuals analysis. But one compliance does not stand in for the other: having done a DPIA does not exempt you from AI Act transparency obligations, and vice versa.
GDPR compliance for a RAG is not a legal layer added at the end: it is a series of architectural decisions, from ingestion to inference. Identify your legal basis and write your balancing test, minimize the corpus, tie each embedding to its source to enable erasure and prove it, pseudonymize what leaves and route by sensitivity, map and frame your subprocessors, run a DPIA when the risk demands it, and align it all with the AI Act. Each choice protects your users and strengthens your system. GDPR's least-shared secret is that a compliant architecture is almost always a better architecture: more controlled, more traceable, more trustworthy. Compliance is not the price of enterprise AI: it is its condition for durability.