
GDPR and generative AI: the compliance guide for your RAG projects

Alexia · 18 min read · February 10, 2026

The GDPR compliance guide for your RAG projects: legal basis, minimization, personal data in embeddings and prompts, the right to erasure cascading into vectors, subprocessors, and articulation with the AI Act.

Wiring a RAG assistant to a company's internal documentation almost always means handling personal data: names in reports, customer details in tickets, identities in contracts, signatures at the bottom of emails. GDPR does not stop at your AI project's door — on the contrary, generative AI amplifies the stakes, because it duplicates, vectorizes, and sometimes exfiltrates that data to external models. The good news: a well-built RAG is also a compliant RAG. Compliance is not a legal layer bolted on afterward — it is an architectural property decided at ingestion time, not at the moment the regulator comes knocking.

This guide walks, point by point, through GDPR compliance for a RAG project — from choosing the legal basis to erasure cascading into embeddings, by way of the DPIA, the subprocessor register, and the articulation with the EU AI Act. The goal is not to turn you into a lawyer, but to give you the engineering decisions that make a system defensible.

What personal data lives in a RAG?

First reflex: map where personal data lives in your pipeline. It is everywhere, often where you don't expect it:

in ingested documents (meeting reports, contracts, emails, support tickets, HR files);
in embeddings — a vector derives from source text; it is not magically anonymous;
in prompts sent to the model, which contain user questions and the retrieved context;
in logs (conversation histories, query logs, observability traces);
in the metadata attached to chunks (document author, date, originating department);
in the user's own data (who asks what, and when).

Take a concrete case. A company indexes its annual review notes so that an assistant can help managers prepare appraisals. Each document contains the employee's name, assessments, sometimes health information (a sick leave mentioned), compensation elements. On ingestion, that document becomes thirty-odd chunks, each vectorized into a 1536- or 3072-dimensional embedding, stored in an index. A manager's question ("What were Jean Dupont's areas for improvement last year?") becomes a prompt that contains the name and pulls back the most sensitive passages of the file. On a single query you have processing of potentially sensitive data, disclosure to a model, and a persistent trace. If that model is a non-EU API, the data has traveled.

A classic error: believing vectorization "anonymizes" data. An embedding is a transformation of content, not a deletion. Research has shown that a substantial portion of the original text can be reconstructed from its vectors alone (embedding inversion attacks). If it derives from personal data, the embedding remains subject to GDPR.

Until this mapping is done, everything else is flying blind. It is also what will feed your record of processing activities and, where relevant, your DPIA.

Legal basis: the mandatory starting point

No processing of personal data is lawful without a legal basis. That is Article 6 of the GDPR, and it admits no exception: no legal basis, no processing. For an enterprise RAG, four bases matter in practice: legitimate interest, consent, contract performance, and legal obligation.

Legitimate interest and its balancing test

This is the basis most often chosen for an internal productivity tool. But legitimate interest is not a blank check: it requires a balancing test in three steps, which must be documented in writing.

The purpose test: is the interest pursued legitimate? "Improving team productivity by making internal knowledge easily searchable" is an acceptable legitimate interest.
The necessity test: is the processing necessary for that purpose? Indexing entire HR files to answer questions about internal procedures is not — less intrusive means exist. This is where minimization meets the legal basis.
The balancing itself: do the company's interests override the rights and freedoms of the data subjects? You weigh people's reasonable expectations, the existence of safeguards (pseudonymization, access restriction, right to object), and the sensitivity of the data.

Concretely: an assistant querying the technical document base and internal procedures passes the test easily. An assistant that vacuums up every employee's personal emails to "know everything" fails the balancing — an employee's reasonable expectation is not that their exchanges be vectorized and exposed to colleagues through a chatbot.

The special case of the employee

Consent is tempting but treacherous in an employment context. The subordination relationship weakens its validity: an employee is not in a position to refuse freely when the employer asks. The EDPB and national authorities are consistent on this. For an internal tool, prefer legitimate interest (for a convenience tool) or performance of the employment contract / legal obligation (for processing genuinely necessary to the working relationship). Reserve consent for cases where it is truly free and revocable without consequence.

Contract performance and legal obligation

An assistant processing data strictly necessary to provide a contractual service (for example answering customers from their own files) can rely on contract performance. Legal obligation covers rarer cases (retention of accounting documents, for instance).

Identify the legal basis for each category of ingested data, not for "the RAG project" as a block. Technical documentation often falls under legitimate interest; customer files, under contract performance; sensitive HR data may require a separate basis, or be excluded from the corpus altogether. This is the foundation: everything else follows.
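
One way to make the "per category, not per project" rule enforceable is to have ingestion refuse any category with no documented basis. The sketch below is illustrative only: the category names and the `basis_for` helper are hypothetical, and the mapping itself is a decision for your DPO, not for the code.

```python
from enum import Enum
from typing import Dict, Optional


class LegalBasis(Enum):
    LEGITIMATE_INTEREST = "legitimate interest, Art. 6(1)(f)"
    CONTRACT = "contract performance, Art. 6(1)(b)"
    LEGAL_OBLIGATION = "legal obligation, Art. 6(1)(c)"


# Hypothetical category names. None means "deliberately excluded from the corpus".
BASIS_BY_CATEGORY: Dict[str, Optional[LegalBasis]] = {
    "technical_docs": LegalBasis.LEGITIMATE_INTEREST,
    "customer_files": LegalBasis.CONTRACT,
    "hr_reviews": None,
}


def basis_for(category: str) -> LegalBasis:
    """Return the documented legal basis, or refuse ingestion."""
    if category not in BASIS_BY_CATEGORY:
        raise ValueError(f"no legal basis documented for {category!r}")
    basis = BASIS_BY_CATEGORY[category]
    if basis is None:
        raise ValueError(f"{category!r} is excluded from the corpus")
    return basis
```

Calling `basis_for` at the top of the ingestion job means a new data source cannot slip into the index before someone has written down why it may be processed.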

Minimization: less is more

The minimization principle (Article 5(1)(c)) is also good engineering. Ingest only documents genuinely useful for the intended use. A bloated corpus:

increases the risk surface (more personal data exposed, more targets in a breach);
often degrades relevance (noise, duplicates, contradictions, stale documents surfacing in results);
burdens compliance (more to document, protect, and erase).

In practice, minimization breaks down into several concrete levers:

Filtering at the source: exclude up front the spaces that have no business in an assistant (HR files, sensitive legal, personal mailboxes). An explicit allow-list beats a deny-list.
Dropping useless fields: if you index tickets, do you need the customer's phone number in the chunk, or only the problem description?
Redaction at ingestion: strip or mask direct identifiers that don't serve the purpose (see pseudonymization, below).
Freshness: don't index stale versions that multiply copies of outdated personal data.
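
The first two levers can be sketched in a few lines. This is a minimal illustration, assuming documents arrive tagged with a source space and tickets arrive as dictionaries; the space names and field names are hypothetical.

```python
ALLOWED_SPACES = {"engineering-wiki", "support-tickets"}    # hypothetical space names
KEPT_TICKET_FIELDS = ("ticket_id", "title", "description")  # hypothetical field names


def is_ingestible(space: str) -> bool:
    """Allow-list: any space not explicitly listed is rejected."""
    return space in ALLOWED_SPACES


def minimize_ticket(ticket: dict) -> dict:
    """Keep only the fields the assistant needs; everything else is dropped."""
    return {key: ticket[key] for key in KEPT_TICKET_FIELDS if key in ticket}
```

With an allow-list, a newly created space (a fresh HR share, say) stays out of the index until someone adds it on purpose, which is the safer default.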

Minimization serves two masters at once: GDPR compliance and RAG quality. A tight, relevant corpus is safer and more performant. The noise you remove for compliance is the same noise that was degrading your retrieval results.

The right to erasure: the cascade into embeddings

The most often botched requirement, technically. When someone exercises their right to erasure (Article 17), deleting the source document is not enough. The information also lives in:

the chunks derived from the document;
the embeddings of those chunks;
the vector indexes (which may keep a reference even after the row is deleted);
any copies: result caches, backups, conversation logs that pulled the passage back.

A data model that makes erasure possible

The technical precondition for erasure is lineage traceability: each embedding must know which chunk it came from, and each chunk which document. A simple relational schema suffices, as long as the foreign keys are laid down from the start. With PostgreSQL and pgvector, it looks like:

CREATE EXTENSION IF NOT EXISTS vector;  -- pgvector: provides the vector type and HNSW

CREATE TABLE documents (
  id           BIGSERIAL PRIMARY KEY,
  source_ref   TEXT NOT NULL,        -- identifier on the source side (DMS, ticket…)
  subject_id   TEXT,                 -- data subject, if identifiable
  created_at   TIMESTAMPTZ DEFAULT now()
);

CREATE TABLE document_chunks (
  id           BIGSERIAL PRIMARY KEY,
  document_id  BIGINT NOT NULL REFERENCES documents(id) ON DELETE CASCADE,
  content      TEXT NOT NULL,
  embedding    vector(1536)          -- HNSW index on top
);

CREATE INDEX ON document_chunks USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON document_chunks (document_id);  -- lets the cascade find chunks without a full scan

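The ON DELETE CASCADE is the detail that changes everything: deleting a documents row automatically deletes its chunks and their embeddings in the same transaction, so they stop appearing in search results as soon as it commits. Physical removal from the table and from the HNSW index happens later, when VACUUM (or autovacuum) processes the dead rows. Erasing a data subject then becomes a deterministic query: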

-- Erase all documents tied to a data subject
DELETE FROM documents WHERE subject_id = $1;
-- chunks + embeddings cascaded; dead index entries are reclaimed at the next VACUUM

If you store your vectors in a separate engine rather than in PostgreSQL, you lose the free CASCADE: you must then propagate the deletion yourself, by document identifier, and check how and when the engine physically removes the vector from the index (many engines first mark it deleted and only drop it at a later compaction). Keeping vector storage inside the transactional database spares you that second system to keep in sync, which is one argument for it.
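
When the vectors do live in a separate engine, the propagation step might look like the sketch below. The `VectorStore` interface is hypothetical (two methods you would map onto your engine's real client), and `InMemoryStore` is a toy stand-in that exists only to exercise the function.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class VectorStore(Protocol):
    """Hypothetical minimal interface; map it onto your engine's real client."""

    def delete_by_document(self, document_id: int) -> int:
        """Delete every vector of a document; return how many were removed."""
        ...

    def count_by_document(self, document_id: int) -> int:
        """Return how many vectors of a document are still stored."""
        ...


def propagate_erasure(store: VectorStore, document_ids: Iterable[int]) -> Dict[int, int]:
    """Delete vectors document by document and verify that none remain."""
    deleted: Dict[int, int] = {}
    for document_id in document_ids:
        deleted[document_id] = store.delete_by_document(document_id)
        remaining = store.count_by_document(document_id)
        if remaining:
            raise RuntimeError(f"{remaining} vectors left for document {document_id}")
    return deleted


class InMemoryStore:
    """Toy stand-in used only to exercise the sketch."""

    def __init__(self, rows: List[Tuple[str, int]]) -> None:
        self.rows = list(rows)  # (vector_id, document_id) pairs

    def delete_by_document(self, document_id: int) -> int:
        before = len(self.rows)
        self.rows = [row for row in self.rows if row[1] != document_id]
        return before - len(self.rows)

    def count_by_document(self, document_id: int) -> int:
        return sum(1 for row in self.rows if row[1] == document_id)
```

The count check only confirms that the vectors are no longer retrievable. Whether they are physically gone depends on the engine's compaction behavior, which this sketch does not cover.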

Proving erasure

GDPR asks you not only to erase but to be able to prove it (accountability, Article 5(2)). Log every erasure in a dedicated journal: who requested, when, on which subject_id, how many documents/chunks deleted, and confirmation that caches and backups were purged. Mind the paradox: this proof journal must not itself reintroduce the erased data — record technical identifiers and a timestamp, not the deleted content.
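
A journal entry along these lines could carry the proof without carrying the data. It is a sketch under assumptions: the field names are hypothetical, `subject_id` is taken to be a technical identifier rather than a name, and the JSON-lines format is just one convenient choice.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class ErasureRecord:
    subject_id: str                  # technical identifier only, never a name
    requested_by: str
    erased_at: str                   # ISO 8601 timestamp, UTC
    documents_deleted: int
    chunks_deleted: int
    caches_purged: bool
    backups_expiry_scheduled: bool


def make_record(subject_id: str, requested_by: str, documents_deleted: int,
                chunks_deleted: int, caches_purged: bool,
                backups_expiry_scheduled: bool,
                now: Optional[datetime] = None) -> ErasureRecord:
    """Build a proof-of-erasure record holding counts and flags, not content."""
    now = now or datetime.now(timezone.utc)
    return ErasureRecord(subject_id, requested_by, now.isoformat(),
                         documents_deleted, chunks_deleted,
                         caches_purged, backups_expiry_scheduled)


def to_journal_line(record: ErasureRecord) -> str:
    """Serialize the record as one JSON line for an append-only journal."""
    return json.dumps(asdict(record), sort_keys=True)
```

The record deliberately has no field for document titles or chunk text, so the journal cannot become a second copy of what was erased.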

Don't forget the blind spots: backups (define an expiry/rotation policy rather than keeping backups full of erased data forever), response caches, and conversation histories where a snippet of the document may have been served to a user.

Erasure requested
   ─► delete the source document
   ─► delete the attached chunks (CASCADE)
   ─► delete the corresponding embeddings + index
   ─► purge response caches and memorized results
   ─► schedule expiry in backups
   ─► log the operation in the proof journal

Design erasure from the start. Retrofitting it onto an architecture where embeddings aren't linked to their source is painful (sometimes you must re-vectorize the whole corpus to rebuild the lineage) and exposes you to costly compliance failures.

Personal data in prompts

Each query sends the model the user's question and the retrieved context, which may contain personal data. If the model is a non-EU API, that data leaves your perimeter at the precise moment of inference — the most frequent and most underestimated leak point. Two complementary defenses.

Detection and pseudonymization before sending

Before sending the context to an external model, detect personal entities (names, emails, numbers, IBANs…) and replace them with stable tokens. "Stable" is the key word: the same name must yield the same token throughout the prompt, or the model loses the thread. Example:

Raw prompt:
"Jean Dupont (jean.dupont@acme.fr) reported a bug on 03/12.
 Follow-up requested by Marie Durand."

Pseudonymized prompt sent to the model:
"[PERSON_1] ([EMAIL_1]) reported a bug on 03/12.
 Follow-up requested by [PERSON_2]."

On receipt, you re-substitute the tokens with the real values before displaying the answer to the authorized user. The external model never saw the identities. You keep a token → value mapping on the EU side, in session memory, never persisted at the subprocessor.

This technique has limits. It does not make the data anonymous under GDPR: because you hold the mapping and can re-identify, this is pseudonymization, which remains personal-data processing. It can fail on poorly detected entities or on indirectly identifying data ("the CFO of the Lyon subsidiary"), and it sometimes degrades answer quality. But it sharply reduces exposure at the most common leak point.
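
The stable-token mechanism can be sketched as follows. Detection here is deliberately naive and is an assumption of the example: emails are matched with a simple regular expression, and person names come from a list you supply (for instance from a directory). A real deployment would more likely rely on a named-entity recognizer. The `Pseudonymizer` class is hypothetical.

```python
import re
from typing import Dict, Iterable

# Simplified email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class Pseudonymizer:
    """Replaces detected entities with stable tokens and restores them later."""

    def __init__(self, known_names: Iterable[str]) -> None:
        # Longest names first so "Jean Dupont" wins over a shorter overlapping name.
        names = sorted(set(known_names), key=len, reverse=True)
        self._name_re = re.compile("|".join(re.escape(n) for n in names)) if names else None
        self._token_by_value: Dict[str, str] = {}
        self._value_by_token: Dict[str, str] = {}
        self._counters: Dict[str, int] = {}

    def _token(self, kind: str, value: str) -> str:
        # Same value, same token: this is what keeps the tokens stable.
        if value not in self._token_by_value:
            self._counters[kind] = self._counters.get(kind, 0) + 1
            token = f"[{kind}_{self._counters[kind]}]"
            self._token_by_value[value] = token
            self._value_by_token[token] = value
        return self._token_by_value[value]

    def pseudonymize(self, text: str) -> str:
        text = EMAIL_RE.sub(lambda m: self._token("EMAIL", m.group(0)), text)
        if self._name_re is not None:
            text = self._name_re.sub(lambda m: self._token("PERSON", m.group(0)), text)
        return text

    def restore(self, text: str) -> str:
        """Re-substitute real values into the model's answer, on your side only."""
        for token, value in self._value_by_token.items():
            text = text.replace(token, value)
        return text
```

Created with `Pseudonymizer(["Jean Dupont", "Marie Durand"])`, the object turns the raw prompt above into the pseudonymized one, and `restore` reverses it. One instance per session keeps the mapping in memory and out of any persistent store.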

Routing by sensitivity

Not all queries are equal. Set up routing: queries whose context contains sensitive data (health, opinions, HR data) are processed by an EU or self-hosted model (Mistral on La Plateforme EU, an open model like Llama 4 or Mistral Small 3 on OVHcloud or Scaleway); benign queries can go to a more capable external API. Routing rests on classifying the retrieved context, not just the question — it is the context that carries the risk.
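
A minimal routing sketch, assuming each chunk was given sensitivity labels at ingestion time (an assumption of this example; the label names, the `RetrievedChunk` type, and the backend names are all hypothetical):

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable

SENSITIVE_LABELS = frozenset({"health", "opinions", "hr"})  # hypothetical label names


@dataclass(frozen=True)
class RetrievedChunk:
    content: str
    labels: FrozenSet[str] = frozenset()  # assigned at ingestion, not at query time


def choose_backend(chunks: Iterable[RetrievedChunk]) -> str:
    """Route on the retrieved context: one sensitive chunk is enough."""
    for chunk in chunks:
        if chunk.labels & SENSITIVE_LABELS:
            return "eu_self_hosted"
    return "external_api"
```

The decision looks only at the chunks, never at the question text, which matches the point above. A stricter variant would also send unlabeled chunks to the EU backend, so that a labeling gap fails safe.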

Subprocessors and non-EU transfers

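Under GDPR (Article 28), every vendor that processes your data on your behalf is a processor; strictly speaking, a subprocessor is a vendor that your processor engages in turn. This guide uses "subprocessor" loosely for the whole chain. The scope is broader than people think.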

Map the whole chain

The reflex is to think only of the LLM. Yet a RAG's full chain often includes:

the generation model provider (OpenAI, Anthropic, Mistral, Google…);
the embeddings provider (which sees your texts at vectorization time);
the hosting of the vector database and the application;
the observability / tracing tool (which may log entire prompts, hence personal data);
upstream detection / OCR / parsing services for documents.

Each must appear in your subprocessor register, with its location, the nature of the data processed, and the signed DPA (Data Processing Agreement). An observability tool that captures prompts in clear text and stores them in the US is a non-EU transfer many teams overlook.

Frame non-EU transfers

For any transfer outside the European Economic Area, you need an appropriate safeguard (Chapter V of the GDPR):

an adequacy decision (for example the EU-US Data Privacy Framework for certified US providers — check the certification, it is not automatic);
failing that, the European Commission's standard contractual clauses (SCCs), supplemented by a transfer impact assessment (TIA) post-Schrems II evaluating whether the destination country's law offers equivalent protection, and supplementary measures (encryption, pseudonymization) where needed.

Also verify non-retention and no-reuse-for-training commitments: most "enterprise" offerings from major providers commit not to train on your data and not to keep it beyond a short anti-abuse window, but consumer offerings often do not. The line between the two is a classic source of leakage.

Your model provider, your vector hosting, your observability tool, your OCR: all can be subprocessors. Map the full chain, not just the LLM, and keep the register current with every new tool wired into the pipeline.
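
Keeping the register as data rather than as a spreadsheet makes it checkable in CI. The sketch below is one possible shape, with hypothetical field names and vendor entries; it flags the two gaps discussed in this section, a missing DPA and a non-EEA transfer with no documented safeguard.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Vendor:
    name: str
    role: str                  # e.g. "generation", "embeddings", "hosting", "observability", "ocr"
    country: str
    in_eea: bool
    dpa_signed: bool
    transfer_safeguard: Optional[str] = None  # e.g. "adequacy" or "scc+tia"; None if undocumented


def register_gaps(vendors: List[Vendor]) -> List[str]:
    """List what is missing before the register can be called complete."""
    gaps: List[str] = []
    for vendor in vendors:
        if not vendor.dpa_signed:
            gaps.append(f"{vendor.name}: no signed DPA")
        if not vendor.in_eea and vendor.transfer_safeguard is None:
            gaps.append(f"{vendor.name}: non-EEA transfer without a documented safeguard")
    return gaps
```

Running `register_gaps` whenever a new tool is wired into the pipeline turns "keep the register current" into a failing check rather than a good intention.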

The DPIA: when and how

The Data Protection Impact Assessment (DPIA — Article 35) is mandatory when processing is likely to result in a high risk to individuals' rights and freedoms.

When is it required?

Article 35(3) of the GDPR and the EDPB-endorsed guidelines set the criteria. A DPIA is required notably for large-scale evaluation/scoring, large-scale processing of sensitive data, systematic monitoring, or matching/combining datasets. As a rule of thumb from those guidelines, processing that meets two of their nine criteria generally calls for a DPIA. National authorities also publish lists of processing that mandate a DPIA. For a RAG, several situations typically trigger it:

large-scale indexing of HR, health, or other sensitive data;
an assistant that produces some form of evaluation of individuals (performance summaries, candidate scoring);
cross-referencing several sources that, combined, become strongly identifying.

Conversely, an assistant querying only technical documentation with no significant personal data generally does not trigger a DPIA — but document that reasoning (the "why not" is part of accountability).

What it contains

A well-made DPIA describes:

The processing: purposes, data categories, flows (ingestion → vectorization → retrieval → generation → logs), subprocessors and transfers.
Necessity and proportionality: legal basis, minimization, retention periods, information to data subjects.
The risks to individuals: unauthorized access, re-identification via embeddings, leakage via the external model, serving data to an unauthorized user (a RAG without chunk-level access control can "answer" with data a user should not see).
The measures: pseudonymization, routing by sensitivity, chunk-level access control, cascading erasure, encryption, logging.
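
Chunk-level access control, named among the measures above, comes down to filtering retrieved chunks against the caller's permissions before anything reaches the prompt. A minimal sketch, assuming each chunk inherits a set of allowed groups from its source document (the `Chunk` type and group names are hypothetical):

```python
from typing import FrozenSet, Iterable, List, NamedTuple


class Chunk(NamedTuple):
    chunk_id: int
    content: str
    allowed_groups: FrozenSet[str]  # inherited from the source document's permissions


def authorized(chunks: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only chunks the user may see; a chunk with no groups is never served."""
    groups = frozenset(user_groups)
    return [chunk for chunk in chunks if chunk.allowed_groups & groups]
```

Filtering after retrieval is the simplest form. Pushing the same condition into the vector query itself avoids fetching forbidden chunks at all, at the cost of a more involved query.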

The DPIA is a living document: update it when the architecture changes (new model, new data source, new subprocessor). It is also your best tool for dialogue with the DPO.

Retention

Conversation logs and query histories contain personal data (the question itself, the retrieved context, sometimes the user's identity) and must not pile up indefinitely (Article 5(1)(e)). Define and enforce concrete retention policies. Some reasonable orders of magnitude, to adapt to your context:

User conversation histories: short retention (say 30 to 90 days) for usability, then automatic purge or anonymization.
Technical logs / observability traces containing prompts: minimal retention, ideally with prompt pseudonymization at capture, and purge within a few weeks.
Indexed documents: duration aligned with the source document's retention in its system of origin — a document deleted at the source must be deleted in the RAG.
Backups: rotation with expiry, so erased data does not survive indefinitely in a backup.

Implementation relies on automatic purges (scheduled jobs) rather than manual deletions you forget. Data you no longer keep is data you no longer have to protect, no longer have to erase on request, and that will not leak.
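
The selection logic of such a purge job can be sketched as below. The retention values are examples picked from the ranges above, not recommendations, and the record shape (`id`, `kind`, `created_at`) is hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List

# Example values only; set them from your own retention policy.
RETENTION: Dict[str, timedelta] = {
    "conversation": timedelta(days=90),
    "trace": timedelta(days=21),
}


def ids_to_purge(records: Iterable[dict], now: datetime) -> List[str]:
    """Return the ids of records older than the retention period for their kind."""
    expired: List[str] = []
    for record in records:
        limit = RETENTION[record["kind"]]  # KeyError on a kind with no policy
        if now - record["created_at"] > limit:
            expired.append(record["id"])
    return expired
```

Raising on an unknown kind is deliberate: a new type of log should not start accumulating before someone has decided how long it may live.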

Articulation with the EU AI Act

GDPR and the EU AI Act stack: they do not replace one another. The first protects personal data (legal basis, minimization, data subject rights, DPIA); the second governs the AI system itself (risk-tier classification, technical documentation, transparency, human oversight).

To situate the timeline: the AI Act entered into force on 1 August 2024; prohibited practices and AI-literacy obligations have applied since February 2025; general-purpose AI (GPAI) model obligations since August 2025; the bulk of high-risk system obligations applies in August 2026, with some in 2027. The approach is by risk tier: unacceptable, high, limited (transparency), minimal.

Most internal RAG assistants fall into the limited-risk tier, which carries transparency obligations: you must inform users they are interacting with an AI and, where relevant, mark generated content. But a RAG that feeds into a high-risk decision (screening job applications, evaluating access to an essential service) tips into the "high-risk" category and inherits far heavier obligations: risk management, data quality, technical documentation, logging, human oversight.

In practice, the two regimes feed each other. The data mapping in your DPIA feeds the AI Act's technical documentation; the AI Act's risk analysis overlaps with the DPIA's risk-to-individuals analysis. But one compliance does not stand in for the other: having done a DPIA does not exempt you from AI Act transparency obligations, and vice versa.

RAG compliance checklist

[ ] Mapping of personal data across the pipeline (docs, chunks, metadata, embeddings, prompts, logs, user data)
[ ] Legal basis identified and documented for each category of ingested data; written balancing test if legitimate interest
[ ] Employee consent avoided in favor of legitimate interest or contract performance
[ ] Minimization: explicit allow-list, exclusion of sensitive spaces, redaction at ingestion
[ ] Cascading erasure: embeddings tied to source (ON DELETE CASCADE or explicit propagation), index and cache purge
[ ] Proof of erasure: dedicated journal, backups and histories accounted for
[ ] Pseudonymization before any send to an external model (stable tokens, EU-side re-substitution)
[ ] Routing by sensitivity to an EU / self-hosted model for sensitive queries
[ ] Chunk-level access control: a user retrieves only what they are allowed to see
[ ] Subprocessor register complete (model, embeddings, hosting, observability, OCR) with signed DPAs
[ ] Non-EU transfers framed (adequacy / SCC + TIA + supplementary measures); non-retention and no-training commitments verified
[ ] Retention periods defined and enforced via automatic purge (histories, logs, backups)
[ ] DPIA done where required, and kept up to date
[ ] AI Act transparency: user information, generated-content marking; risk-tier classification
[ ] Traceability: logging of access and processing

Conclusion

GDPR compliance for a RAG is not a legal layer added at the end: it is a series of architectural decisions, from ingestion to inference. Identify your legal basis and write your balancing test, minimize the corpus, tie each embedding to its source to enable erasure and prove it, pseudonymize what leaves and route by sensitivity, map and frame your subprocessors, run a DPIA when the risk demands it, and articulate it all with the AI Act — each choice protects your users and strengthens your system. GDPR's least-shared secret is that a compliant architecture is almost always a better architecture — more controlled, more traceable, more trustworthy. Compliance is not the price of enterprise AI: it is its condition for durability.
