GDPR and sovereignty: how your data stays in Europe with RagNight
AI agents managed by major US hyperscalers raise compliance issues that many CIOs prefer to ignore. Let's set things straight.
Anonymize or pseudonymize before vectorizing? The GDPR distinction, why perfect anonymization is rare on free text, the techniques (NER, consistent pseudonymization), and the reflexes to adopt.
Before a document enters a RAG pipeline, it passes through a step with heavy consequences: vectorization. From that moment, its content, including any personal data it holds, is duplicated into embeddings, indexed, and potentially sent to external models at query time. Hence a question too many teams defer: should you anonymize or pseudonymize personal data before vectorizing? And how far is that actually possible?
This article surveys the techniques, their limits, and the reflexes to adopt — without selling you a compliance promise the technology cannot keep.
The two terms are often confused, but GDPR separates them clearly. Anonymization transforms data irreversibly, leaving no reasonable means of re-identifying a person; truly anonymous data falls outside GDPR (Recital 26). Pseudonymization replaces direct identifiers with a token or code, but re-identification remains possible using separately held information (Article 4(5)); pseudonymized data is still personal data and stays under GDPR.
The boundary is not a vocabulary detail: it decides whether you still carry obligations (legal basis, retention, data-subject rights, record of processing activities) or not. And the official bar is high. For data to count as anonymous, the European reference body (the Article 29 Working Party, WP29, since replaced by the EDPB) requires, in its Opinion 05/2014 on anonymisation techniques, that none of three risks remain: singling out (isolating an individual in the dataset), linkability (connecting two records to the same person) and inference (deducing information about a person with high probability). On free text, eliminating all three at once is extremely hard.
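On structured fields, the singling-out risk can at least be estimated by counting how many records share each combination of indirect identifiers. The sketch below is only an illustration: the function name `singled_out` and the `role` / `city` fields are hypothetical, and free text offers no such clean columns.

```python
from collections import Counter

def singled_out(records, quasi_identifiers):
    """Return the records whose combination of quasi-identifiers is unique."""
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    counts = Counter(keys)
    # A combination seen exactly once isolates one individual in the dataset.
    return [r for r, k in zip(records, keys) if counts[k] == 1]

records = [
    {"role": "engineer", "city": "Paris"},
    {"role": "engineer", "city": "Paris"},
    {"role": "CFO", "city": "Lyon"},
]
unique = singled_out(records, ["role", "city"])
```

Here the "CFO in Lyon" record is the only one of its kind, so it can be singled out even though no name appears in it.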
Total, irreversible anonymization is a holy grail that is rarely reached on rich text. In practice you mostly do pseudonymization, and you must never pretend otherwise, either internally or to an auditor.
Anonymizing an "email" column is simple. Anonymizing free text — a report, a support ticket, a contract — is another matter:
The danger is rarely the explicit name you masked — it is everything left around it. A few cases seen in production:
The central dilemma: the more you protect, the less utility you keep — and vice versa.
Entity detection (NER) automatically spots names, emails, phone numbers, addresses and identifying numbers (IBAN, social security numbers, license plates). It is the base layer, powered by specialized models (spaCy, Microsoft Presidio, fine-tuned transformers, or an LLM in extraction mode). It is essential but imperfect: it trips on domain jargon, typos, unusual formats, rare or foreign names, and above all on indirect identifiers (a role, a location) that no entity detector flags as "personal data." Expect high but never total recall, and plan a safety net (human review on sensitive corpora, domain pattern lists).
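As a rough illustration of the "domain pattern list" safety net, here is a minimal sketch using only Python's standard `re` module. The patterns, the `PATTERNS` table and the `find_entities` helper are hypothetical and deliberately simplistic (French-style phone numbers, no IBAN checksum); they complement a NER model rather than replace one.

```python
import re

# Hypothetical pattern list: illustrative regexes, not production-grade validators.
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "PHONE_FR": re.compile(r"(?<!\d)(?:\+33 ?|0)[1-9](?:[ .]?\d{2}){4}(?!\d)"),
}

def find_entities(text):
    """Return (label, start, end, value) tuples sorted by position in the text."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])

hits = find_entities("Write to marie.dupont@acme.fr or call 06 12 34 56 78.")
# Indirect identifiers go undetected: this returns an empty list.
missed = find_entities("the legal director of the Lyon subsidiary")
```

The second call shows the limit described above: a role plus a location matches no pattern, even though together they may point to a single person.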
Masking replaces each entity with a fixed marker ([NAME], [EMAIL]). It is simple and robust, but it breaks consistency: two mentions of the same name become indistinguishable from mentions of two different names. A sentence such as "[NAME] forwarded the file to [NAME], who approved it" is ambiguous to a model, which cannot tell whether it involves one person or two. Acceptable for plain log scrubbing, poor for conversational RAG.
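The loss of information is easy to reproduce. In this minimal sketch, a hard-coded regex (`NAME_RE`, a hypothetical stand-in for a real name detector) feeds a fixed-marker `mask` function, and two sentences with opposite meanings collapse into the same output.

```python
import re

# Hypothetical stand-in for a NER model: a fixed list of known names.
NAME_RE = re.compile(r"\b(?:Marie Dupont|Jean Martin)\b")

def mask(text):
    """Replace every detected name with the same fixed marker."""
    return NAME_RE.sub("[NAME]", text)

a = mask("Marie Dupont emailed Jean Martin.")
b = mask("Jean Martin emailed Marie Dupont.")
# Both sentences are now identical: who wrote to whom is lost.
same = (a == b)
```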
Replace each entity with a stable, unique token (PERSON_42), so that every occurrence of the same person gets the same token across the whole corpus. This is what preserves relationships — hence answer utility. Before/after example:
# BEFORE
Marie Dupont (marie.dupont@acme.fr), legal director, negotiated the
contract with Acme Corp then had it signed by Jean Martin, her deputy.
Marie Dupont remains the point of contact for the case.
# AFTER (consistent pseudonymization)
PERSON_1 (EMAIL_1), ROLE_1, negotiated the contract with ORG_1 then
had it signed by PERSON_2, her deputy. PERSON_1 remains the point of
contact for the case.
The model keeps the structuring information: there are two distinct people, a reporting relationship, an organization, and PERSON_1 does appear twice. A question like "who is the contact for the Acme case?" returns PERSON_1, which the application can re-substitute with "Marie Dupont" at display time if and only if the user is authorized — via the mapping vault described below.
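The before/after transformation can be sketched in a few lines of Python. This is an illustration under assumptions, not a reference implementation: the `Pseudonymizer` class is hypothetical, and the entity list is supplied by hand where a real pipeline would take it from the detection layer.

```python
from collections import defaultdict

class Pseudonymizer:
    """Assigns one stable token per (label, value) across every document it sees."""

    def __init__(self):
        self.mapping = {}                  # token -> real value (belongs in the vault)
        self._tokens = {}                  # (label, normalized value) -> token
        self._counters = defaultdict(int)  # per-label counter for token numbering

    def token_for(self, label, value):
        key = (label, value.casefold())
        if key not in self._tokens:
            self._counters[label] += 1
            token = f"{label}_{self._counters[label]}"
            self._tokens[key] = token
            self.mapping[token] = value
        return self._tokens[key]

    def apply(self, text, entities):
        """entities: (label, value) pairs, as a detector would return them."""
        # Allocate tokens in detection order, then replace longest values first
        # so a short value never overwrites part of a longer one.
        pairs = [(value, self.token_for(label, value)) for label, value in entities]
        for value, token in sorted(pairs, key=lambda p: -len(p[0])):
            text = text.replace(value, token)
        return text

p = Pseudonymizer()
source = ("Marie Dupont (marie.dupont@acme.fr), legal director, negotiated the "
          "contract with Acme Corp then had it signed by Jean Martin, her deputy. "
          "Marie Dupont remains the point of contact for the case.")
entities = [("PERSON", "Marie Dupont"), ("EMAIL", "marie.dupont@acme.fr"),
            ("ROLE", "legal director"), ("ORG", "Acme Corp"), ("PERSON", "Jean Martin")]
result = p.apply(source, entities)
# A later document reuses the same token for the same person.
later = p.apply("Jean Martin called.", [("PERSON", "Jean Martin")])
```

Because the instance keeps its state between calls, "Jean Martin" maps to the same token in every document processed with it, which is what preserves relationships across the corpus.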
Generalization replaces a precise value with a category: "34 years old" → "30-40", "March 12, 2024" → "Q1 2024", "Lyon" → "a major French city". It reduces the singling-out risk at the cost of granularity. It is the ideal tool against the quasi-identifiers seen above, but apply it sparingly: generalizing "CFO of the Lyon subsidiary" into "senior executive" can strip the sentence of all analytical value.
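The three examples above translate into small helper functions. The sketch below uses only the standard library; the function names, the ten-year age bucket and the `CITY_CATEGORIES` lookup are assumptions chosen for illustration.

```python
from datetime import date

def generalize_age(age, width=10):
    """Map an exact age to a bucket, e.g. 34 -> '30-40'."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_date(d):
    """Map an exact date to its quarter, e.g. 2024-03-12 -> 'Q1 2024'."""
    quarter = (d.month - 1) // 3 + 1
    return f"Q{quarter} {d.year}"

# Hypothetical lookup table: a real one would be maintained per corpus.
CITY_CATEGORIES = {"Lyon": "a major French city"}

def generalize_city(city):
    """Map a city to a coarser category, with a generic fallback."""
    return CITY_CATEGORIES.get(city, "a city")

age_bucket = generalize_age(34)
quarter = generalize_date(date(2024, 3, 12))
city = generalize_city("Lyon")
```

The bucket width is the dial between protection and utility: wider buckets hide more individuals but answer fewer questions.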
Keep the table linking each token (PERSON_42) to its real value ("Marie Dupont") in a separate, encrypted store under strict access control, with its own retention and its own logging. This vault is what makes pseudonymization reversible — and its very existence is what keeps the data within GDPR scope.
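A minimal sketch of the re-substitution side, assuming tokens shaped like PERSON_42. The `MappingVault` class and its user names are hypothetical, and the in-memory dictionary stands in for what should be an encrypted, separately hosted store; encryption and retention are left out to keep the example short.

```python
import re

TOKEN_RE = re.compile(r"\b[A-Z]+_\d+\b")  # assumed token shape, e.g. PERSON_42

class MappingVault:
    """In-memory stand-in for the encrypted token -> value store."""

    def __init__(self, mapping, authorized_users):
        self._mapping = dict(mapping)
        self._authorized = set(authorized_users)
        self.access_log = []  # (user, allowed) entries, kept for auditing

    def resubstitute(self, text, user):
        """Restore real values only for authorized users; log every attempt."""
        allowed = user in self._authorized
        self.access_log.append((user, allowed))
        if not allowed:
            return text
        # Unknown tokens are left untouched rather than guessed.
        return TOKEN_RE.sub(lambda m: self._mapping.get(m.group(), m.group()), text)

vault = MappingVault({"PERSON_1": "Marie Dupont"}, authorized_users={"dpo"})
shown_to_dpo = vault.resubstitute("Contact: PERSON_1.", user="dpo")
shown_to_intern = vault.resubstitute("Contact: PERSON_1.", user="intern")
```

Matching whole tokens with a regex, rather than chained string replacements, avoids rewriting PERSON_12 when only PERSON_1 is in the mapping.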
For RAG, consistent pseudonymization is often the best compromise: it preserves relationships between entities (hence answer utility) while removing direct identifiers from embeddings and prompts. Generalization complements it, neutralizing the quasi-identifiers NER cannot see.
There is no universal setting: the right dose depends on corpus and use. Take one passage at three treatment levels:
The common mistake is to believe "more masking = better." On a legal or medical corpus, over-anonymizing destroys the very value that justified the RAG project. The right approach: classify data by sensitivity, and apply the heaviest treatment only where it is warranted. Sometimes the best answer is not to anonymize more but to restrict access and control hosting.
Teams often secure ingestion while forgetting the other end of the chain. At answer time, a RAG pipeline retrieves chunks and pastes them into a prompt sent to the generation model. If that model is a non-EU API, you have just exported personal data, even if your vector store is perfectly hosted in Europe.
User query
→ vector search (pseudonymized chunks, or not?)
→ prompt assembly
→ LLM API CALL ← leak point if non-EU and chunks not pseudonymized
→ answer
Two practical consequences. First, if your chunks still contain personal data, every query that surfaces them transmits them to the LLM provider, with all the cross-border-transfer and reuse questions that raises. Second, this is exactly where consistent pseudonymization pays off: by pseudonymizing before vectorization, the retrieved chunks carry only tokens, the external LLM never sees an identity, and re-substitution happens application-side, after the answer, under authorization control. The rule is simple: nothing identifying should cross the boundary of the controlled perimeter.
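One way to enforce that rule is a last-line check on the assembled prompt, just before the external call. The sketch below is an assumption-laden illustration: `check_outbound`, `answer`, `LeakError` and the `call_llm` callable are hypothetical names, and `known_values` would come from the mapping vault in a real system.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class LeakError(Exception):
    """Raised when an outbound prompt still contains identifying values."""

def check_outbound(prompt, known_values):
    """Block the prompt if it contains a known real value or an email address."""
    lowered = prompt.casefold()
    leaks = [v for v in known_values if v.casefold() in lowered]
    leaks += EMAIL_RE.findall(prompt)
    if leaks:
        # Report only a count, so the error message does not leak the values.
        raise LeakError(f"{len(leaks)} identifying value(s) found in outbound prompt")
    return prompt

def answer(question, chunks, call_llm, known_values):
    """Assemble the prompt, check it, then call the (external) model."""
    prompt = "Context:\n" + "\n".join(chunks) + "\n\nQuestion: " + question
    return call_llm(check_outbound(prompt, known_values))

# call_llm is a placeholder for the real API client.
reply = answer("Who is the contact?", ["PERSON_1 remains the contact."],
               call_llm=lambda prompt: "PERSON_1", known_values=["Marie Dupont"])
```

The check covers the user's question as well as the retrieved chunks, since a query can itself contain a name. It is a safety net and not a substitute for pseudonymizing at ingestion.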
Anonymizing before vectorization is a good intention that meets a technical reality: on rich text, perfect anonymization is rare, and consistent pseudonymization is the realistic compromise. The key is not to lie to yourself about the real level of protection, to address context-based re-identification as much as direct identifiers, to systematically pseudonymize what leaves your perimeter, and to cover the rest with access control and sovereign hosting. Protecting personal data in a RAG is not a checkbox: it is a chain of decisions, from ingestion to inference.