
Anonymization before vectorization: techniques, limits, and good reflexes

The RagNight team · 10 min read · April 21, 2026

Anonymize or pseudonymize before vectorizing? The GDPR distinction, why perfect anonymization is rare on free text, the techniques (NER, consistent pseudonymization), and the reflexes to adopt.

Before a document enters a RAG pipeline, it passes through a step with heavy consequences: vectorization. From that moment, its content — including any personal data it holds — is duplicated into embeddings, indexed, and potentially sent to external models as queries flow. Hence a question too many teams defer: should you anonymize or pseudonymize personal data before vectorizing? And how far is that actually possible?

This article surveys the techniques, their limits, and the reflexes to adopt — without selling you a compliance promise the technology cannot keep.

Anonymization vs pseudonymization: a distinction that changes everything

The two terms are often confused. GDPR separates them clearly: anonymization transforms data irreversibly, with no reasonable way to re-identify a person — truly anonymous data falls outside GDPR. Pseudonymization replaces direct identifiers (with a token, a code), but re-identification remains possible via separately held information — pseudonymized data remains personal data and stays under GDPR.

The boundary is not a vocabulary detail: it decides whether you still carry obligations (legal basis, retention, data-subject rights, record of processing activities) or not. And the official bar is high. For data to count as anonymous, the Article 29 Working Party (WP29, since replaced by the EDPB) requires, in its Opinion 05/2014 on anonymisation techniques, that none of three risks remain: singling out (isolating an individual in the dataset), linkability (connecting two records to the same person) and inference (deducing information about a person with high probability). On free text, eliminating all three at once is extremely hard.
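To make the singling-out criterion concrete, here is a minimal sketch that counts how many records share each combination of quasi-identifiers. The field names and the sample records are hypothetical, and a real assessment on free text would first need those attributes extracted from the documents.

```python
from collections import Counter

def singled_out(records, quasi_identifiers):
    """Return the records whose quasi-identifier combination is unique."""
    def key(record):
        return tuple(record[field] for field in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] == 1]

# Hypothetical HR extract: names are already removed, yet one row stands alone.
records = [
    {"team": "Data", "joined": "2024-03"},
    {"team": "Data", "joined": "2023-09"},
    {"team": "Data", "joined": "2023-09"},
    {"team": "Sales", "joined": "2024-03"},
    {"team": "Sales", "joined": "2024-03"},
]

unique = singled_out(records, ["team", "joined"])
print(unique)  # the single Data hire of March 2024
```

Any record returned here can be isolated without a name, so the dataset cannot be called anonymous on that criterion alone.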

Total, irreversible anonymization is a holy grail rarely reached on rich text. In practice you mostly do pseudonymization, and you must never pretend otherwise, whether internally or to an auditor.

Why it is hard on unstructured text

Anonymizing an "email" column is simple. Anonymizing free text — a report, a support ticket, a contract — is another matter:

  • personal data is diffuse: a name mid-sentence, a role + city that uniquely identify someone, an interview date cross-referenced with a department.
  • context re-identifies: "the CFO of our Lyon subsidiary" contains no name but designates one person.
  • removal degrades utility: over-redact and the passage becomes useless for RAG.

Three concrete examples of re-identification by context

The danger is rarely the explicit name you masked — it is everything left around it. A few cases seen in production:

  1. The cross-referenced quasi-identifier. An HR note says: "the only person on the Data team who joined in March 2024." No name, no email. But anyone with the org chart finds the individual in ten seconds. The combination department + start date + uniqueness is a fully working indirect identifier.
  2. The leak through style and detail. A support ticket reads: "like last time, my dog Pixel walked on the keyboard during the Tuesday call with the Nantes team." The pet's name, the recurring slot and the office are enough to single out the author within an internal corpus.
  3. Aggregation across chunks. Taken alone, two passages are harmless. But RAG retrieves them together: chunk A mentions "our wine-sector client in Gironde, 12 employees," chunk B "the owner who divorced last year." Concatenated at answer time, they reconstruct a nameable person. This is the classic blind spot: we reason document by document, while RAG reasons by cross-matching.
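The third case can be checked at retrieval time rather than at ingestion. The sketch below assumes each chunk already carries a set of quasi-identifier categories produced by some upstream tagger. That tagger, the category names, and the threshold of three are all hypothetical.

```python
def aggregation_risk(retrieved_chunks, max_categories=3):
    """Flag a retrieved set whose chunks jointly expose too many quasi-identifier categories."""
    combined = set()
    for chunk in retrieved_chunks:
        combined |= chunk["quasi_identifiers"]
    worst_single = max(
        (len(c["quasi_identifiers"]) for c in retrieved_chunks), default=0
    )
    return {
        "combined": sorted(combined),
        "flagged": len(combined) > max_categories,
        # True when no chunk alone crosses the threshold but the set does.
        "only_visible_in_aggregate": len(combined) > max_categories >= worst_single,
    }

# Hypothetical tags for the two chunks described above.
chunk_a = {"id": "A", "quasi_identifiers": {"sector", "region", "company_size"}}
chunk_b = {"id": "B", "quasi_identifiers": {"role", "family_event"}}

report = aggregation_risk([chunk_a, chunk_b])
print(report["flagged"], report["only_visible_in_aggregate"])  # True True
```

A flagged set could then be trimmed, generalized further, or routed to a stricter access path before prompt assembly.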

The central dilemma: the more you protect, the less utility you keep — and vice versa.

Available techniques

1. Named-entity recognition (NER)

Automatically spot names, emails, phones, addresses, numbers (IBAN, social-security, license plates). It is the base layer, powered by specialized models (spaCy, Microsoft Presidio, fine-tuned transformers, or an LLM in extraction mode). Essential but imperfect: it trips on domain jargon, typos, unusual formats, rare or foreign names, and above all on indirect identifiers (a role, a location) that no entity detector flags as "personal data." Expect high but never total recall: plan a safety net (human review on sensitive corpora, domain pattern lists).
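As an illustration of the "domain pattern lists" safety net, here is a small regex layer for structured identifiers, meant to run next to an NER model and not to replace one. The patterns are deliberately simplified assumptions (French phone format, IBAN shape without checksum validation) and will both miss and over-match real-world variants.

```python
import re

# Simplified, illustrative patterns: shape checks only, no validation.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b"),
    "PHONE": re.compile(r"(?<!\d)(?:\+33 ?|0)[1-9](?:[ .]?\d{2}){4}(?!\d)"),
}

def find_identifiers(text):
    """Return (label, start, end, value) for every pattern match, in text order."""
    hits = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((label, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])

sample = (
    "Contact marie.dupont@acme.fr or 06 12 34 56 78, "
    "IBAN FR76 3000 6000 0112 3456 7890 189."
)
hits = find_identifiers(sample)
for label, start, end, value in hits:
    print(label, value)
```

Patterns like these catch formats a statistical model may skip, but they say nothing about indirect identifiers such as a role or a location.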

2. Masking / removal

Replace the entity with a fixed marker ([NAME], [EMAIL]). Simple and robust, but it breaks consistency: two mentions of the same name become indistinguishable. The text "[NAME] forwarded the file to [NAME], who approved it" becomes unreadable to a model — impossible to tell whether it is one person or two. Acceptable for plain log scrubbing, poor for conversational RAG.
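A minimal sketch of fixed-marker masking shows the loss of consistency directly. It assumes the entities have already been detected upstream; the `entities` dictionary and the `mask` helper are hypothetical.

```python
def mask(text, entities):
    """Replace each detected entity with a fixed marker for its type."""
    # Longest values first, so a short entity never splits a longer one.
    for value in sorted(entities, key=len, reverse=True):
        text = text.replace(value, f"[{entities[value]}]")
    return text

# Hypothetical detector output: surface form -> entity type.
entities = {"Marie Dupont": "NAME", "Jean Martin": "NAME"}

masked = mask(
    "Marie Dupont forwarded the file to Jean Martin, who approved it.", entities
)
print(masked)  # [NAME] forwarded the file to [NAME], who approved it.
```

Both people collapse into the same marker, which is exactly the ambiguity described above.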

3. Consistent pseudonymization

Replace each entity with a stable, unique token (PERSON_42), so that every occurrence of the same person gets the same token across the whole corpus. This is what preserves relationships — hence answer utility. Before/after example:

# BEFORE
Marie Dupont (marie.dupont@acme.fr), legal director, negotiated the
contract with Acme Corp then had it signed by Jean Martin, her deputy.
Marie Dupont remains the point of contact for the case.

# AFTER (consistent pseudonymization)
PERSON_1 (EMAIL_1), ROLE_1, negotiated the contract with ORG_1 then
had it signed by PERSON_2, her deputy. PERSON_1 remains the point of
contact for the case.

The model keeps the structuring information: there are two distinct people, a reporting relationship, an organization, and PERSON_1 does appear twice. A question like "who is the contact for the Acme case?" returns PERSON_1, which the application can re-substitute with "Marie Dupont" at display time if and only if the user is authorized — via the mapping vault described below.
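The before/after example can be reproduced with a small token registry. This is a sketch under assumptions: entity detection is taken as given (the `entities` lists are hypothetical detector output), matching is by exact surface form, and the `Pseudonymizer` class is illustrative, not a reference implementation.

```python
from collections import defaultdict

class Pseudonymizer:
    """Assign one stable token per (type, value) pair across all documents."""

    def __init__(self):
        self.token_by_entity = {}   # ("PERSON", "marie dupont") -> "PERSON_1"
        self.value_by_token = {}    # "PERSON_1" -> "Marie Dupont" (goes to the vault)
        self._counters = defaultdict(int)

    def token_for(self, label, value):
        key = (label, value.casefold())
        if key not in self.token_by_entity:
            self._counters[label] += 1
            token = f"{label}_{self._counters[label]}"
            self.token_by_entity[key] = token
            self.value_by_token[token] = value
        return self.token_by_entity[key]

    def apply(self, text, entities):
        """entities: list of (label, surface form) found by an upstream detector."""
        # Assign tokens in detection order, then replace longest forms first.
        pairs = [(value, self.token_for(label, value)) for label, value in entities]
        for value, token in sorted(pairs, key=lambda p: len(p[0]), reverse=True):
            text = text.replace(value, token)
        return text

p = Pseudonymizer()
doc1 = p.apply(
    "Marie Dupont (marie.dupont@acme.fr), legal director, negotiated the "
    "contract with Acme Corp then had it signed by Jean Martin, her deputy. "
    "Marie Dupont remains the point of contact for the case.",
    [("PERSON", "Marie Dupont"), ("EMAIL", "marie.dupont@acme.fr"),
     ("ROLE", "legal director"), ("ORG", "Acme Corp"), ("PERSON", "Jean Martin")],
)
doc2 = p.apply(
    "Jean Martin called Marie Dupont.",
    [("PERSON", "Jean Martin"), ("PERSON", "Marie Dupont")],
)
print(doc2)  # PERSON_2 called PERSON_1.
```

The second document reuses the tokens assigned in the first, which is the property that keeps relationships intact across the corpus. Name variants ("M. Dupont", "Marie") would need a separate resolution step that this sketch does not attempt.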

4. Generalization

Replace a precise value with a category: "34 years old" → "30-40," "March 12, 2024" → "Q1 2024," "Lyon" → "a major French city." This reduces the singling-out risk at the cost of granularity. It is the ideal tool against the quasi-identifiers seen above, but apply it sparingly: generalizing "CFO of the Lyon subsidiary" into "senior executive" can strip the sentence of all analytical value.
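The three generalizations quoted above are simple to express as functions. The bucket width, the city lookup table, and the fallback label below are illustrative assumptions, not recommended settings.

```python
from datetime import date

def generalize_age(age, width=10):
    """Map an exact age to a bucket label; 34 falls in the bucket starting at 30."""
    low = (age // width) * width
    return f"{low}-{low + width}"

def generalize_date(d):
    """Map an exact date to its calendar quarter."""
    return f"Q{(d.month - 1) // 3 + 1} {d.year}"

# Hypothetical lookup; a real one would come from a curated reference list.
CITY_CLASSES = {"Lyon": "a major French city"}

def generalize_city(city):
    return CITY_CLASSES.get(city, "an undisclosed location")

print(generalize_age(34))                    # 30-40
print(generalize_date(date(2024, 3, 12)))    # Q1 2024
print(generalize_city("Lyon"))               # a major French city
```

Widening a bucket lowers the singling-out risk and lowers precision in equal measure, so the width is best chosen per field and per corpus.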

5. Encryption / mapping vault

Keep the table linking each token (PERSON_42) to its real value ("Marie Dupont") in a separate, encrypted store under strict access control, with its own retention and its own logging. This vault is what makes pseudonymization reversible — and its very existence is what keeps the data within GDPR scope.
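The access-control and logging side of the vault can be sketched as follows. This in-memory class is a stand-in only: encryption at rest, retention, and a real identity system are out of scope here, and the names `MappingVault`, `resolve` and `resubstitute` are hypothetical.

```python
import re

# Token shapes used in this article's examples.
TOKEN = re.compile(r"\b(?:PERSON|EMAIL|ROLE|ORG)_\d+\b")

class MappingVault:
    """In-memory stand-in for the separate, encrypted token store."""

    def __init__(self, mapping, authorized_users):
        self._mapping = dict(mapping)
        self._authorized = set(authorized_users)
        self.access_log = []  # (user, token, granted), kept for traceability

    def resolve(self, token, user):
        granted = user in self._authorized and token in self._mapping
        self.access_log.append((user, token, granted))
        # Unauthorized callers get the token back unchanged.
        return self._mapping[token] if granted else token

    def resubstitute(self, answer, user):
        """Swap tokens for real values at display time, one logged lookup each."""
        return TOKEN.sub(lambda m: self.resolve(m.group(), user), answer)

vault = MappingVault({"PERSON_1": "Marie Dupont"}, authorized_users={"alice"})
for_alice = vault.resubstitute("The contact is PERSON_1.", "alice")
for_bob = vault.resubstitute("The contact is PERSON_1.", "bob")
print(for_alice)  # The contact is Marie Dupont.
print(for_bob)    # The contact is PERSON_1.
```

Every lookup is logged whether or not it is granted, which is what makes later audits of re-identification possible.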

For RAG, consistent pseudonymization is often the best compromise: it preserves relationships between entities (hence answer utility) while removing direct identifiers from embeddings and prompts. Generalization complements it, neutralizing the quasi-identifiers NER cannot see.

The utility/protection trade-off, illustrated

There is no universal setting: the right dose depends on corpus and use. Take one passage at three treatment levels:

  • Raw: "Dr. Sophie Berger, oncologist at Bordeaux University Hospital, recommends protocol X for patients over 65." Maximum utility, zero protection.
  • Consistently pseudonymized: "PERSON_7, ROLE_3 at ORG_2, recommends protocol X for patients over 65." The medical recommendation and the age criterion stay usable; the doctor's identity leaves the embeddings.
  • Aggressively generalized: "A specialist recommends a protocol for certain patients." Strong protection, but the content is now useless for a medical RAG.

The common mistake is to believe "more masking = better." On a legal or medical corpus, over-anonymizing destroys the very value that justified the RAG project. The right approach: classify data by sensitivity, and apply the heaviest treatment only where it is warranted. Sometimes the best answer is not to anonymize more but to restrict access and control hosting.
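One way to encode "heaviest treatment only where warranted" is a small policy table keyed by sensitivity class. The class names and the treatments assigned to each are purely illustrative assumptions; the real mapping is a decision for the data owner.

```python
# Hypothetical policy: sensitivity class -> ordered list of treatments.
POLICY = {
    "public": [],
    "internal": ["pseudonymize"],
    "sensitive": ["pseudonymize", "generalize"],
}

def treatments_for(document):
    """Return the treatments for a document, defaulting to the strictest tier."""
    # An unknown or missing class fails closed, not open.
    return POLICY.get(document.get("sensitivity"), POLICY["sensitive"])

print(treatments_for({"sensitivity": "internal"}))  # ['pseudonymize']
print(treatments_for({}))                           # ['pseudonymize', 'generalize']
```

Defaulting unclassified documents to the strictest tier trades some utility for safety until someone classifies them explicitly.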

Limits to face honestly

  • Residual re-identification risk — no NER catches everything, and context can betray an identity even without an explicit identifier. Singling out by cross-matching chunks is a real blind spot.
  • Pseudonymization is not anonymization — as long as a mapping table exists (or could be reconstructed), you process personal data. Don't kid yourself about your GDPR exposure: a pseudonymized corpus is still subject to access, rectification and erasure rights.
  • The utility/protection trade-off — on some corpora, removing personal data empties the content of meaning. The setting is an accepted compromise, not an absolute.

The leak point everyone forgets: the prompts

We often secure ingestion while forgetting the other end of the chain. At answer time, a RAG retrieves chunks and pastes them into a prompt sent to the generation model. If that model is a non-EU API, you have just exported personal data — even if your vector store is perfectly hosted in Europe.

User query
   → vector search (pseudonymized chunks, or not?)
   → prompt assembly
   → LLM API CALL  ← leak point if non-EU and chunks not pseudonymized
   → answer

Two practical consequences. First, if your chunks still contain personal data, every query that surfaces them transmits them to the LLM provider, with all the cross-border-transfer and reuse questions that raises. Second, this is exactly where consistent pseudonymization pays off: by pseudonymizing before vectorization, the retrieved chunks carry only tokens, the external LLM never sees an identity, and re-substitution happens application-side, after the answer, under authorization control. The rule is simple: nothing identifying should cross the boundary of the controlled perimeter.
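The boundary rule can be enforced with a last-line check just before the external call. In this sketch, `call_llm` is a hypothetical callable standing in for whatever client you use, and `known_values` is assumed to be the set of real values held in the mapping vault. The check only catches exact, known values, so it complements pseudonymization and does not replace it.

```python
def build_prompt(question, chunks):
    context = "\n".join(f"- {chunk}" for chunk in chunks)
    return f"Context:\n{context}\n\nQuestion: {question}"

def guard_outbound(prompt, known_values):
    """Refuse to send a prompt that still contains a value held in the vault."""
    lowered = prompt.casefold()
    leaked = [value for value in known_values if value.casefold() in lowered]
    if leaked:
        # Report a count only, so the error itself never carries personal data.
        raise ValueError(f"{len(leaked)} identifying value(s) in outbound prompt")
    return prompt

def answer(question, chunks, known_values, call_llm):
    prompt = guard_outbound(build_prompt(question, chunks), known_values)
    return call_llm(prompt)

known_values = {"Marie Dupont", "marie.dupont@acme.fr"}
fake_llm = lambda prompt: "PERSON_1"  # stand-in for the external model

ok = answer("Who is the contact?", ["PERSON_1 remains the point of contact."],
            known_values, fake_llm)

try:
    answer("Who is the contact?", ["Marie Dupont remains the point of contact."],
           known_values, fake_llm)
    blocked = False
except ValueError:
    blocked = True

print(ok, blocked)  # PERSON_1 True
```

Because the guard runs on the assembled prompt, it also covers the user's question, which can carry a name just as easily as a retrieved chunk can.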

Good reflexes

  • Minimize at the source — the best personal data is the data you don't ingest. Index only what is useful; exclude fields and documents with no RAG value.
  • Pseudonymize before any send to an external model — the most common leak point: personal data in prompts heading to a non-EU API.
  • Keep the mapping table isolated — separate, encrypted, strictly access-controlled, with its own retention and its own traceability.
  • Address the cross-matching risk — audit what your RAG surfaces together, not just document by document. Generalize the quasi-identifiers that survive NER.
  • Document the processing — which technique, residual risk, legal basis. Expected by GDPR and useful in an audit.
  • Combine with sovereignty — anonymizing does not excuse poor hosting. For the most sensitive data, controlling the infrastructure remains the ultimate protection.

Conclusion

Anonymizing before vectorization is a good intention that meets a technical reality: on rich text, perfect anonymization is rare, and consistent pseudonymization is the realistic compromise. The key is not to lie to yourself about the real level of protection, to address context-based re-identification as much as direct identifiers, to systematically pseudonymize what leaves your perimeter, and to cover the rest with access control and sovereign hosting. Protecting personal data in a RAG is not a checkbox: it is a chain of decisions, from ingestion to inference.

