March 2, 2026

What Is Data De-identification? (Definition, Methods, and Compliance)

This guide explores the critical process of data de-identification, a regulatory necessity for organizations handling sensitive PII, PHI, and PCI data. It details the methodologies required to transform identifiable datasets into compliant, low-risk information suitable for research, AI training, and analytics.


A single improperly shared patient record can trigger a HIPAA breach investigation, expose an organization to fines with a statutory cap of $1.5 million per violation category per year, adjusted annually for inflation (AMA), and permanently damage the trust of the people that data was meant to protect.

Yet, healthcare organizations, pharma companies, and financial institutions sit on huge reserves of sensitive data they need for analytics, AI training, and research—data they cannot legally share or use without taking deliberate, documented steps to protect it.

Data de-identification is how they do it legally and safely.

What is data de-identification? Data de-identification is the process of removing or transforming personally identifiable information (PII), protected health information (PHI), and payment card industry (PCI) data in a dataset so that individuals cannot reasonably be identified from it. When done correctly and documented to regulatory standards, de-identified data can be used, shared, and analyzed without triggering the privacy protections that apply to identifiable records.

The gap between "done correctly" and "close enough" is where compliance risk lives. General-purpose tools achieve as low as 60% accuracy on real healthcare data (Bressem et al., 2024). That means up to 40% of sensitive identifiers may survive de-identification untouched—a fact that regulators, researchers, and legal teams cannot afford to ignore.

This guide covers what data de-identification is, how it works under major regulatory frameworks, which methods apply in which contexts, and what organizations need to get right.

What qualifies as identifiable data? PII, PHI, and PCI defined

Before choosing a de-identification method, you need to know what you're de-identifying. Regulators define "identifiable" differently depending on the framework.

PII (Personally identifiable information)

PII is any information that can be used—alone or in combination—to identify a specific individual. The definition appears across U.S. federal guidance (notably NIST SP 800-122) and underpins most privacy regulations, including state laws such as the CPRA; the GDPR's concept of "personal data" is closely analogous.

Common PII examples:

  • Full name, alias, or username
  • Home address, email address, phone number
  • Social Security number, passport number, driver's license
  • IP addresses, device identifiers, cookie IDs
  • Biometric data (fingerprints, facial geometry)
  • Geolocation data

PHI (Protected health information)

PHI is the subset of PII that relates to health. The term was coined by HIPAA, but equivalent concepts exist in many jurisdictions globally. PHI includes any health information created, received, or maintained by a covered entity or business associate that relates to an individual's past, present, or future health condition, healthcare services, or payment for those services, and that contains one or more of 18 specific identifiers.

HIPAA's 18 PHI Identifiers:

Category | Examples
Names | Full name, first name alone (if rare enough to identify)
Geographic data | Street address, city, county, precinct, geocodes; ZIP codes (only the first three digits may be retained, and only where that area contains more than 20,000 people)
Dates | Birth dates, admission dates, discharge dates, death dates, and all ages over 89
Phone numbers | All telephone numbers
Fax numbers | All fax numbers
Email addresses | All electronic mail addresses
Social Security numbers | SSNs
Medical record numbers | Any medical record identifier
Health plan beneficiary numbers | Insurance member/beneficiary IDs
Account numbers | Financial account numbers tied to health records
Certificate/license numbers | Any license or certificate number
Vehicle identifiers | Serial numbers, license plate numbers
Device identifiers | Serial numbers, IMEI numbers
URLs | Web addresses associated with individuals
IP addresses | Internet Protocol addresses associated with individuals
Biometric identifiers | Finger and voice prints
Full-face photographs | Photos or comparable images that could identify an individual
Any unique identifying number or code | Any characteristic or code not listed above that could identify a person

PCI (Payment card industry data)

PCI data refers to payment card information governed by the PCI DSS (Payment Card Industry Data Security Standard). It includes:

  • Primary account numbers (PANs)
  • Cardholder names
  • Expiration dates
  • Service codes
  • Sensitive authentication data

De-identifying PCI data—particularly in contact center call recordings or chat logs—is increasingly a compliance requirement for financial services organizations.

What is data de-identification? The regulatory definitions that matter

"De-identification" is not a generic technical term. It carries specific legal meaning under different frameworks. The method you use must meet the standard required by the regulation that governs your data.

De-identification under HIPAA

HIPAA provides two legally recognized pathways for de-identifying PHI. Meeting either standard means the data is no longer classified as PHI, which means HIPAA's Privacy Rule no longer applies to it.

Method 1: Safe harbor

The Safe Harbor method requires removing all 18 identifiers listed above. In addition, the covered entity must have no actual knowledge that the remaining information could be used to identify an individual. Safe Harbor is rules-based and auditable. You either removed the identifiers or you didn't.

Method 2: Expert determination

Expert Determination requires a qualified statistician or expert to apply generally accepted principles to determine that the risk of identifying any individual in the dataset is "very small." The expert must document the analysis and the supporting methods. This approach is more flexible but requires a real expert, real documentation, and a process you could defend in an audit.

Most enterprise compliance programs rely on Expert Determination when they need to preserve more data utility, such as retaining five-digit ZIP codes or more granular date ranges for epidemiological research. Safe Harbor, by contrast, is typically faster to implement but produces lower-utility data.

De-identification under GDPR

GDPR uses the concepts of "pseudonymization" and "anonymization" rather than a formal de-identification standard. Truly anonymized data—where re-identification is not reasonably possible—falls outside GDPR's scope entirely. Pseudonymized data (where identifying elements are replaced but a key exists to reverse the process) still qualifies as personal data under GDPR and remains subject to its protections.

For practical purposes, the bar for organizations processing data under GDPR is whether re-identification is reasonably possible, considering the technology and resources available at the time of processing.

De-identification under CPRA / CCPA

The California Privacy Rights Act (CPRA) introduced "deidentified" information as a formal category. To qualify, the data must not be reasonably capable of being linked back to a specific consumer; the business must implement technical and administrative safeguards against re-identification; and the business must publicly commit to not re-identifying the data. CPRA also prohibits businesses from attempting to re-identify previously de-identified data.

Methods of data de-identification

The method you choose should match your regulatory framework, your data type, and your downstream use case. Here are the primary techniques used in practice.

1. Data redaction

Redaction removes or blacks out sensitive information entirely, replacing it with a placeholder (e.g., [REDACTED] or █████). This is the most conservative approach and preserves no information value from the removed element.

Best for: Legal documents, regulatory submissions, public records fulfillment, any context where retaining data utility from the removed field is not required.

Limitation: Destroys data utility. A dataset of redacted medical notes tells you nothing about the patient population.
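As a minimal sketch, pattern-based redaction of a free-text note might look like the following. The patterns and labels are illustrative, and regex alone is not sufficient in practice: note that the name "John" below survives, which is exactly the class of identifier that requires NLP-based entity recognition.

```python
import re

# Illustrative patterns only; production redaction needs NLP-based
# entity recognition, not regex lists.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

note = "Contact John at john.doe@example.com or 555-123-4567. SSN 123-45-6789."
print(redact(note))
# Contact John at [REDACTED:EMAIL] or [REDACTED:PHONE]. SSN [REDACTED:SSN].
```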

2. Data masking

Masking replaces sensitive values with realistic but fictitious substitutes—a real-looking name, address, or account number that points to no actual individual. Masked data maintains structural integrity (it looks real) without exposing genuine identifiers.

Best for: Software testing, developer environments, QA pipelines where systems need realistic-looking data to function but must not expose real records.

Types of masking:

  • Static masking: Data is masked once, permanently
  • Dynamic masking: Data is masked at query time based on the requesting user's role
  • Format-preserving masking: The masked value retains the format of the original (e.g., a 16-digit card number is replaced with another 16-digit number)
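A rough sketch of format-preserving masking: derive a deterministic 16-digit stand-in from a card number. This hash-based approach is for illustration only; real deployments should use a vetted format-preserving encryption scheme such as FF1 (NIST SP 800-38G).

```python
import hashlib

def mask_card_number(pan: str, secret: str) -> str:
    """Format-preserving mask: maps a 16-digit PAN to another 16-digit
    string, deterministically per (secret, pan) pair."""
    digest = hashlib.sha256((secret + pan).encode()).hexdigest()
    # Fold each hex character down to a decimal digit, keep original length
    return "".join(str(int(c, 16) % 10) for c in digest)[: len(pan)]

masked = mask_card_number("4111111111111111", secret="demo-key")
print(masked)  # a 16-digit value with the same shape as the original PAN
```

Because the output is deterministic, the same card masks to the same value, which keeps joins and tests working in non-production environments.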

3. Pseudonymization

Pseudonymization replaces direct identifiers with artificial identifiers (pseudonyms or tokens), while retaining a mapping key that can in principle reverse the process. Unlike full anonymization, pseudonymized data is still considered personal data under GDPR.

Best for: Clinical trials, longitudinal research, internal analytics where you need to track the same individual across records without exposing their identity to downstream teams.

Key requirement: The mapping key must be stored separately from the pseudonymized data, under strict access controls.
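A minimal pseudonymization sketch using a keyed HMAC: the same individual always receives the same token, so records remain linkable without exposing identity. The hard-coded key is purely illustrative; in practice the key lives in a separate, access-controlled key store.

```python
import hmac
import hashlib

# Must be stored separately from the pseudonymized data; a plain
# constant here is for illustration only.
MAPPING_KEY = b"stored-in-a-separate-key-vault"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable token via keyed HMAC so the same patient maps
    to the same pseudonym across records."""
    return hmac.new(MAPPING_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymize("MRN-0042")
print(token)  # a stable 16-character hex token
```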

4. Generalization

Generalization reduces the precision of data rather than removing it—replacing an exact birth date with a birth year, a precise address with a ZIP code, or a specific diagnosis with a broader disease category. The result preserves statistical utility at the cost of individual precision.

Best for: Population health analytics, public health datasets, research that needs demographic or clinical signals without individual-level granularity.
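A small sketch of generalization over a single record (field names are illustrative): exact birth date becomes a year, a five-digit ZIP becomes its first three digits, and an exact age becomes a five-year band.

```python
from datetime import date

def generalize(record: dict) -> dict:
    """Reduce precision of quasi-identifiers while keeping the
    clinically useful fields intact."""
    out = dict(record)
    out["birth_year"] = out.pop("birth_date").year
    out["zip3"] = out.pop("zip")[:3]
    lo = (out.pop("age") // 5) * 5
    out["age_band"] = f"{lo}-{lo + 4}"
    return out

rec = {"birth_date": date(1987, 6, 3), "zip": "02139", "age": 37, "diagnosis": "J45"}
print(generalize(rec))
# {'diagnosis': 'J45', 'birth_year': 1987, 'zip3': '021', 'age_band': '35-39'}
```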

5. Data suppression

Suppression removes entire records or specific fields that pose an outsized re-identification risk. For example, removing records for patients with extremely rare diseases who could be identified by their diagnosis alone, even without direct identifiers.

Best for: Any de-identification workflow where outlier records create unacceptable re-identification risk. Suppression is typically used alongside other methods, not in isolation.
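In sketch form, suppression often means dropping records whose quasi-identifier combination is too rare, a k-anonymity-style threshold (the field names and k value below are illustrative):

```python
from collections import Counter

def suppress_rare(records, quasi_ids, k=5):
    """Drop any record whose quasi-identifier combination occurs fewer
    than k times; such outliers are the easiest to re-identify."""
    combo = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] >= k]

data = [{"zip3": "021", "age_band": "35-39"}] * 6 + [{"zip3": "990", "age_band": "90+"}]
kept = suppress_rare(data, ["zip3", "age_band"], k=5)
print(len(kept))  # 6 -- the single outlier record is suppressed
```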

6. Noise addition / perturbation

Perturbation adds controlled statistical noise to numerical values, slightly altering ages, income figures, or lab values in ways that are indistinguishable at the individual level but preserve aggregate statistical distributions. Common in statistical disclosure limitation (SDL) for public datasets.

Best for: Published datasets, census data, financial aggregates where population-level trends are needed but individual-level precision must be eliminated.
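A simple perturbation sketch, adding Laplace-distributed noise via inverse-CDF sampling (the scale parameter is illustrative; calibrating it to a formal privacy budget, as in differential privacy, is a separate exercise):

```python
import math
import random

def perturb(values, scale=2.0, seed=0):
    """Add Laplace-distributed noise to each value. Individual numbers
    shift slightly; aggregates over large samples are roughly preserved."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        u = rng.random() - 0.5
        # Inverse CDF of the Laplace distribution centered at 0
        noisy.append(v - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)))
    return noisy

ages = [34, 51, 47, 29, 62]
print(perturb(ages))  # each age nudged by a small random amount
```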

Choosing the right method: A quick reference

Use Case Recommended Method(s)
Clinical trial data sharing Pseudonymization + Suppression
Public research dataset publication Expert Determination + Generalization + Noise Addition
Software testing / QA Data Masking
Legal discovery fulfillment Redaction
Contact center call recording compliance Redaction or Masking of PCI/PII in transcripts
HIPAA Safe Harbor compliance Removal of all 18 identifiers
GDPR anonymization Combination approach with re-identification risk assessment

How to de-identify data: A step-by-step process

De-identification is not a single action. It is a documented workflow. For HIPAA Expert Determination in particular, the process must be auditable.

1. Inventory your data. Map where sensitive data lives: structured databases, unstructured documents, call recordings, clinical notes, chat logs, scanned forms. You cannot de-identify what you have not found.

2. Classify the data. Determine which regulatory framework applies (HIPAA, GDPR, CPRA, PCI DSS) and which entity types are present (PHI, PII, PCI). Different fields may fall under different standards within a single dataset.

3. Select the appropriate method(s). Match method to use case (see table above). Complex datasets typically require combining methods—generalization for dates, redaction for free-text fields, suppression for rare-condition outliers.

4. Apply de-identification. Execute the de-identification process. For unstructured text—clinical notes, call transcripts, chat logs, intake forms—this requires NLP-based entity recognition capable of identifying sensitive information in natural language, not just structured fields.

5. Assess residual re-identification risk. Evaluate whether the de-identified output can be re-identified, either through the remaining data itself or through linkage with external datasets. This step is required for HIPAA Expert Determination and GDPR anonymization.

6. Document the process. Record the method used, the expert or system that performed the analysis, the date, and the residual risk determination. This documentation is what makes your de-identification defensible in an audit or investigation.

7. Govern ongoing use. De-identification is not a one-time event. Establish policies governing who can access de-identified data, for what purposes, and with what safeguards against re-identification.
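Taken together, steps 3 through 6 can be sketched as a single auditable pipeline. Everything here is a placeholder: `detect`, `transform`, and `assess_risk` stand in for real entity-recognition and risk tooling, and the audit record is deliberately minimal.

```python
import datetime

def deidentify_dataset(records, detect, transform, assess_risk, method_name):
    """Run detection and transformation over every record, assess
    residual risk, and emit the audit record that makes the run
    defensible later."""
    processed = [transform(r, detect(r)) for r in records]
    audit_log = {
        "method": method_name,
        "date": datetime.date.today().isoformat(),
        "record_count": len(processed),
        "residual_risk": assess_risk(processed),
    }
    return processed, audit_log

out, log = deidentify_dataset(
    [{"note": "SSN 123-45-6789"}],
    detect=lambda r: ["SSN"],                       # stub detector
    transform=lambda r, entities: {"note": "[REDACTED]"},  # stub transform
    assess_risk=lambda rs: "very small",            # stub risk assessment
    method_name="safe_harbor_redaction",
)
print(log["residual_risk"])  # very small
```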

PII de-identification vs. PHI de-identification: Key differences

While PII and PHI de-identification share the same goal—removing individual identifiers—they differ in regulatory specificity, required rigor, and consequences for failure.

Dimension | PII De-identification | PHI De-identification
Governing framework | GDPR, CPRA, state laws (varies) | HIPAA (federal)
Defined identifier list | No universal list; context-dependent | Yes: 18 specific identifiers
Formal compliance pathway | No equivalent to Safe Harbor | Safe Harbor and Expert Determination
Required documentation | Varies by framework | Expert Determination requires written analysis
Penalty for breach | Varies widely by state/country | Statutory cap of $1.5M per violation category per year (inflation-adjusted)
Re-identification prohibition | CPRA explicitly prohibits; GDPR implied | No explicit HIPAA prohibition, but re-identified data becomes PHI again

For organizations operating across jurisdictions (a pharma company running trials in the U.S. and EU, for example) both sets of requirements apply simultaneously, which means de-identification workflows must satisfy the stricter of the two standards at every step.

Common de-identification failures (and why they happen)

Even well-intentioned de-identification programs fail. Understanding how is the first step to preventing it.

1. Missing indirect identifiers A dataset with name and SSN removed can still be re-identified if it contains a combination of ZIP code, birth date, and sex—a combination that uniquely identifies 87% of the U.S. population, according to research by Latanya Sweeney (Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, Data Privacy Working Paper 3, 2000). De-identification tools that focus only on direct identifiers miss this class of risk entirely.

2. Inconsistent handling of unstructured text Clinical notes, discharge summaries, and call transcripts contain sensitive information in natural language—embedded in sentences, misspelled, abbreviated, or expressed across multiple lines. Structured-field redaction tools do not catch these. NLP-based entity recognition is required.

3. Low-accuracy tooling General-purpose language models and cloud NLP tools achieve as low as 60% accuracy on real healthcare data (Bressem et al., 2024). That is not a minor shortfall: it means up to four in ten sensitive identifiers may survive de-identification. At scale, in a dataset of 100,000 clinical notes, that is tens of thousands of exposed identifiers.

4. Missing documentation The de-identification process happened but was not recorded. Without a documented analysis, Expert Determination cannot be asserted in an audit. The work is invisible to regulators.

5. Re-identification through data linkage De-identified data is released and later re-identified by joining it with a publicly available dataset, a known attack vector. Risk assessment must account for what external data is available, not just what is in the dataset itself.
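Failure modes 1 and 5 are both quasi-identifier problems, and the exposure is measurable. A minimal sketch (field names illustrative): count how many records carry a quasi-identifier combination that appears only once in the dataset, since each such record is a re-identification candidate even with all direct identifiers removed.

```python
from collections import Counter

def unique_fraction(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique
    in the dataset -- a simple proxy for re-identification exposure."""
    combo = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(combo(r) for r in records)
    return sum(1 for r in records if counts[combo(r)] == 1) / len(records)

people = [
    {"zip": "02139", "dob": "1987-06-03", "sex": "F"},
    {"zip": "02139", "dob": "1990-01-15", "sex": "M"},
    {"zip": "02139", "dob": "1990-01-15", "sex": "M"},
]
print(unique_fraction(people, ["zip", "dob", "sex"]))  # one of three records is unique
```

A full Expert Determination risk analysis also weighs population size and plausible external linkage datasets, not just in-dataset uniqueness.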

De-identify data HIPAA: What expert determination actually requires

HIPAA Expert Determination is frequently cited but poorly understood in practice. Here is what the standard actually demands.

To satisfy Expert Determination, a covered entity must:

  • Engage a qualified person: someone with knowledge of and experience applying accepted principles of statistical and scientific methods to render information not individually identifiable
  • Apply those methods: not simply assert that data is de-identified
  • Determine that risk is very small: "very small" is the regulatory standard; there is no defined numerical threshold, which means the expert must exercise and document genuine judgment
  • Document the analysis: the determination must include the methods and results of the analysis
  • Retain the documentation: the covered entity must keep the expert's analysis for potential regulatory review

Expert Determination gives organizations more flexibility than Safe Harbor (for example, retaining age over 89 as a single category rather than redacting it), but it requires genuine statistical expertise and a documented, reproducible process.

Platforms purpose-built for PHI de-identification—like Limina—are designed to produce auditor-ready Expert Determination reports, including re-identification risk probabilities across both direct and quasi-identifiers, rather than requiring compliance teams to reconstruct the analysis after the fact.

Ready to de-identify your data to regulatory standards?

Most de-identification failures come down to three things: incomplete identifier coverage, low-accuracy tooling, and missing documentation. Limina is built to solve all three—with 99.96% expert determination accuracy on real healthcare data, support for 52 languages and 50+ entity types across unstructured text, and outputs designed to support Expert Determination documentation and HIPAA, GDPR, and CPRA compliance.

https://getlimina.ai/en/contact-us — see how enterprise-grade de-identification works on your actual data.

Read the complete data de-identification guide — our parent resource covering HIPAA Expert Determination, GDPR anonymization, and de-identification strategy for regulated industries.

Frequently Asked Questions

What is the difference between de-identification and anonymization?

De-identification and anonymization are often used interchangeably, but they carry different legal meanings depending on the framework. Under HIPAA, "de-identification" is the formal standard with two defined pathways (Safe Harbor and Expert Determination). Under GDPR, "anonymization" describes data where re-identification is not reasonably possible and the regulation no longer applies. Pseudonymized data — where a mapping key exists — is not anonymous under GDPR and remains subject to its protections.

Is de-identified data still subject to HIPAA?

No. Data that has been properly de-identified under either the Safe Harbor or Expert Determination method is no longer considered PHI and is not subject to the HIPAA Privacy Rule. However, if de-identified data is later re-identified — intentionally or through linkage — it becomes PHI again and HIPAA protections reattach.

What are the 18 HIPAA identifiers that must be removed for safe harbor de-identification?

The 18 identifiers include names; geographic data smaller than a state; all elements of dates (except year) directly related to an individual, plus all ages over 89; phone and fax numbers; email addresses; Social Security numbers; medical record and health plan beneficiary numbers; account numbers; certificate or license numbers; vehicle and device identifiers; URLs; IP addresses; biometric identifiers including fingerprints and voice prints; full-face photographs; and any other unique identifier or code. All 18 must be removed, and the covered entity must have no actual knowledge that the remaining data could identify an individual.

Can you de-identify unstructured text like clinical notes or call recordings?

Yes, but it requires more sophisticated tooling than structured-data de-identification. Clinical notes, discharge summaries, chat logs, and call transcripts contain sensitive information embedded in natural language — context-dependent, often abbreviated, and sometimes misspelled. Effective de-identification of unstructured text requires NLP-based named entity recognition trained on domain-specific data, not generic pattern matching or keyword lists.

What is the difference between data masking and data de-identification?

Data masking replaces sensitive values with realistic fictitious substitutes, primarily to enable safe use in non-production environments like software testing. Data de-identification is a broader regulatory concept covering any process that makes data non-identifiable to a defined legal standard. Masking is one technique used within a de-identification program, but de-identification for HIPAA or GDPR compliance requires meeting a specific legal standard — not simply substituting values.

How accurate does de-identification need to be for HIPAA compliance?

HIPAA does not specify a numerical accuracy threshold for de-identification. Safe Harbor requires that all 18 identifiers be removed and that no actual knowledge of re-identification exists. Expert Determination requires that risk of identification be "very small" — a qualitative standard that the qualified expert must assess and document. In practice, the higher the accuracy of the de-identification tool, the stronger the defensible position in an audit or investigation.

What is re-identification risk, and how is it assessed?

Re-identification risk is the probability that an individual in a de-identified dataset could be identified by an adversary with access to external information. It is assessed by considering the attributes remaining in the dataset (particularly quasi-identifiers like age, sex, and ZIP code), the population size, and the external datasets an adversary could plausibly use for linkage. Risk assessment is a required element of HIPAA Expert Determination and GDPR anonymization analysis, and it must be documented.