March 2, 2026

What Is Data De-identification? (Definition, Methods, and Compliance)

This guide explores the critical process of data de-identification, a regulatory necessity for organizations handling sensitive PII, PHI, and PCI data. It details the methodologies required to transform identifiable datasets into compliant, low-risk information suitable for research, AI training, and analytics.


A single improperly shared patient record can trigger a HIPAA breach investigation, expose an organization to fines with a statutory cap of $1.5 million per violation category per year, adjusted annually for inflation (AMA), and permanently damage the trust of the people that data was meant to protect.

Yet, healthcare organizations, pharma companies, and financial institutions sit on huge reserves of sensitive data they need for analytics, AI training, and research—data they cannot legally share or use without taking deliberate, documented steps to protect it.

Data de-identification is how they do it legally and safely.

What is data de-identification? Data de-identification is the process of removing or transforming personally identifiable information (PII), protected health information (PHI), and payment card industry (PCI) data in a dataset so that individuals cannot reasonably be identified from it. When done correctly and documented to regulatory standards, de-identified data can be used, shared, and analyzed without triggering the privacy protections that apply to identifiable records.

The gap between "done correctly" and "close enough" is where compliance risk lives. General-purpose tools achieve as low as 60% accuracy on real healthcare data (Bressem et al., 2024). That means up to 40% of sensitive identifiers may survive de-identification untouched—a fact that regulators, researchers, and legal teams cannot afford to ignore.

This guide covers what data de-identification is, how it works under major regulatory frameworks, which methods apply in which contexts, and what organizations need to get right.

What qualifies as identifiable data? PII, PHI, and PCI defined

Before choosing a de-identification method, you need to know what you're de-identifying. Regulators define "identifiable" differently depending on the framework.

PII (Personally identifiable information)

PII is any information that can be used—alone or in combination—to identify a specific individual. The definition appears across U.S. federal guidance (notably NIST SP 800-122) and underpins most privacy regulations, including state laws such as the CPRA; the GDPR's concept of "personal data" is closely analogous.

Common PII examples:

  • Full name, alias, or username
  • Home address, email address, phone number
  • Social Security number, passport number, driver's license
  • IP addresses, device identifiers, cookie IDs
  • Biometric data (fingerprints, facial geometry)
  • Geolocation data

PHI (Protected health information)

PHI is the subset of PII that relates to health. The term was coined by HIPAA, but equivalent concepts exist in many jurisdictions globally. PHI includes any health information created, received, or maintained by a covered entity or business associate that relates to an individual's past, present, or future health condition, healthcare services, or payment for those services, and that contains one or more of 18 specific identifiers.

HIPAA's 18 PHI Identifiers:

Category | Examples
Names | Full name, first name alone (if rare enough to identify)
Geographic data | Street address, city, county, precinct, geocodes; ZIP codes (only the first three digits may be retained, and only where that area contains more than 20,000 people)
Dates | Birth dates, admission dates, discharge dates, death dates, and all ages over 89
Phone numbers | All telephone numbers
Fax numbers | All fax numbers
Email addresses | All electronic mail addresses
Social Security numbers | SSNs
Medical record numbers | Any medical record identifier
Health plan beneficiary numbers | Insurance member/beneficiary IDs
Account numbers | Financial account numbers tied to health records
Certificate/license numbers | Any license or certificate number
Vehicle identifiers | Serial numbers, license plate numbers
Device identifiers | Serial numbers, IMEI numbers
URLs | Web addresses associated with individuals
IP addresses | Internet Protocol addresses associated with individuals
Biometric identifiers | Finger and voice prints
Full-face photographs | Photos or comparable images that could identify an individual
Any unique identifying number or code | Any characteristic or code not listed above that could identify a person

PCI (Payment card industry data)

PCI data refers to payment card information governed by the PCI DSS (Payment Card Industry Data Security Standard). It includes:

  • Primary account numbers (PANs)
  • Cardholder names
  • Expiration dates
  • Service codes
  • Sensitive authentication data

De-identifying PCI data—particularly in contact center call recordings or chat logs—is increasingly a compliance requirement for financial services organizations.

What is data de-identification? The regulatory definitions that matter

"De-identification" is not a generic technical term. It carries specific legal meaning under different frameworks. The method you use must meet the standard required by the regulation that governs your data.

De-identification under HIPAA

HIPAA provides two legally recognized pathways for de-identifying PHI. Meeting either standard means the data is no longer classified as PHI, which means HIPAA's Privacy Rule no longer applies to it.

Method 1: Safe harbor

The Safe Harbor method requires removing all 18 identifiers listed above. In addition, the covered entity must have no actual knowledge that the remaining information could be used to identify an individual. Safe Harbor is rules-based and auditable. You either removed the identifiers or you didn't.

Method 2: Expert determination

Expert Determination requires a qualified statistician or expert to apply generally accepted principles to determine that the risk of identifying any individual in the dataset is "very small." The expert must document the analysis and the supporting methods. This approach is more flexible but requires a real expert, real documentation, and a process you could defend in an audit.

Most enterprise compliance programs rely on Expert Determination when they need to preserve more data utility, such as retaining five-digit ZIP codes or more granular date ranges for epidemiological research. Safe Harbor, by contrast, is typically faster to implement but produces lower-utility data.

De-identification under GDPR

GDPR uses the concepts of "pseudonymization" and "anonymization" rather than a formal de-identification standard. Truly anonymized data—where re-identification is not reasonably possible—falls outside GDPR's scope entirely. Pseudonymized data (where identifying elements are replaced but a key exists to reverse the process) still qualifies as personal data under GDPR and remains subject to its protections.

For practical purposes, the bar for organizations processing data under GDPR is whether re-identification is reasonably possible, considering the technology and resources available at the time of processing.

De-identification under CPRA / CCPA

The California Privacy Rights Act (CPRA) introduced "deidentified" information as a formal category. To qualify, the data must not be reasonably capable of being linked back to a specific consumer; the business must implement technical and administrative safeguards against re-identification; and the business must publicly commit to not re-identifying the data. CPRA also prohibits businesses from attempting to re-identify previously de-identified data.

Methods of data de-identification

The method you choose should match your regulatory framework, your data type, and your downstream use case. Here are the primary techniques used in practice.

1. Data redaction

Redaction removes or blacks out sensitive information entirely, replacing it with a placeholder (e.g., [REDACTED] or █████). This is the most conservative approach and preserves no information value from the removed element.

Best for: Legal documents, regulatory submissions, public records fulfillment, any context where retaining data utility from the removed field is not required.

Limitation: Destroys data utility. A dataset of redacted medical notes tells you nothing about the patient population.
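As a minimal sketch, pattern-based redaction of a free-text note might look like the following. The patterns and labels are illustrative, and regex alone is not sufficient in practice: note that the name "John" below survives, which is exactly the class of identifier that requires NLP-based entity recognition.

```python
import re

# Illustrative patterns only; production redaction needs NLP-based
# entity recognition, not regex lists.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace each matched identifier with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[REDACTED:{label}]", text)
    return text

note = "Contact John at john.doe@example.com or 555-123-4567. SSN 123-45-6789."
print(redact(note))
# Contact John at [REDACTED:EMAIL] or [REDACTED:PHONE]. SSN [REDACTED:SSN].
```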

2. Data masking

Masking replaces sensitive values with realistic but fictitious substitutes—a real-looking name, address, or account number that points to no actual individual. Masked data maintains structural integrity (it looks real) without exposing genuine identifiers.

Best for: Software testing, developer environments, QA pipelines where systems need realistic-looking data to function but must not expose real records.

Types of masking:

  • Static masking: Data is masked once, permanently
  • Dynamic masking: Data is masked at query time based on the requesting user's role
  • Format-preserving masking: The masked value retains the format of the original (e.g., a 16-digit card number is replaced with another 16-digit number)
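A rough sketch of format-preserving masking: derive a deterministic 16-digit stand-in from a card number. This hash-based approach is for illustration only; real deployments should use a vetted format-preserving encryption scheme such as FF1 (NIST SP 800-38G).

```python
import hashlib

def mask_card_number(pan: str, secret: str) -> str:
    """Format-preserving mask: maps a 16-digit PAN to another 16-digit
    string, deterministically per (secret, pan) pair."""
    digest = hashlib.sha256((secret + pan).encode()).hexdigest()
    # Fold each hex character down to a decimal digit, keep original length
    return "".join(str(int(c, 16) % 10) for c in digest)[: len(pan)]

masked = mask_card_number("4111111111111111", secret="demo-key")
print(masked)  # a 16-digit value with the same shape as the original PAN
```

Because the output is deterministic, the same card masks to the same value, which keeps joins and tests working in non-production environments.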

3. Pseudonymization

Pseudonymization replaces direct identifiers with artificial identifiers (pseudonyms or tokens), while retaining a mapping key that can in principle reverse the process. Unlike full anonymization, pseudonymized data is still considered personal data under GDPR.

Best for: Clinical trials, longitudinal research, internal analytics where you need to track the same individual across records without exposing their identity to downstream teams.

Key requirement: The mapping key must be stored separately from the pseudonymized data, under strict access controls.
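A minimal pseudonymization sketch using a keyed HMAC: the same individual always receives the same token, so records remain linkable without exposing identity. The hard-coded key is purely illustrative; in practice the key lives in a separate, access-controlled key store.

```python
import hmac
import hashlib

# Must be stored separately from the pseudonymized data; a plain
# constant here is for illustration only.
MAPPING_KEY = b"stored-in-a-separate-key-vault"

def pseudonymize(patient_id: str) -> str:
    """Derive a stable token via keyed HMAC so the same patient maps
    to the same pseudonym across records."""
    return hmac.new(MAPPING_KEY, patient_id.encode(), hashlib.sha256).hexdigest()[:16]

token = pseudonymize("MRN-0042")
print(token)  # a stable 16-character hex token
```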

4. Generalization

Generalization reduces the precision of data rather than removing it—replacing an exact birth date with a birth year, a precise address with a ZIP code, or a specific diagnosis with a broader disease category. The result preserves statistical utility at the cost of individual precision.

Best for: Population health analytics, public health datasets, research that needs demographic or clinical signals without individual-level granularity.
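A small sketch of generalization over a single record (field names are illustrative): exact birth date becomes a year, a five-digit ZIP becomes its first three digits, and an exact age becomes a five-year band.

```python
from datetime import date

def generalize(record: dict) -> dict:
    """Reduce precision of quasi-identifiers while keeping the
    clinically useful fields intact."""
    out = dict(record)
    out["birth_year"] = out.pop("birth_date").year
    out["zip3"] = out.pop("zip")[:3]
    lo = (out.pop("age") // 5) * 5
    out["age_band"] = f"{lo}-{lo + 4}"
    return out

rec = {"birth_date": date(1987, 6, 3), "zip": "02139", "age": 37, "diagnosis": "J45"}
print(generalize(rec))
# {'diagnosis': 'J45', 'birth_year': 1987, 'zip3': '021', 'age_band': '35-39'}
```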

5. Data suppression

Suppression removes entire records or specific fields that pose an outsized re-identification risk. For example, removing records for patients with extremely rare diseases who could be identified by their diagnosis alone, even without direct identifiers.

Best for: Any de-identification workflow where outlier records create unacceptable re-identification risk. Suppression is typically used alongside other methods, not in isolation.
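In sketch form, suppression often means dropping records whose quasi-identifier combination is too rare, a k-anonymity-style threshold (the field names and k value below are illustrative):

```python
from collections import Counter

def suppress_rare(records, quasi_ids, k=5):
    """Drop any record whose quasi-identifier combination occurs fewer
    than k times; such outliers are the easiest to re-identify."""
    combo = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(combo(r) for r in records)
    return [r for r in records if counts[combo(r)] >= k]

data = [{"zip3": "021", "age_band": "35-39"}] * 6 + [{"zip3": "990", "age_band": "90+"}]
kept = suppress_rare(data, ["zip3", "age_band"], k=5)
print(len(kept))  # 6 -- the single outlier record is suppressed
```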

6. Noise addition / perturbation

Perturbation adds controlled statistical noise to numerical values, slightly altering ages, income figures, or lab values in ways that are indistinguishable at the individual level but preserve aggregate statistical distributions. Common in statistical disclosure limitation (SDL) for public datasets.

Best for: Published datasets, census data, financial aggregates where population-level trends are needed but individual-level precision must be eliminated.
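A simple perturbation sketch, adding Laplace-distributed noise via inverse-CDF sampling (the scale parameter is illustrative; calibrating it to a formal privacy budget, as in differential privacy, is a separate exercise):

```python
import math
import random

def perturb(values, scale=2.0, seed=0):
    """Add Laplace-distributed noise to each value. Individual numbers
    shift slightly; aggregates over large samples are roughly preserved."""
    rng = random.Random(seed)
    noisy = []
    for v in values:
        u = rng.random() - 0.5
        # Inverse CDF of the Laplace distribution centered at 0
        noisy.append(v - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u)))
    return noisy

ages = [34, 51, 47, 29, 62]
print(perturb(ages))  # each age nudged by a small random amount
```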

Choosing the right method: A quick reference

Use Case Recommended Method(s)
Clinical trial data sharing Pseudonymization + Suppression
Public research dataset publication Expert Determination + Generalization + Noise Addition
Software testing / QA Data Masking
Legal discovery fulfillment Redaction
Contact center call recording compliance Redaction or Masking of PCI/PII in transcripts
HIPAA Safe Harbor compliance Removal of all 18 identifiers
GDPR anonymization Combination approach with re-identification risk assessment

How to de-identify data: A step-by-step process

De-identification is not a single action. It is a documented workflow. For HIPAA Expert Determination in particular, the process must be auditable.

1. Inventory your data. Map where sensitive data lives: structured databases, unstructured documents, call recordings, clinical notes, chat logs, scanned forms. You cannot de-identify what you have not found.

2. Classify the data. Determine which regulatory framework applies (HIPAA, GDPR, CPRA, PCI DSS) and which entity types are present (PHI, PII, PCI). Different fields may fall under different standards within a single dataset.

3. Select the appropriate method(s). Match method to use case (see table above). Complex datasets typically require combining methods—generalization for dates, redaction for free-text fields, suppression for rare-condition outliers.

4. Apply de-identification. Execute the de-identification process. For unstructured text—clinical notes, call transcripts, chat logs, intake forms—this requires NLP-based entity recognition capable of identifying sensitive information in natural language, not just structured fields.

5. Assess residual re-identification risk. Evaluate whether the de-identified output can be re-identified, either through the remaining data itself or through linkage with external datasets. This step is required for HIPAA Expert Determination and GDPR anonymization.

6. Document the process. Record the method used, the expert or system that performed the analysis, the date, and the residual risk determination. This documentation is what makes your de-identification defensible in an audit or investigation.

7. Govern ongoing use. De-identification is not a one-time event. Establish policies governing who can access de-identified data, for what purposes, and with what safeguards against re-identification.
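Taken together, steps 3 through 6 can be sketched as a single auditable pipeline. Everything here is a placeholder: `detect`, `transform`, and `assess_risk` stand in for real entity-recognition and risk tooling, and the audit record is deliberately minimal.

```python
import datetime

def deidentify_dataset(records, detect, transform, assess_risk, method_name):
    """Run detection and transformation over every record, assess
    residual risk, and emit the audit record that makes the run
    defensible later."""
    processed = [transform(r, detect(r)) for r in records]
    audit_log = {
        "method": method_name,
        "date": datetime.date.today().isoformat(),
        "record_count": len(processed),
        "residual_risk": assess_risk(processed),
    }
    return processed, audit_log

out, log = deidentify_dataset(
    [{"note": "SSN 123-45-6789"}],
    detect=lambda r: ["SSN"],                       # stub detector
    transform=lambda r, entities: {"note": "[REDACTED]"},  # stub transform
    assess_risk=lambda rs: "very small",            # stub risk assessment
    method_name="safe_harbor_redaction",
)
print(log["residual_risk"])  # very small
```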

PII de-identification vs. PHI de-identification: Key differences

While PII and PHI de-identification share the same goal—removing individual identifiers—they differ in regulatory specificity, required rigor, and consequences for failure.

Dimension | PII De-identification | PHI De-identification
Governing framework | GDPR, CPRA, state laws (varies) | HIPAA (federal)
Defined identifier list | No universal list; context-dependent | Yes: 18 specific identifiers
Formal compliance pathway | No equivalent to Safe Harbor | Safe Harbor and Expert Determination
Required documentation | Varies by framework | Expert Determination requires written analysis
Penalty for breach | Varies widely by state/country | Statutory cap of $1.5M per violation category per year (inflation-adjusted)
Re-identification prohibition | CPRA explicitly prohibits; GDPR implied | No explicit HIPAA prohibition, but re-identified data becomes PHI again

For organizations operating across jurisdictions (a pharma company running trials in the U.S. and EU, for example) both sets of requirements apply simultaneously, which means de-identification workflows must satisfy the stricter of the two standards at every step.

Common de-identification failures (and why they happen)

Even well-intentioned de-identification programs fail. Understanding how is the first step to preventing it.

1. Missing indirect identifiers A dataset with name and SSN removed can still be re-identified if it contains a combination of ZIP code, birth date, and sex—a combination that uniquely identifies 87% of the U.S. population, according to research by Latanya Sweeney (Simple Demographics Often Identify People Uniquely, Carnegie Mellon University, Data Privacy Working Paper 3, 2000). De-identification tools that focus only on direct identifiers miss this class of risk entirely.

2. Inconsistent handling of unstructured text Clinical notes, discharge summaries, and call transcripts contain sensitive information in natural language—embedded in sentences, misspelled, abbreviated, or expressed across multiple lines. Structured-field redaction tools do not catch these. NLP-based entity recognition is required.

3. Low-accuracy tooling General-purpose language models and cloud NLP tools achieve as low as 60% accuracy on real healthcare data (Bressem et al., 2024). That is not a minor shortfall: it means up to four in ten sensitive identifiers may survive de-identification. At scale, in a dataset of 100,000 clinical notes, that is tens of thousands of exposed identifiers.

4. Missing documentation The de-identification process happened but was not recorded. Without a documented analysis, Expert Determination cannot be asserted in an audit. The work is invisible to regulators.

5. Re-identification through data linkage De-identified data is released and later re-identified by joining it with a publicly available dataset, a known attack vector. Risk assessment must account for what external data is available, not just what is in the dataset itself.
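Failure modes 1 and 5 are both quasi-identifier problems, and the exposure is measurable. A minimal sketch (field names illustrative): count how many records carry a quasi-identifier combination that appears only once in the dataset, since each such record is a re-identification candidate even with all direct identifiers removed.

```python
from collections import Counter

def unique_fraction(records, quasi_ids):
    """Fraction of records whose quasi-identifier combination is unique
    in the dataset -- a simple proxy for re-identification exposure."""
    combo = lambda r: tuple(r[q] for q in quasi_ids)
    counts = Counter(combo(r) for r in records)
    return sum(1 for r in records if counts[combo(r)] == 1) / len(records)

people = [
    {"zip": "02139", "dob": "1987-06-03", "sex": "F"},
    {"zip": "02139", "dob": "1990-01-15", "sex": "M"},
    {"zip": "02139", "dob": "1990-01-15", "sex": "M"},
]
print(unique_fraction(people, ["zip", "dob", "sex"]))  # one of three records is unique
```

A full Expert Determination risk analysis also weighs population size and plausible external linkage datasets, not just in-dataset uniqueness.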

De-identify data HIPAA: What expert determination actually requires

HIPAA Expert Determination is frequently cited but poorly understood in practice. Here is what the standard actually demands.

To satisfy Expert Determination, a covered entity must:

  • Engage a qualified person: someone with knowledge of and experience applying accepted principles of statistical and scientific methods to render information not individually identifiable
  • Apply those methods: not simply assert that data is de-identified
  • Determine that risk is very small: "very small" is the regulatory standard; there is no defined numerical threshold, which means the expert must exercise and document genuine judgment
  • Document the analysis: the determination must include the methods and results of the analysis
  • Retain the documentation: the covered entity must keep the expert's analysis for potential regulatory review

Expert Determination gives organizations more flexibility than Safe Harbor (for example, retaining age over 89 as a single category rather than redacting it), but it requires genuine statistical expertise and a documented, reproducible process.

Platforms purpose-built for PHI de-identification—like Limina—are designed to produce auditor-ready Expert Determination reports, including re-identification risk probabilities across both direct and quasi-identifiers, rather than requiring compliance teams to reconstruct the analysis after the fact.

Ready to de-identify your data to regulatory standards?

Most de-identification failures come down to three things: incomplete identifier coverage, low-accuracy tooling, and missing documentation. Limina is built to solve all three—with 99.96% expert determination accuracy on real healthcare data, support for 52 languages and 50+ entity types across unstructured text, and outputs designed to support Expert Determination documentation and HIPAA, GDPR, and CPRA compliance.

https://getlimina.ai/en/contact-us — see how enterprise-grade de-identification works on your actual data.

Read the complete data de-identification guide — our parent resource covering HIPAA Expert Determination, GDPR anonymization, and de-identification strategy for regulated industries.

Frequently Asked Questions

What is the difference between de-identification and anonymization?

De-identification and anonymization are often used interchangeably, but they carry different legal meanings depending on the framework. Under HIPAA, "de-identification" is the formal standard with two defined pathways (Safe Harbor and Expert Determination). Under GDPR, "anonymization" describes data where re-identification is not reasonably possible and the regulation no longer applies. Pseudonymized data — where a mapping key exists — is not anonymous under GDPR and remains subject to its protections.

Is de-identified data still subject to HIPAA?

No. Data that has been properly de-identified under either the Safe Harbor or Expert Determination method is no longer considered PHI and is not subject to the HIPAA Privacy Rule. However, if de-identified data is later re-identified — intentionally or through linkage — it becomes PHI again and HIPAA protections reattach.

What are the 18 HIPAA identifiers that must be removed for safe harbor de-identification?

The 18 identifiers include names; geographic data smaller than a state; all elements of dates (except year) directly related to an individual, plus all ages over 89; phone and fax numbers; email addresses; Social Security numbers; medical record and health plan beneficiary numbers; account numbers; certificate or license numbers; vehicle and device identifiers; URLs; IP addresses; biometric identifiers including fingerprints and voice prints; full-face photographs; and any other unique identifier or code. All 18 must be removed, and the covered entity must have no actual knowledge that the remaining data could identify an individual.

Can you de-identify unstructured text like clinical notes or call recordings?

Yes, but it requires more sophisticated tooling than structured-data de-identification. Clinical notes, discharge summaries, chat logs, and call transcripts contain sensitive information embedded in natural language — context-dependent, often abbreviated, and sometimes misspelled. Effective de-identification of unstructured text requires NLP-based named entity recognition trained on domain-specific data, not generic pattern matching or keyword lists.

What is the difference between data masking and data de-identification?

Data masking replaces sensitive values with realistic fictitious substitutes, primarily to enable safe use in non-production environments like software testing. Data de-identification is a broader regulatory concept covering any process that makes data non-identifiable to a defined legal standard. Masking is one technique used within a de-identification program, but de-identification for HIPAA or GDPR compliance requires meeting a specific legal standard — not simply substituting values.

How accurate does de-identification need to be for HIPAA compliance?

HIPAA does not specify a numerical accuracy threshold for de-identification. Safe Harbor requires that all 18 identifiers be removed and that no actual knowledge of re-identification exists. Expert Determination requires that risk of identification be "very small" — a qualitative standard that the qualified expert must assess and document. In practice, the higher the accuracy of the de-identification tool, the stronger the defensible position in an audit or investigation.

What is re-identification risk, and how is it assessed?

Re-identification risk is the probability that an individual in a de-identified dataset could be identified by an adversary with access to external information. It is assessed by considering the attributes remaining in the dataset (particularly quasi-identifiers like age, sex, and ZIP code), the population size, and the external datasets an adversary could plausibly use for linkage. Risk assessment is a required element of HIPAA Expert Determination and GDPR anonymization analysis, and it must be documented.