May 3, 2022

What Companies Should Know About PII & Protecting It

Personally Identifiable Information is one of the most regulated and misunderstood categories of data in business today. Here is what your organization needs to know about defining, managing, and protecting it.

Personally Identifiable Information, more commonly known as PII, sits at the center of nearly every data privacy conversation happening in business today. Regulations are tightening, enforcement is increasing, and the individuals whose data companies collect are paying closer attention than ever before. And yet, despite all of this visibility, PII remains widely misunderstood, particularly when it comes to what it actually includes, who bears responsibility for it, and what genuine protection looks like in practice.

This article breaks down the fundamentals that every company should understand about PII, from its legal definition and regulatory context to what a meaningful data protection strategy actually requires.

What Is PII, and Why Is the Definition More Complicated Than It Sounds?

Personally Identifiable Information is any data that can be used to identify a specific individual. On the surface, that sounds straightforward. In practice, it is considerably more nuanced.

PII generally falls into two categories: direct identifiers and quasi-identifiers. Direct identifiers are data points that are unique to an individual on their own. A full name, a Social Security number, a passport number, or a biometric record are all direct identifiers. If any of these appear in a dataset, they can identify a person without needing any additional context.

Quasi-identifiers are different. Individually, a date of birth, a postal code, or a general demographic such as race or gender cannot pinpoint a single person. But when combined, they can. A landmark study by Latanya Sweeney at Carnegie Mellon University demonstrated that 87% of the U.S. population could be uniquely identified using just three data points: date of birth, sex, and five-digit ZIP code. This is why quasi-identifiers are often treated as seriously as direct ones in mature privacy programs, and why companies that think they have removed the "obvious" PII from a dataset may be further from true de-identification than they realize.
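The mechanics of that re-identification risk can be sketched in a few lines. The records and values below are illustrative toy data (not the study's); the point is that counting how many people share each (date of birth, sex, ZIP) combination reveals exactly who is uniquely identifiable:

```python
from collections import Counter

# Toy records: no single field identifies anyone on its own,
# but the combination of all three often does.
records = [
    {"dob": "1984-03-12", "sex": "F", "zip": "02138"},
    {"dob": "1984-03-12", "sex": "M", "zip": "02138"},
    {"dob": "1990-07-01", "sex": "F", "zip": "02139"},
    {"dob": "1990-07-01", "sex": "F", "zip": "02139"},  # shares all three fields
]

# Count how many records share each (dob, sex, zip) combination.
combo_counts = Counter((r["dob"], r["sex"], r["zip"]) for r in records)

# A record is uniquely identifiable when its combination occurs exactly once.
unique = [r for r in records if combo_counts[(r["dob"], r["sex"], r["zip"])] == 1]
print(f"{len(unique)} of {len(records)} records are unique on three quasi-identifiers")
```

Here two of the four toy records are pinned down by the combination alone, even though every individual field is shared with someone else. Sweeney's result is this same calculation run against real census data.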

Adding to this complexity is the fact that the definition of PII is not fixed. It varies across jurisdictions, and it continues to evolve as regulators respond to new technologies and new ways of inferring identity from data.

How Does the Definition of PII Vary Across Regulations?

There is no single, globally accepted definition of PII. Different regulatory frameworks define it differently, and the gap between some of these definitions is significant enough to create real compliance risk for companies operating across borders.

In the United States, PII is generally understood as information that can be used to distinguish or trace an individual's identity, either alone or when combined with other personal or identifying information. Various federal and state regulations add their own layers of specificity: HIPAA governs protected health information (PHI), and its Safe Harbor de-identification method enumerates 18 identifiers that must be removed, while the CCPA in California extends privacy rights to a broad range of personal information, including inferences drawn from consumer data.

In Europe, the General Data Protection Regulation takes a notably broader approach. Under the GDPR, personal data is defined as any information relating to an identified or identifiable natural person. This includes not just traditional identifiers but also location data, online identifiers such as IP addresses, and factors specific to an individual's physical, physiological, genetic, mental, economic, cultural, or social identity. It is worth noting that PII and "personal data" are related but not identical concepts, and the GDPR's definition of personal data is intentionally broader to capture the full range of ways a person can be identified in a digital environment.

What this means for companies is that a data handling practice that is compliant in one jurisdiction may create liability in another. A dataset that qualifies as de-identified under HIPAA's Safe Harbor method may still contain information that would be considered personal data under the GDPR. Organizations that operate globally, or that handle data from individuals in multiple countries, need to build their privacy programs around the broadest applicable standard, not the most permissive one.

Staying current on regulatory changes is not optional for compliance teams. The regulatory landscape for PII continues to develop, and what constitutes personal data under today's law is meaningfully different from what it was five years ago.

Who Is Responsible for Protecting PII?

This question comes up often, and the answer is less ambiguous than many companies would prefer it to be: any organization that collects personal data from individuals is responsible for protecting it.

That responsibility does not transfer when data moves to a vendor or a cloud environment. Under most modern privacy frameworks, including the GDPR, companies act as either data controllers (who determine the purpose and means of processing) or data processors (who process data on behalf of a controller). Both bear distinct legal obligations, and the controller remains ultimately accountable for ensuring that any processor it works with provides sufficient data protection guarantees.

As consumers continue to demand greater privacy, this accountability is no longer simply a legal formality. It is a business and reputational issue. High-profile breaches and regulatory penalties have made clear that the cost of failing to protect PII is not limited to fines. It includes lost customer trust, damaged brand equity, and in some industries, the loss of operating licenses or contracts.

For companies in regulated sectors, this is especially pronounced. Healthcare organizations face HIPAA enforcement and HHS audit risk. Financial services firms navigate a web of federal and state-level obligations including GLBA, NYDFS requirements, and sector-specific SEC guidance. Pharmaceutical and life sciences companies must protect patient data in clinical trial documentation, adverse event reports, and real-world evidence datasets. Insurance carriers handle sensitive health and financial data under frameworks that vary significantly by state. Contact centers routinely capture personal information in call recordings and transcripts that may be subject to multiple overlapping regulations.

The common thread across all of these sectors is that responsibility for PII protection cannot be delegated away. It has to be actively managed.

What Does It Actually Take to Protect PII?

Understanding the scope of your PII exposure is the essential first step. Companies often underestimate how much personal data they collect, where it lives, and in what formats it appears. PII does not only exist in structured databases with clearly labeled fields. It appears in unstructured formats too: support tickets, call recordings, clinical notes, email threads, PDF documents, and chat logs all frequently contain personal information that is not captured in any formal data inventory.

Before a company can protect PII, it needs to understand what it has and where it lives. That requires systematically scanning not just structured data environments but unstructured text as well.
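As a rough illustration of what a first inventory pass over unstructured text might look like, here is a minimal regex-based scan. The patterns, labels, and sample ticket are hypothetical, and this naive approach has exactly the blind spots the article discusses below, as the uncaught customer name shows:

```python
import re

# Naive first-pass scan for direct identifiers in free text.
# These patterns are illustrative; a real inventory needs far broader coverage.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scan(text: str) -> dict[str, list[str]]:
    """Return every match for each pattern found in the text."""
    return {label: pat.findall(text) for label, pat in PATTERNS.items()}

ticket = ("Customer Jane Roe (jane.roe@example.com) called from "
          "555-867-5309 about SSN 123-45-6789.")
hits = scan(ticket)
print(hits)
```

The scan surfaces the SSN, email, and phone number, but "Jane Roe" passes straight through: a name in running prose matches no fixed format, which is why inventorying unstructured data cannot stop at pattern matching.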

Once a company has that visibility, the next priority is evaluating its data management protocols. What controls are in place to limit access to PII? How is PII handled when data is shared internally for analytics, or externally with vendors and partners? Are there documented retention policies that ensure data is not held longer than necessary? These are the operational questions that determine whether a privacy program exists on paper or in practice.

Is It Possible to Remove PII Completely?

This is one of the most common misconceptions in data privacy, and it is worth addressing directly.

No de-identification process can guarantee that 100% of PII has been removed from a dataset. This is not a failure of technology. It is a mathematical reality of the problem. Quasi-identifiers can always be recombined in ways that were not anticipated. New inference techniques can extract identity signals from data that was previously considered safe. The context in which data appears can make it identifiable even when individual fields have been redacted.

What good de-identification does is reduce the risk of re-identification to a level that is acceptable under applicable regulatory standards, while preserving enough of the data's utility for the intended use case. Limina's whitepaper on redaction accuracy provides a detailed technical look at what this means in practice, including how different de-identification approaches perform against real-world text.
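One common way to make "reduced risk" concrete is a k-anonymity-style generalization step, in which quasi-identifiers are coarsened until every record shares its combination with at least k others. The article does not prescribe this particular method; the sketch below, on illustrative data, simply shows the trade between precision and anonymity-set size:

```python
from collections import Counter

# Illustrative rows of (date of birth, sex, five-digit ZIP).
rows = [
    ("1984-03-12", "M", "02138"),
    ("1984-07-21", "M", "02138"),
    ("1984-05-30", "F", "02139"),
    ("1984-11-02", "F", "02139"),
]

def generalize(dob: str, sex: str, zip5: str) -> tuple[str, str, str]:
    # Coarsen: keep only the birth year and the first three ZIP digits.
    return (dob[:4], sex, zip5[:3])

def smallest_group(data) -> int:
    """k = size of the smallest group sharing a quasi-identifier combination."""
    return min(Counter(data).values())

print("raw k =", smallest_group(rows))                                  # k = 1
print("generalized k =", smallest_group(generalize(*r) for r in rows))  # k = 2
```

On the raw rows every record is unique (k = 1, everyone re-identifiable); after generalization each record hides among at least two (k = 2), at the cost of losing exact birth dates and full ZIP codes. That loss of utility is the trade-off the surrounding paragraphs describe.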

This is precisely why accuracy matters so much in a de-identification solution. Missed entities are not an acceptable margin of error when the stakes involve regulatory compliance or patient safety. At the same time, over-redaction that strips useful context from a document defeats the purpose of making the data available for secondary use in the first place.

What Should Companies Look for in a PII Protection Solution?

When evaluating whether to build a de-identification solution in-house or work with an external vendor, companies should be realistic about what each path requires and what gaps each leaves.

Building in-house can seem appealing because it offers control. But maintaining a PII detection engine that stays current with evolving entity types, multiple languages, and the nuances of unstructured text is a significant engineering and operational investment. Pattern matching approaches, which rely on predefined rules and regular expressions, are fast to implement but brittle in practice. They miss entities that fall outside expected formats, and they have no ability to understand the context in which a name, number, or identifying phrase appears.

This is the core limitation of rule-based de-identification: it treats PII as a formatting problem when it is actually a language understanding problem. A patient's name might appear in the body of a clinical note as part of a sentence, not in a labeled field. A financial account number might be written out in words, not digits. A quasi-identifier combination might be spread across multiple paragraphs in a way that no pattern-matching rule could catch.
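That limitation is easy to demonstrate. A hypothetical rule that looks for ten-digit account numbers matches the digit form but returns nothing at all when the same number is written out in words:

```python
import re

# Hypothetical rule: a ten-digit account number.
ACCOUNT = re.compile(r"\b\d{10}\b")

caught = ACCOUNT.findall("Account 4417093821 was flagged for review.")
missed = ACCOUNT.findall(
    "Her account, four four one seven zero nine three eight two one, was flagged."
)

print(caught)  # the digit form matches
print(missed)  # the spelled-out form produces no match
```

Both sentences disclose the same identifier, but only one has the expected format. Catching the second requires understanding what the sentence is saying, which is the "language understanding problem" described above.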

Limina's data de-identification solution was built by linguists, which means it is designed to understand language, not just scan for it. Rather than relying on pattern matching alone, it understands entity relationships and the contextual signals within a document that determine whether a piece of information is identifying. The result is de-identification that is more accurate across more entity types and more languages than approaches built purely on rules or general-purpose machine learning models.

That level of precision matters in practice. Limina processes more than 70,000 words per second, detects over 50 entity types, and supports 52+ languages, with a redaction accuracy rate above 99.5%. Organizations like Providence Health and Boehringer Ingelheim rely on this precision because their compliance and data use requirements do not leave room for error.

If your organization is evaluating its current approach to PII protection, this is the right time to take stock of both the solution's capabilities and its documented limitations. The right vendor should be able to show you, with evidence, where their approach performs and where it does not.

Speak with the Limina team to get a clear picture of what accurate, context-aware de-identification looks like for your specific data environment.

Building a Privacy Program That Holds Up

Protecting PII is not a one-time project. It is an ongoing operational responsibility that evolves alongside your data environment, your technology stack, and the regulatory frameworks that govern your industry.

The companies that handle this well tend to share a few characteristics. They know what data they collect and why. They have documented processes for handling that data across its lifecycle, from collection through deletion. They evaluate their vendors and tools with the same rigor they would apply to any other compliance function. And they treat privacy not as a checkbox but as a commitment that is reflected in how products are designed and how data decisions are made.

That last point is becoming increasingly important as AI adoption accelerates. Machine learning models trained on data that contains PII can inadvertently expose that information through their outputs. Systems that use PII-containing data for analytics, fine-tuning, or knowledge base construction create risks that did not exist in purely transactional data environments. Building privacy into AI workflows from the beginning, rather than attempting to retrofit it later, is both a compliance requirement and a competitive advantage for companies that handle sensitive data.

If your organization is looking to evaluate or strengthen its approach to PII protection, Limina's de-identification platform is purpose-built for the regulated industries where the stakes are highest.

Contact us to start the conversation.
