5 Data Masking Techniques and Why You Need Them

afranczyk - August 15, 2022

What Is Data Masking?

Data masking is a method of creating structurally similar but inauthentic versions of sensitive data. Masked data is useful for many purposes, including software testing, user training, and machine learning datasets. The intent is to protect the real data while providing a functional alternative when the real data is not needed.

Many organizations have robust data security controls to protect their production data but much less stringent controls when data is used for non-production purposes. This can create major security and compliance risks, especially when data is used by third parties outside of the organization’s control. Data masking can alleviate these concerns, ensuring that whenever data is transferred outside production environments, it is masked to prevent compromise.

An important principle of data masking is that the data format remains the same—only the values change. Data can be modified in a variety of ways, including encryption, character shuffling, or dictionary substitution. The objective is to ensure that unauthorized parties who obtain the masked data will have no way to reconstruct the original data.

Why Is Data Masking Important?

Data masking is critical to data security because it can limit the impact of a data breach. Consider a database table with sensitive financial data and less sensitive customer data. Sales personnel should have access to the customer data but not the financial data. Using data masking, the financial data columns can be masked, and if an attacker compromises a salesperson’s account, they will not be able to access the financial data.

For the same reason, data masking can protect against insider threats. If a malicious employee attempts unauthorized access to data using their account, the data will be masked, limiting the damage they can do.

Many data security regulations call for masking of the data they define as protected. For example, the General Data Protection Regulation (GDPR) names pseudonymization as an appropriate safeguard for personal data and does not apply to data that has been truly anonymized. Pseudonymization replaces identifying values with substitutes that do not expose the identity of a living person unless they are combined with additional information that is kept separately.

Types of Data Masking

1. Static Data Masking

Static data masking is typically applied to a copy or backup of the production database. It alters the data so that it can be used for development, testing, and training without divulging sensitive information. It works as follows:

  • Create a “golden copy” of the production database and move it to a secure location.
  • Delete data that is not needed and mask the sensitive data columns.
  • Save a copy of the masked database to the less secure development, testing, or training environment.
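The steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a production workflow: the rows are plain dictionaries standing in for database records, and the function name `static_mask`, the column names, and the `XXXX` placeholder are all assumptions made for the example.

```python
def static_mask(golden_rows, drop_columns, mask_columns, mask_value="XXXX"):
    """Build a masked copy of the rows; the golden copy is not modified."""
    masked = []
    for row in golden_rows:
        # Step 2a: delete columns that the target environment does not need.
        new_row = {k: v for k, v in row.items() if k not in drop_columns}
        # Step 2b: overwrite sensitive columns with a placeholder value.
        for col in mask_columns:
            if col in new_row:
                new_row[col] = mask_value
        masked.append(new_row)
    return masked


# Step 1: the "golden copy" taken from production (hypothetical sample data).
golden = [
    {"id": 1, "name": "Alice", "ssn": "123-45-6789", "notes": "internal"},
    {"id": 2, "name": "Bob", "ssn": "987-65-4321", "notes": "internal"},
]

# Step 3: this masked copy is what would be shipped to dev, test, or training.
masked_rows = static_mask(golden, drop_columns={"notes"}, mask_columns={"ssn"})
```

A real pipeline would usually replace the constant placeholder with one of the masking techniques described later in this article, so that the masked values keep a realistic format.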

2. Dynamic Data Masking (On-the-Fly Data Masking)

Unlike static methods, dynamic data masking does not require copying the database to a new environment to create masked data. Instead, data is kept in the original database, and a dynamic mechanism masks the relevant data at query time depending on the authorization of the current user account. This ensures only authorized personnel can see the sensitive values. Dynamic data masking is supported natively by several major commercial databases and can also be implemented via a reverse proxy that sits between users and the database.

Dynamic data masking is suitable for organizations that continuously deploy software or have databases that are integrated with many other systems, making it impractical to perform static data masking.
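To make the idea concrete, here is a small sketch of masking at read time. It does not reflect any particular database's or vendor's implementation; the role names, column names, and the `read_row` helper are hypothetical. The stored row is never changed, only the view returned to the caller.

```python
def read_row(row, role, sensitive_columns, authorized_roles):
    """Return the row as the given role is allowed to see it."""
    if role in authorized_roles:
        # Authorized roles receive the real values.
        return dict(row)
    # Everyone else receives the same columns with sensitive values masked.
    return {
        column: ("****" if column in sensitive_columns else value)
        for column, value in row.items()
    }


stored_row = {"customer": "Acme Ltd", "region": "EU", "credit_limit": 50000}
SENSITIVE = {"credit_limit"}
AUTHORIZED = {"finance"}

sales_view = read_row(stored_row, "sales", SENSITIVE, AUTHORIZED)
finance_view = read_row(stored_row, "finance", SENSITIVE, AUTHORIZED)
```

In this sketch, `sales_view` contains the customer and region but a masked credit limit, while `finance_view` matches the stored row.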

3. Deterministic Data Masking

Deterministic data masking maps original values to masked values, ensuring that data is always replaced consistently in all tables. This can be important to retain data integrity. For example, if the data contains names, the name “John” will always be replaced with “Samuel” in all relevant tables.
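One possible way to get this consistency without storing a lookup table is to derive the replacement from a keyed hash of the original value. The sketch below assumes a secret key and a small list of replacement names, both invented for illustration. With such a short list, different real names can map to the same replacement, so a real deployment would need a much larger dictionary or a different scheme if uniqueness matters.

```python
import hashlib
import hmac

# Hypothetical replacement dictionary; a real one would be far larger.
FAKE_NAMES = ["Samuel", "Maria", "Chen", "Priya", "Omar", "Elena"]


def mask_name(name, key):
    """Map a name to a replacement that is stable for a given key."""
    # HMAC-SHA256 yields the same digest for the same (key, name) pair,
    # so the same input always selects the same replacement.
    digest = hmac.new(key, name.encode("utf-8"), hashlib.sha256).digest()
    index = int.from_bytes(digest[:8], "big") % len(FAKE_NAMES)
    return FAKE_NAMES[index]


key = b"demo-key-not-for-production"  # assumption: kept secret in practice
customers_table_value = mask_name("John", key)
orders_table_value = mask_name("John", key)  # same replacement as above
```

Because the mapping depends only on the key and the input, the same name is replaced identically in every table, which keeps joins between masked tables working.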

4. Statistical Data Obfuscation

Production data often includes numerical data, which can be masked through statistical techniques. For example, the data can be aggregated into sums, means, or medians, or it can be described using histograms without sharing the underlying data values.
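As a rough illustration, the following sketch replaces a list of individual values with aggregates and a histogram. The salary figures and the `summarize` helper are made up for the example. Note that aggregates over very small groups can still reveal individual values, so the group size matters in practice.

```python
from collections import Counter
from statistics import mean, median


def summarize(values, bucket_size):
    """Describe the values with aggregates instead of sharing them."""
    # Each value is assigned to the lower bound of its histogram bucket.
    histogram = Counter((v // bucket_size) * bucket_size for v in values)
    return {
        "count": len(values),
        "sum": sum(values),
        "mean": mean(values),
        "median": median(values),
        "histogram": dict(sorted(histogram.items())),
    }


salaries = [42000, 48000, 51000, 67000, 69000]  # hypothetical sample data
summary = summarize(salaries, bucket_size=10000)
```

Only `summary` would be shared with the non-production consumer; the `salaries` list itself stays in the protected environment.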

5. Unstructured Data Masking

Sensitive data is not limited to database tables. Scanned documents and image files, such as identity documents, insurance claims, and financial documents, can also contain sensitive data. Unstructured data masking relies on optical character recognition (OCR) and ensures that regions in an image containing sensitive information are blurred or replaced with alternative data.

Data Masking Techniques

Here are some notable data masking techniques.

Data Encryption

Data encryption is one of the most effective and widely used data masking techniques. Encryption algorithms convert raw data into an unreadable format that can only be restored with a secret decryption key.

Encryption is suitable for production data that must be restorable to its original form. Encrypted data is only secure if access to the decryption key is limited to authorized users. If a key is compromised, an unauthorized user could decrypt the sensitive data and view it in its raw form. Secure key management is thus essential.

Data Scrambling

Data scrambling is a simple masking method that rearranges the characters of a value in random order. While this technique is easy to implement, it only works with certain data types. It is also one of the weaker data masking approaches, making it unsuitable for many sensitive use cases.

For example, an employee ID might be a number—687514. Once scrambled, the ID will contain the same numbers in a different order—716854. An unauthorized user could easily guess the original code by playing around with the order.
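The employee ID example can be reproduced with a short sketch. The `scramble` function name is an assumption, and the random generator is passed in so that a test can seed it. The output always contains exactly the same characters as the input, which is why this technique is weak.

```python
import random


def scramble(value, rng):
    """Return the characters of value in a random order."""
    chars = list(value)
    rng.shuffle(chars)  # in-place shuffle of the character list
    return "".join(chars)


employee_id = "687514"
scrambled_id = scramble(employee_id, random.Random())
```

For a six-digit ID with distinct digits there are only 720 possible orderings, so an attacker who knows that scrambling was used can enumerate every candidate.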

Data Substitution

Substitution involves replacing the original data with different values. It is a particularly effective masking technique that preserves the data’s original qualities without exposing its real content. However, substitution only works with specific data types like lists of items in a certain category (e.g., a file containing user names). It is also a more complex data masking method to implement.

Data Shuffling

The shuffling technique is a form of data substitution that retains the original values but rearranges their order within a column. For example, in a shuffled table of user names, each real name appears in a different row from the one it belongs to. The shuffled data looks real but doesn’t reveal the true information about the items it lists. The drawback of this approach is that bad actors can reverse engineer the shuffled data if they understand the algorithm.
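A minimal sketch of shuffling, assuming the table is a list of dictionaries and that the values of a single column are redistributed across rows. The `shuffle_column` helper and the sample rows are hypothetical.

```python
import random


def shuffle_column(rows, column, rng):
    """Return new rows with one column's values redistributed across rows."""
    values = [row[column] for row in rows]
    rng.shuffle(values)
    # Every other column stays attached to its original row.
    return [{**row, column: value} for row, value in zip(rows, values)]


rows = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Carol"},
    {"id": 4, "name": "Dave"},
]
shuffled = shuffle_column(rows, "name", random.Random())
```

The set of names is unchanged, so column-level statistics survive, but a random shuffle may leave some values in their original row. Small tables in particular give weak protection.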

Data Pseudonymization

Pseudonymization, as defined in the EU’s GDPR, means processing personal data so that it can no longer be attributed to a specific person without additional information that is kept separately. It covers various ways to protect personal information, including encryption, shuffling, and hashing.

The pseudonymization process prevents unauthorized parties from identifying individuals based on their data. This includes removing direct identifiers as well as indirect indicators that an attacker could combine to identify an individual. It is important to protect pseudonymized data by storing the encryption keys and any other secrets needed to recover the original data securely and separately.
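The separation between the pseudonymized data and the secret needed to reverse it can be sketched with a simple token table. The `Pseudonymizer` class and the `user_` token prefix are invented for this example. In practice the mapping would live in a separately secured store rather than in memory next to the data.

```python
import secrets


class Pseudonymizer:
    """Replaces identifiers with random tokens and keeps the mapping apart."""

    def __init__(self):
        # These mappings are the "additional information" that must be
        # stored securely and separately from the pseudonymized data.
        self._forward = {}
        self._reverse = {}

    def pseudonymize(self, identifier):
        if identifier not in self._forward:
            token = "user_" + secrets.token_hex(8)
            self._forward[identifier] = token
            self._reverse[token] = identifier
        return self._forward[identifier]

    def reidentify(self, token):
        # Only possible for someone who holds the mapping.
        return self._reverse[token]


vault = Pseudonymizer()
token = vault.pseudonymize("jane.doe@example.com")
same_token = vault.pseudonymize("jane.doe@example.com")
original = vault.reidentify(token)
```

Anyone holding only the tokens cannot recover the identifiers, while the holder of the mapping can, which is what distinguishes pseudonymization from anonymization.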

Data Nullification

Nullification applies null values to columns of data, preventing unauthorized users from seeing the real data. It’s easy to implement, but it can undermine data integrity, which is usually a problem in development and test environments.

Date/Number Variance

The variance technique helps obscure sensitive numeric and date values, such as salaries or the dates of financial transactions. For instance, applying a small random variance (e.g., ±5%) to every salary in a table masks the individual values while keeping the range from lowest to highest, and the overall distribution, roughly intact. Dates can be shifted by a random number of days in the same way.
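A small variance such as the 5% mentioned above could be applied roughly as follows. The helper names, the seven-day window for dates, and the sample values are assumptions made for this sketch.

```python
import random
from datetime import date, timedelta


def vary_number(value, max_fraction, rng):
    """Shift a number by a random fraction within +/- max_fraction."""
    factor = 1 + rng.uniform(-max_fraction, max_fraction)
    return round(value * factor, 2)


def vary_date(value, max_days, rng):
    """Shift a date by a random whole number of days within +/- max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random()
salary = 50000
payment_date = date(2022, 8, 15)

masked_salary = vary_number(salary, 0.05, rng)   # between 47500 and 52500
masked_date = vary_date(payment_date, 7, rng)    # within one week either way
```

Each value gets its own random offset, so the exact figures are hidden while totals and trends stay close to the originals.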

Data Masking with Pathlock

Pathlock Security Platform’s dynamic data masking capabilities provide fine-grained control over which sensitive data fields can be masked for any specified user in the context of any situation. Pathlock allows companies to:

  • Centralize data masking enforcement throughout their ERP ecosystem with a single ruleset.
  • Deploy dynamic policies that account for risk based on the context of access, such as location, IP address, time, data sensitivity, and more.
  • Protect sensitive data in production and non-production environments.
  • Align data masking controls with existing governance (corporate) policies.
  • Mask sensitive PII based on the data subjects’ residency (country/nationality).
  • Mask data fields in transactions (Tcodes) that are unnecessary for a role.

Get in touch with us for a demo and see for yourself how Pathlock can improve data security and reduce compliance risk with a fully dynamic data masking solution.
