A large solutions and services company facing strict regulatory compliance requirements needed a powerful, scalable enterprise data protection solution for data being migrated to Amazon S3, Amazon Athena, Amazon Redshift, and AWS Glue environments. The sensitive data included HR, financial, and customer information. Using Protegrity’s field-level data protection, the company met these requirements and significantly improved its data protection processes.
Protegrity Tokenization
Tokenization is a non-mathematical approach to protecting data while preserving its type, format, and length. Tokens appear similar to the original value and can keep sensitive data fully or partially visible for data processing and analytics. Historically, vault-based tokenization has used a database table of lookup pairs that associate each token with the encrypted sensitive value.
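For illustration, here is a toy Python sketch of the vault-based lookup-pair approach. It is not Protegrity’s implementation: a real vault would encrypt the stored values and run on a hardened database, and the class and helper names here are invented for the example.

```python
import secrets

class TokenVault:
    """Toy vault-based tokenizer: a lookup table pairs each token with the
    original value. (A real vault stores the value encrypted; we skip that
    to keep the sketch short and runnable.)"""

    def __init__(self) -> None:
        self._vault: dict[str, str] = {}  # stand-in for a database table

    def tokenize(self, value: str) -> str:
        # Emit random digits of the same length, so the token preserves
        # the type, format, and length of the original numeric value.
        token = "".join(secrets.choice("0123456789") for _ in value)
        self._vault[token] = value  # the table grows with every new value
        return token

    def detokenize(self, token: str) -> str:
        return self._vault[token]

vault = TokenVault()
token = vault.tokenize("4111111111111111")   # e.g. '8302945716402853'
assert vault.detokenize(token) == "4111111111111111"
```

Note how the vault table grows with every unique value tokenized; that lookup store is exactly the data management and scalability burden that vaultless tokenization eliminates.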
Protegrity Vaultless Tokenization (PVT) uses innovative techniques to eliminate the data management and scalability problems typically associated with vault-based tokenization. With Protegrity integrated into AWS services such as Amazon Redshift, S3, RDS, and Athena, data can be tokenized or de-tokenized (re-identified) with SQL, depending on the user’s role and the governing Protegrity security policy.
Here’s an example of tokenized (de-identified) personally identifiable information (PII) that preserves analytic usability. The email address is tokenized while the domain name is kept in the clear, and the date of birth (DOB) is tokenized except for the year; the other fields shown in green are fully tokenized. This tokenization strategy enables age-based analytics on balance, credit, and medical data.
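To make the strategy concrete, here is a minimal Python sketch of partial, format-preserving tokenization. It is an illustration only, not Protegrity’s algorithm: the `_scramble` helper simply substitutes random characters of the same class, whereas real tokenization is policy-driven and reversible.

```python
import secrets
import string

def _scramble(text: str) -> str:
    # Replace each character with a random one of the same class, preserving
    # type, format, and length. (A real format-preserving scheme would also
    # keep values such as month/day within valid ranges.)
    out = []
    for ch in text:
        if ch.isdigit():
            out.append(secrets.choice(string.digits))
        elif ch.isalpha():
            out.append(secrets.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators like '.', '@', '-'
    return "".join(out)

def tokenize_email(email: str) -> str:
    # Tokenize the local part, keep the domain in the clear.
    local, domain = email.split("@", 1)
    return f"{_scramble(local)}@{domain}"

def tokenize_dob(dob: str) -> str:
    # Tokenize month and day, keep the year for age-based analytics.
    year, month, day = dob.split("-")
    return f"{year}-{_scramble(month)}-{_scramble(day)}"

print(tokenize_email("jane.doe@example.com"))  # e.g. 'qkzr.vmx@example.com'
print(tokenize_dob("1984-06-23"))              # e.g. '1984-51-07'
```

Because the year survives tokenization, queries can still bucket customers by age while the full DOB stays protected.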
Protegrity for Amazon Services
Protegrity, a global leader in data security, provides data tokenization for AWS services such as Amazon Redshift, Athena, S3, Glue, EMR, Kinesis, and RDS by employing a cloud-native, serverless architecture.
The solution scales elastically to seamlessly meet the on-demand, intensive processing workloads of these services. Serverless tokenization with Protegrity delivers sensitive data protection at the performance and on-demand scale organizations need.
Solution Overview
To comply with federal regulations, the services company needed a solution that could scale with its increasing data usage, deploy across different enterprise platforms under a single pane of protection, and pass the federal audit. Because the sensitive data existed across on-premises and cloud platforms, the solution had to be flexible enough to protect data anywhere and unprotect it anywhere. It also needed role-based access control, so that sensitive data could be unprotected only by individuals with the authority to see it in the clear.
After discovery sessions with Protegrity, the customer decided to deploy Protegrity’s Enterprise Security Administrator (ESA), along with Application Protector for C/C++ and Java and Cloud Protect for AWS, among other data protection capabilities. Sensitive data resided on an on-premises cluster, flowed through an S3 bucket via Informatica, and was persisted into Amazon Redshift and RDS clusters for analytics.
With Protegrity’s Application Protector for C/C++, data flowing through Informatica was protected before it reached the cloud. Unprotected data landing directly in Amazon S3 was protected using Protegrity’s Cloud Protect API. For unprotection, Amazon Redshift and Athena were used, while for Amazon RDS the Cloud API was leveraged.
Solution Architecture for Amazon S3
Protegrity built an ETL process using two Amazon S3 buckets to separate input and output data: a landing zone for incoming sensitive data and a processed zone that stores the resulting protected data.
The S3 protector is triggered as new files land in the landing zone bucket; it reads and processes the data based on a configuration file and writes the protected output to a file in the processed zone bucket. The protected data can form the basis of a secure data lake for Amazon Athena or Amazon EMR, or be loaded into a data warehouse such as Amazon Redshift or a database such as Amazon RDS.
Protegrity’s Cloud Protect offers protectors for these services and enables authorized users to unprotect the data on read.
The S3 protector supports the following file formats:
- Text formats (comma-delimited, tab-delimited, custom)
- Parquet
- JSON
- Excel
Files in any of these formats may optionally be gzipped.
The Cloud Protect S3 solution is deployed on AWS Lambda and invokes the Protegrity Cloud API on AWS to protect the data.
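As an illustration of this flow, here is a hedged Python sketch of a landing-zone Lambda handler for CSV input. The bucket name, field list, and `protect()` helper are assumptions standing in for the configuration file and the Protegrity Cloud API call; this is not Protegrity’s actual code.

```python
import csv
import io
import urllib.parse

import boto3

s3 = boto3.client("s3")
PROCESSED_BUCKET = "example-processed-zone"  # assumption: output bucket name
FIELDS_TO_PROTECT = {"ssn", "email"}         # assumption: from the config file

def protect(value: str) -> str:
    """Placeholder: the real solution calls the Protegrity Cloud API here."""
    return "<protected>"  # stand-in only

def handler(event, context):
    # Triggered by S3 ObjectCreated events on the landing-zone bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = list(csv.DictReader(io.StringIO(body)))
        if not rows:
            continue

        # Protect only the configured fields; leave the rest in the clear.
        for row in rows:
            for field in FIELDS_TO_PROTECT & row.keys():
                row[field] = protect(row[field])

        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
        s3.put_object(Bucket=PROCESSED_BUCKET, Key=key,
                      Body=out.getvalue().encode("utf-8"))
```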
The solution scales to process thousands of files in parallel, up to regional AWS quotas. A separate Lambda instance processes each file, so there is an upper bound on file size, approximately 3 GB, set by the Lambda timeout period. However, larger files can be split to provide greater parallelism and to ensure processing completes within the maximum 15-minute Lambda timeout, as sketched below.
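Here is a minimal sketch of that splitting step, under the assumption that chunks are written back to the landing zone for the protector trigger to pick up. The chunk size and key naming are illustrative, and event filtering would need to keep the splitter from re-processing its own output.

```python
import csv
import io

import boto3

s3 = boto3.client("s3")
ROWS_PER_CHUNK = 1_000_000  # assumption: tune to finish within the timeout

def split_object(bucket: str, key: str) -> list[str]:
    # For brevity this loads the whole file; a production splitter would
    # stream the object (e.g. ranged GETs) instead.
    text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    reader = csv.reader(io.StringIO(text))
    header = next(reader)

    chunk_keys, rows = [], []

    def flush() -> None:
        # Write the current chunk, repeating the header so each chunk is
        # a self-contained CSV that one Lambda invocation can process.
        out = io.StringIO()
        writer = csv.writer(out)
        writer.writerow(header)
        writer.writerows(rows)
        chunk_key = f"{key}.part{len(chunk_keys):04d}.csv"
        s3.put_object(Bucket=bucket, Key=chunk_key,
                      Body=out.getvalue().encode("utf-8"))
        chunk_keys.append(chunk_key)

    for row in reader:
        rows.append(row)
        if len(rows) >= ROWS_PER_CHUNK:
            flush()
            rows = []
    if rows:
        flush()
    return chunk_keys  # each chunk then triggers its own Lambda invocation
```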
Below are example benchmarks for different CSV file sizes:
Solution Architecture for Amazon Redshift
Amazon Redshift Lambda UDFs are architected to perform efficiently and securely. When you execute a Lambda UDF, each slice in the Amazon Redshift cluster batches the applicable rows after filtering and sends those batches to your Lambda function.
The federated user identity is included in the payload; the Lambda function compares it with the Protegrity security policy to determine whether partial or full access to the data is permitted.
The number of parallel requests to Lambda scales linearly with the number of slices in your Amazon Redshift cluster, with up to 10 parallel invocations per slice. To learn more, see the Amazon Redshift architecture.
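The shape of such a function follows AWS’s documented Lambda UDF request/response contract. The Python sketch below is illustrative, not Protegrity’s protector: the `AUTHORIZED_USERS` set and `detokenize()` placeholder stand in for the policy check and de-tokenization the real solution performs.

```python
import json

AUTHORIZED_USERS = {"analyst_pii"}  # assumption: users allowed clear-text access

def detokenize(token: str) -> str:
    """Placeholder for the real de-tokenization call."""
    return "<clear-text>"  # stand-in only

def handler(event, context):
    try:
        user = event["user"]       # federated identity included in the payload
        rows = event["arguments"]  # each row is a list of UDF arguments

        if user in AUTHORIZED_USERS:
            results = [detokenize(row[0]) for row in rows]
        else:
            # Unauthorized callers get the token back, not the clear value.
            results = [row[0] for row in rows]

        return json.dumps({"success": True, "results": results})
    except Exception as exc:  # Redshift surfaces error_msg to the query
        return json.dumps({"success": False, "error_msg": str(exc)})
```

Amazon Redshift invokes a function like this through an external UDF declared with CREATE EXTERNAL FUNCTION, batching rows per slice as described above.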
The external UDF integration with Lambda scales efficiently with cluster size and workload. The following table shows measured benchmarks for Protegrity with Amazon Redshift, with throughput exceeding 180 million token operations per second (6 billion token operations / 33.1 seconds).
Median Query Time (s) – Cluster Size vs. Number of Token Operations
Conclusion
Protegrity’s agile and scalable solution met the rigorous data protection standards required by the services company, allowing it to store customers’ personally identifiable information (PII) in a protected (tokenized) format. Seamless deployment across the enterprise’s platforms ensured federal regulatory compliance and gave customers peace of mind regarding their PII data security. Additionally, the solution reduced time to insights by 30% and increased data utilization for analytics by 40%, because the data was already protected and required no additional security measures before analysis.