Previously, on the token blog, we pondered various approaches to protecting personal data from exploitation. The strategies include perimeter maintenance (the “secure data environment”), well-defined and -enforced access policies, encryption of data in transit and at rest, and tokenization. The wide deployment of firewalls and network partitions, authentication and authorization services, TLS and encrypted databases address the first three of these categories. The subtleties of tokenization, on the other hand, merit deeper attention.
Personal Data Responsibility
Tokenization substitutes a “token” for a single value. Take a user profile, for example. A service might require a username and password for authentication, an email address for password resets, and a brief user bio. Each field contains personal or potentially personal information. A single record might look like this:
bio: VP Engineering at Supercool Analyticz, LLC. Wife of eleanorrigby. Celica aficionado. Go Fighting Ducks!
The sensitivity of the email address, password, and phone number is self-evident. But even the bio can be revealing: identity thieves can use this information to spoof location based on company name and guess answers to typical KBA questions (“first car”, “college attended”). Some services make this information public, but for a site that requires mutual consent to share a bio, the contents might be more sensitive. Loss of such data would violate the trust between user and service provider.
Data processors and controllers have a responsibility to take reasonable steps to protect personal data entrusted to them, not only from external breaches, but also from internal exposure. Employees mustn’t see such data in the clear, not only to prevent leakage, but also to reduce bias. Disclosure of the barest of personal information may play into the blind spots of even the most responsible of people, potentially leading to unfair treatment and outcomes. Imagine a support person uncomfortable about gay marriage seeing the above bio. Might it cause them to treat the user differently than others?
Far better to expose as little information as possible. However, the requirements for each field vary depending on usage. Let’s leave aside the password field, for now; we expect that the self-evident secrecy requirements long ago led most organizations to adopt password hashing to protect passwords. Focus instead on the other fields. They’re not inherently secret, but personal, still deserving of protection from unnecessary disclosure. What are the requirements for tokenizing the username, email address, and bio?
Let’s lay them out:
- Username: Must be unique across all users of the system. Used purely to look up a user record for authentication. Never displayed or used in any other way.
- Email: Must be unique across all users of the system. Used to send password reset messages, as well as notifications for authentication from new devices.
- Bio: Free form text displayed on the user’s personal page, to be shown only to the user and other users to whom the user has allowed access.
These requirements demonstrate the two dimensions of tokenization: reversibility and determinism. Reversible tokens may be detokenized to recover their original values. Deterministic tokens are always the same given the same inputs. For example, the phone number +1-503-987-3456 might be tokenized as OIGM09jeWSEz_yNN-oXMrQ, and must be tokenized with exactly that string every time. This contrasts with non-deterministic tokens, where each tokenization of +1-503-987-3456 returns a different token.
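The determinism dimension can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm behind the sample token above: it assumes a keyed HMAC-SHA-256 for the deterministic case and a random value for the non-deterministic case, and the names `KEY`, `deterministic_token`, and `random_token` are hypothetical. Neither function is reversible on its own; recovering the original value would require a stored mapping or encryption.

```python
import hashlib
import hmac
import secrets
from base64 import urlsafe_b64encode

# Hypothetical secret key; a real deployment would load it from a key manager.
KEY = secrets.token_bytes(32)


def deterministic_token(value: str, key: bytes = KEY) -> str:
    """Return the same token every time for the same value and key."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).digest()
    # Keep 16 bytes and encode as unpadded URL-safe base64 (22 characters).
    return urlsafe_b64encode(digest[:16]).rstrip(b"=").decode("ascii")


def random_token() -> str:
    """Return a fresh random token on every call, unrelated to any value."""
    return secrets.token_urlsafe(16)


phone = "+1-503-987-3456"
# Deterministic: two tokenizations of the same phone number match.
print(deterministic_token(phone) == deterministic_token(phone))
# Non-deterministic: two tokens differ (with overwhelming probability).
print(random_token() != random_token())
```

A deterministic token can serve as a lookup key or uniqueness constraint, while a non-deterministic token reveals nothing about whether two records hold the same value.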
A quadrant graph nicely illustrates the options created by these dimensions:
In truth, the top-left quadrant, non-reversible and deterministic, is traditionally filled by cryptographic hash functions, such as SHA-256. Similarly, the bottom-right quadrant, reversible and non-deterministic, corresponds to symmetric encryption with a random initialization vector, such as AES in CBC mode. Of course, the bottom-left quadrant, non-reversible and non-deterministic, isn’t useful at all. It’s that top-right corner, requiring deterministic, reversible values, that’s the sweet spot for tokenization.
Returning to the user profile, the fields map to the dimensions as follows:
- Username: Deterministic and Non-reversible. Need to find the record for authentication, and the value is never displayed anywhere.
- Email: Deterministic and Reversible. Need to ensure email is unique across accounts and to recover the original value in order to send messages to the user.
- Bio: Non-deterministic and Reversible. No uniqueness or lookup requirement, but do need to recover original values for display in the UI.
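The mapping above can be sketched as a per-field policy driving a single tokenizer. This is an illustration under stated assumptions rather than any particular product's design: the `POLICY` table, the `Tokenizer` class, and the in-memory dictionary standing in for a secured token vault are all hypothetical, as is the sample email address. Deterministic tokens come from a keyed HMAC-SHA-256, and reversibility comes from storing the original value under its token.

```python
import hashlib
import hmac
import secrets

# Hypothetical per-field policy: (deterministic, reversible).
POLICY = {
    "username": (True, False),
    "email": (True, True),
    "bio": (False, True),
}


class Tokenizer:
    """Illustrative in-memory tokenizer; not a production design."""

    def __init__(self, key: bytes):
        self._key = key
        self._vault = {}  # token -> original value, reversible fields only

    def tokenize(self, field: str, value: str) -> str:
        deterministic, reversible = POLICY[field]
        if deterministic:
            # Keyed hash scoped by field name: same input, same token.
            msg = f"{field}:{value}".encode("utf-8")
            token = hmac.new(self._key, msg, hashlib.sha256).hexdigest()[:32]
        else:
            # Fresh random token on every call.
            token = secrets.token_hex(16)
        if reversible:
            self._vault[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Raises KeyError for tokens that were never stored (non-reversible).
        return self._vault[token]


tokenizer = Tokenizer(secrets.token_bytes(32))
email_token = tokenizer.tokenize("email", "user@example.com")
bio_token = tokenizer.tokenize("bio", "VP Engineering at Supercool Analyticz, LLC.")
print(tokenizer.detokenize(email_token))  # user@example.com
print(tokenizer.detokenize(bio_token))
```

Because username and email tokens are deterministic, a unique index on the token column is enough to enforce uniqueness without storing the original values alongside the record.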
Although one could rely on a cryptographic hash function to tokenize the username, and a crypto library to protect the bio, we find it useful to adopt a tokenization strategy that covers all three use cases. The consistency of interface ensures consistent treatment of values, easing the protection of data with different requirements without additional effort. When the goal is to protect all personal data, it’s easiest to adopt a solution that properly protects all personal data.
We believe it worthwhile to evaluate tokenization solutions that encompass these dimensions, encapsulating fully vetted and audited, industry-standard tokenization, encryption, and hashing algorithms in a single solution. Applying the patterns to our example user profile, the data now becomes safe to show to employees, and useless to identity thieves:
To be Continued
Alas, reversibility and determinism cover only a subset of the considerations when it comes to tokenization. Other variables to weigh include data type preservation, data storage strategies, and regulatory compliance vetting. We’ll cover those topics in future posts.