In a previous post, I talked about the challenges facing anyone attempting to strip the meaning from data without losing its context or immediate usefulness. If your data contains sensitive information such as names, dates of birth, credit card numbers, and the like, how do you sanitize it and still keep it useful?

Anonymization (sometimes called obfuscation) is the process of doing exactly that. We don’t remove the data; we alter it so that it becomes meaningless while retaining its essence. Where necessary, we can also maintain its relationship to other data so that your dataset as a whole is still useful for purposes such as broad analysis or development/testing activities.

There are many products on the market that claim to provide anonymization. We can discount most of them at first glance, either because they make no attempt to mimic the original data’s look and feel or because they are too limited or simplistic. What’s left tends to carry a very serious price tag and, as with most data tools, requires a considerable level of developer expertise to use successfully.

Conductor’s primary purpose is to move data from A to B with minimal fuss or technical knowledge. We began fielding a lot of questions from customers who wanted to anonymize their data on the way through. Given the lack of reasonable alternatives on the market, we set about building anonymization into Conductor.

There are many shortcuts and half measures available to the undisciplined practitioner in this area. However, we wanted the data to look exactly like the original, but meaningless. We also wanted the data to be consistent between subsequent generation cycles in order to make it as useful as possible for testing activities where reproducibility is important. And as with everything else in Conductor, it had to be ‘point and click’ without complex configuration or developer effort. Perhaps most importantly, there needed to be no way to reverse engineer the anonymization process in order to retrieve the original data.

How it works:

Conductor profiles the data in your source table. From this, we use smart heuristics to choose the best anonymization approach for each field. Conductor uses more than 40 different approaches to provide anonymization for hundreds of different types of data. Users can change the anonymization approach for each column or choose not to anonymize columns at all.
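To make the profiling idea concrete, here is a minimal sketch of how a column's values could be sampled and matched against patterns to suggest an approach. The patterns, the labels, the threshold, and the `suggest_approach` function are all hypothetical; they illustrate the general technique and are not Conductor's actual heuristics.

```python
import re
from typing import Iterable

# Hypothetical patterns for three kinds of data; a real profiler would need many more.
PATTERNS = {
    "credit_card": re.compile(r"\d{4}-\d{4}-\d{4}-\d{4}"),
    "date": re.compile(r"\d{1,2} [A-Z][a-z]{2} \d{4}"),
    "name": re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+"),
}


def suggest_approach(values: Iterable[str], threshold: float = 0.9) -> str:
    """Return the first label whose pattern matches at least `threshold` of the values."""
    sample = [v for v in values if v]  # ignore empty cells
    if not sample:
        return "none"
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in sample if pattern.fullmatch(v))
        if hits / len(sample) >= threshold:
            return label
    return "none"  # leave the column alone when nothing matches


column = ["Billy Connolly", "Tom Forbes"]
suggestion = suggest_approach(column)
```

A suggestion like this is only a starting point, which is why the user can still override it per column.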

When a process is executed, data is read from the source data store and checked as normal. Then it is passed through the anonymization filter and the newly generated data is sent to the destination data store.
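The flow described above can be sketched as a small pipeline. This is an assumption-laden illustration: the `Row` type, `anonymization_filter`, `run_process`, and the `mask` placeholder are invented names, the source and destination are plain Python lists rather than real data stores, and the checking step is left out because the post does not say what it involves.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]
Anonymizer = Callable[[str], str]


def anonymization_filter(
    rows: Iterable[Row], anonymizers: Dict[str, Anonymizer]
) -> Iterator[Row]:
    """Apply each column's anonymizer; columns without one pass through unchanged."""
    for row in rows:
        yield {
            column: anonymizers[column](value) if column in anonymizers else value
            for column, value in row.items()
        }


def run_process(
    source: Iterable[Row], destination: List[Row], anonymizers: Dict[str, Anonymizer]
) -> None:
    """Read from the source, filter each row, and write it to the destination."""
    for row in anonymization_filter(source, anonymizers):
        destination.append(row)


def mask(value: str) -> str:
    """Placeholder anonymizer that replaces every character with 'x'."""
    return "x" * len(value)


source_rows = [{"id": "1", "name": "Billy Connolly"}]
destination_rows: List[Row] = []
run_process(source_rows, destination_rows, {"name": mask})
```

Because the filter sits between reading and writing, the original values never reach the destination.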

Each time a cell is anonymized, Conductor generates the exact same data as the last time, provided the data has not changed. This means you can reliably copy a consistent set of data from your production system back into your test system and know that your test procedures will produce the same results each time.
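One common way to get this kind of repeatability is to derive the replacement from a keyed hash of the original value. The sketch below assumes that technique; the name pools, the secret key, and the `anonymize_name` function are hypothetical, and the post does not say whether Conductor works this way.

```python
import hashlib
import hmac

# Hypothetical replacement pools and secret; none of these come from Conductor.
FIRST_NAMES = ["Tom", "Anna", "Ravi", "Maria", "Chen", "Olga"]
LAST_NAMES = ["Forbes", "Nguyen", "Okafor", "Larsen", "Patel", "Reyes"]
SECRET_KEY = b"example-secret-kept-outside-the-dataset"


def anonymize_name(original: str) -> str:
    """Map a real name to a fake one; the same input and key give the same output."""
    digest = hmac.new(SECRET_KEY, original.encode("utf-8"), hashlib.sha256).digest()
    first = FIRST_NAMES[digest[0] % len(FIRST_NAMES)]
    last = LAST_NAMES[digest[1] % len(LAST_NAMES)]
    return f"{first} {last}"


first_run = anonymize_name("Billy Connolly")
second_run = anonymize_name("Billy Connolly")  # identical to first_run
```

Nothing is stored between runs: the output depends only on the input value and the key, so a changed input simply produces a different replacement.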

Almost all of our data anonymization approaches are intentionally ‘lossy’. This means that part of the nature of the original data is deliberately removed in the anonymization process in order to prevent anyone from reverse engineering your data to retrieve the original. The only exception is with foreign keys, where the anonymization must produce unique data for reference purposes: no two distinct keys may accidentally map to the same value.
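The difference between a lossy mapping and a unique one can be shown with a short sketch. Here the lossy function squeezes values into a small pool, so distinct inputs can share an output, while the key function uses a keyed Feistel network, a standard construction that always yields a one-to-one mapping. The function names, the key, and the 32-bit key range are assumptions for illustration, not a description of Conductor's internals.

```python
import hashlib
import hmac

KEY = b"example-secret"  # hypothetical secret
ROUNDS = 4
HALF_BITS = 16
MASK = (1 << HALF_BITS) - 1


def anonymize_lossy(value: str, pool_size: int = 1000) -> int:
    """Lossy: many inputs collapse onto the same output, so the original is unrecoverable."""
    digest = hmac.new(KEY, value.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % pool_size


def _round(half: int, index: int) -> int:
    """Keyed round function producing a 16-bit value."""
    message = index.to_bytes(1, "big") + half.to_bytes(2, "big")
    digest = hmac.new(KEY, message, hashlib.sha256).digest()
    return int.from_bytes(digest[:2], "big")


def anonymize_key(key: int) -> int:
    """Unique: a Feistel network permutes 32-bit integers, so no two keys collide."""
    if not 0 <= key < (1 << 32):
        raise ValueError("key must fit in 32 bits")
    left, right = key >> HALF_BITS, key & MASK
    for index in range(ROUNDS):
        left, right = right, left ^ _round(right, index)
    return (left << HALF_BITS) | right


new_keys = {anonymize_key(k) for k in range(5000)}  # 5000 distinct results
```

Applying `anonymize_key` to both the primary key column and every foreign key column that refers to it would keep the relationships between tables intact.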

So what does all this actually look like? Simple:
Billy Connolly becomes Tom Forbes,
12 May 1974 becomes 27 Jan 1948,
5017-8759-0015-4181 becomes 1131-7814-3224-0557

…and so on.
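As a rough illustration of keeping the look and feel of a value such as the card number above, the sketch below replaces each digit with a digit derived from a keyed hash and leaves separators where they were. The key and the `anonymize_digits` function are hypothetical, the digit choice is slightly biased, and no attempt is made to produce a valid card checksum.

```python
import hashlib
import hmac

SECRET_KEY = b"example-secret"  # hypothetical secret


def anonymize_digits(original: str) -> str:
    """Replace every digit with a derived digit; keep all other characters in place."""
    digest = hmac.new(SECRET_KEY, original.encode("utf-8"), hashlib.sha256).digest()
    output = []
    position = 0  # index of the next digest byte to use
    for character in original:
        if character.isdigit():
            output.append(str(digest[position % len(digest)] % 10))
            position += 1
        else:
            output.append(character)  # separators such as '-' are preserved
    return "".join(output)


fake_card = anonymize_digits("5017-8759-0015-4181")
```

The result has the same length and the same dash positions as the input, so it still fits any column or form field that expected the original format.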

There is obviously a nearly infinite number of possible types of data in the world and it’s simply not possible to cater for all of them. However, we will continually improve Conductor, providing more and more accurate anonymization for the most common types of data.

