Have you ever considered taking your production data and copying it into your test system so your developers and testers have ‘real’ data to play with? If so, how did you deal with personal information that should not be shared?

At Eight Wire, we set out to solve this problem with anonymization. In keeping with the simplicity of Conductor, we wanted to make it as simple and painless as possible. But first, we thought we’d look at the alternatives for doing it yourself.

The most obvious approach is to simply remove anything sensitive or blank it out. This might be acceptable if you’re handing the data off to someone who doesn’t care, but if you are using it in your own systems, this probably won’t work or will make the data practically useless.

Another way to protect yourself is to extend your production security policies and procedures to cover your entire company, or get all of your staff to sign nondisclosure agreements. The problem is that putting the ambulance at the bottom of the cliff doesn’t help when there is a sensitive data breach. The damage has already been done.

A better approach is to randomly anonymize sensitive data. This means replacing the sensitive data with something equivalent but innocuous. For example, a date of birth gets replaced by a random date somewhere in the past 100 years.
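As a rough illustration of the purely random approach, here is a minimal Python sketch. The function name `random_birth_date` and the 100-year window expressed in days are our own assumptions for the example, not part of any product.

```python
import random
from datetime import date, timedelta

# Hypothetical helper: swap a date of birth for a random date
# somewhere in (roughly) the past 100 years.
def random_birth_date(today=None):
    today = today or date.today()
    # 100 * 365 days is an approximation that ignores leap days.
    days_back = random.randint(0, 100 * 365)
    return today - timedelta(days=days_back)

if __name__ == "__main__":
    # Each call gives a different, unrelated date.
    print(random_birth_date())
    print(random_birth_date())
```

Note that the original value plays no part in the result, so nothing about it leaks, but nothing about the output can be reproduced either.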

This is fine, but you’ve broken the repeatability of your data. Testers will want to start each test run with exactly the same set of data they used last time, so they can tell if they’re getting consistent results. This makes a completely random approach single-use and will add overhead to every release.

The way around this is to use your original data to ‘seed’ the randomization process. For example, each time ‘2014-11-12’ is randomized it will always generate ‘1969-07-20’, according to some hidden formula. You can do the same with numbers and other simple types of data.
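One way to sketch a ‘hidden formula’ like this is a keyed hash of the original value. The example below is an assumption-laden illustration rather than a description of how any particular tool does it: `SECRET_KEY`, `ANCHOR`, `SPAN_DAYS` and `anonymize_date` are hypothetical names, and HMAC-SHA256 is simply one reasonable choice of seed function.

```python
import hashlib
import hmac
from datetime import date, timedelta

# Hypothetical secret; keeping it private is what keeps the formula 'hidden'.
SECRET_KEY = b"replace-with-a-real-secret"
# Fixed anchor date so that results do not drift from one run to the next.
ANCHOR = date(2015, 1, 1)
SPAN_DAYS = 100 * 365  # roughly 100 years

def anonymize_date(original: date) -> date:
    # Derive a repeatable number from the original value and the secret key.
    digest = hmac.new(SECRET_KEY, original.isoformat().encode("utf-8"),
                      hashlib.sha256).digest()
    offset = int.from_bytes(digest[:8], "big") % SPAN_DAYS
    return ANCHOR - timedelta(days=offset)

if __name__ == "__main__":
    first = anonymize_date(date(2014, 11, 12))
    second = anonymize_date(date(2014, 11, 12))
    print(first, second, first == second)  # the two results always match
```

The anchor is a fixed date rather than today’s date on purpose: if the window moved with the calendar, the same input would produce a different output tomorrow and repeatability would be lost.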

Generating dates and numbers is relatively simple. What about addresses, credit card numbers and people’s names? Creating anonymized data that still looks like where it came from, while protecting the original data, can take hundreds of hours of development and testing before you can start work.
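To give a feel for the extra work involved, here is a small sketch for two of those types. It assumes a keyed hash as the seed, a tiny made-up list of replacement names, and card numbers that only need to pass the standard Luhn check. All of the names in it (`SECRET_KEY`, `FIRST_NAMES`, `LAST_NAMES`, `anonymize_name`, `anonymize_card`) are hypothetical, and a real implementation would need far larger lookup lists plus rules for card prefixes, addresses and so on.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical secret
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan", "Casey"]
LAST_NAMES = ["Smith", "Jones", "Brown", "Wilson", "Taylor", "Walker"]

def _digest(value: str) -> bytes:
    # Repeatable, keyed hash of the original value.
    return hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256).digest()

def anonymize_name(full_name: str) -> str:
    d = _digest(full_name)
    # Use different digest bytes to pick the first and last name.
    return f"{FIRST_NAMES[d[0] % len(FIRST_NAMES)]} {LAST_NAMES[d[1] % len(LAST_NAMES)]}"

def luhn_check_digit(payload: str) -> int:
    # Luhn: from the right of the payload, double every second digit,
    # starting with the rightmost one.
    total = 0
    for i, ch in enumerate(reversed(payload)):
        n = int(ch)
        if i % 2 == 0:
            n *= 2
            if n > 9:
                n -= 9
        total += n
    return (10 - total % 10) % 10

def luhn_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == int(number[-1])

def anonymize_card(number: str) -> str:
    # Assumes the input contains at least two digits.
    digits = "".join(c for c in number if c.isdigit())
    body_len = len(digits) - 1
    # Turn the digest into a long decimal string and keep as many digits as needed.
    body = str(int.from_bytes(_digest(digits), "big")).zfill(body_len)[:body_len]
    return body + str(luhn_check_digit(body))

if __name__ == "__main__":
    print(anonymize_name("Ada Lovelace"))
    print(anonymize_card("4111 1111 1111 1111"))
```

Even this toy version has to know what a valid card number looks like, and it still ignores issuer prefixes and the risk of two real people mapping to the same fake name, which is where the hours start to add up.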

This is where we come in. Conductor can now use your source data to seed random changes that are repeatable for each run. The data is anonymized without adding any overhead for development teams. Data structures are classified as names, addresses, or identification numbers, and the source data is used to seed the new values. All of this is accomplished by you checking a single box and Conductor doing the rest.

In the next post, we’ll go into detail about how we did this and what it can mean for data integration teams that need to work with data without knowing what that data is.

“Errors using inadequate data are much less than those using no data at all.”

– Charles Babbage

