An Overview of Production, Anonymous and Synthetic Test Data
To build an understanding of the forms of test data used in organizations, I will give a short introduction to the matter. The definitions hold across all testing levels, from unit tests to integrated end-to-end tests.
Usually, the most straightforward and fastest way to obtain test data is to copy it from the production environment. This poses few technical hurdles, but in most companies and countries an ever-increasing number of data privacy and protection laws and policies restrict, with good reason, the use of real data for testing.
A production data set provides good coverage for most test cases, although there are limitations, for example when testing new features or combinations that have not yet occurred in production.
Thus, a shift is currently taking place away from production data in testing and towards more sophisticated forms of anonymous and synthetic test data.
The next step in the evolution from production data is to mask (anonymize) the data copied from the production environment before it is loaded into the test systems. If done thoroughly, this removes the compliance drawbacks of production data, but it does not alter the data topology. Thus, the data available for testing stays the same.
The main challenge in anonymizing is to alter the data so that real personal or business information and relationships cannot be reconstructed or recognized, without breaking the data integrity, which would make the data unusable from either a technical or a testing point of view.
For example, if you simply shuffle data in the production set, you might end up with a teenager who has a business contract and multiple residential addresses in multiple countries. Such a record might break the system because it cannot be loaded, or it might not satisfy the requirements of any test case.
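The shuffling problem can be illustrated with a minimal Python sketch. The records, the field names and the rule "only adults may hold a contract" are assumptions made up for this illustration; they do not come from a real system. Replacing only the name is also not sufficient anonymization on its own, it merely shows a change that leaves the relationships between fields intact.

```python
import random

# Hypothetical, hand-made records standing in for a production extract.
CUSTOMERS = [
    {"name": "Anna Adult", "birth_year": 1975, "contract": "business"},
    {"name": "Tom Teen", "birth_year": 2010, "contract": "none"},
    {"name": "Sam Senior", "birth_year": 1950, "contract": "private"},
]

REFERENCE_YEAR = 2024  # fixed so that the age check is reproducible


def shuffle_column(records, column, rng):
    """Naive masking: shuffle one column independently of all others."""
    values = [record[column] for record in records]
    rng.shuffle(values)
    return [{**record, column: value} for record, value in zip(records, values)]


def mask_names(records):
    """Replace only the identifying field, keeping related fields together."""
    return [
        {**record, "name": f"Customer-{index:03d}"}
        for index, record in enumerate(records, start=1)
    ]


def rule_violations(records):
    """Assumed business rule: only adults (18+) may hold a contract."""
    return [
        record
        for record in records
        if record["contract"] != "none"
        and REFERENCE_YEAR - record["birth_year"] < 18
    ]
```

Running rule_violations on the output of shuffle_column(CUSTOMERS, "contract", random.Random()) will, for most shuffles, report the teenager with a contract, whereas the output of mask_names passes the check.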
The degree of masking has to be chosen according to the target audience. The possibilities of reconstructing real information have to be weighed, including meta-level attack vectors such as insights into the distribution of different customer types or their number.
In addition to masking, excluding data from the set is a powerful way to reduce the effort required for sufficient masking, although its applicability depends on the use case.
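As a rough sketch of how exclusion and masking can work together, the following Python snippet keeps only the fields a test needs, pseudonymizes the ones that must stay consistent, and drops everything else. The field names, the key and the helper names are hypothetical. Keyed hashing is used here as one possible pseudonymization technique, not as a complete anonymization method; whether it is sufficient depends on the data and the target audience.

```python
import hashlib
import hmac

# Hypothetical secret; in practice it would be kept outside the test systems.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonym(value, key=SECRET_KEY):
    """Map a value to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:12]


def prepare(records, keep, mask):
    """Copy 'keep' fields as-is, pseudonymize 'mask' fields, drop the rest."""
    prepared = []
    for record in records:
        row = {field: record[field] for field in keep}
        row.update({field: pseudonym(record[field]) for field in mask})
        prepared.append(row)
    return prepared
```

Every field that is excluded is a field that no longer needs a masking rule, and because the same input always yields the same pseudonym, references between records remain intact.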
Data which is not copied but generated for a purpose other than production use is synthetic. Functionally, this approach is the complete solution regarding compliance and data availability. It allows the exact set of data needed for all test cases, and beyond, to be created.
Since you have to reimplement the essence of the business logic and rules that lie within the whole system in order to generate usable data, and find ways to load every part of the system, it requires large technical and organizational efforts, both initially and in maintenance. Additionally, every test case has to specify exactly what is necessary because there is no pre-existing data to search through, which results in higher requirements for test data design.
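To make the idea of reimplementing business rules in a generator more concrete, here is a minimal Python sketch. The contract types, the age rule and all function names are assumptions chosen for illustration; a real generator would have to encode the rules of the actual system and also know how to load the data into it.

```python
import random

CONTRACT_TYPES = ("none", "private", "business")
ADULT_AGE = 18  # assumed rule: contract holders must be adults
MAX_AGE = 90


def generate_customer(rng, customer_id, contract=None, min_age=0):
    """Create one synthetic customer that satisfies the assumed rules."""
    if contract is None:
        contract = rng.choice(CONTRACT_TYPES)
    # The business rule is enforced during generation, not checked afterwards.
    lowest_age = max(min_age, ADULT_AGE if contract != "none" else 0)
    age = rng.randint(lowest_age, max(lowest_age, MAX_AGE))
    return {
        "customer_id": f"SYN-{customer_id:05d}",
        "age": age,
        "contract": contract,
    }


def generate_customers(seed, count, **spec):
    """Generate a reproducible batch; 'spec' carries the test case's needs."""
    rng = random.Random(seed)
    return [generate_customer(rng, i, **spec) for i in range(1, count + 1)]
```

A test case that needs, say, twenty business customers aged 65 or older would state this explicitly, for example generate_customers(seed=1, count=20, contract="business", min_age=65). The fixed seed makes the batch reproducible across test runs.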
Synthetic data can be an enabler for organizations as it allows data to be shared, for example with third-party vendors, with minimal compliance concerns. It can also open doors for use cases like load and performance testing or test automation, as the data-generating application should be able both to create big data sets and to offer technical interfaces.
Usually, synthetic data creation also forces the issue of data distribution to testers to be addressed properly, as a closed loop of ordering, generating and distributing or accessing the needed data has to be implemented in a central location. This reduces the test data handling effort for testing and delivery teams, a return on investment that scales.
When contrasting the high price of synthetic data with the cost of the copy approaches (production and anonymous test data), you have to take into account the hidden effort that these solutions impose on the teams requiring the data for testing. From our experience with customers, we know that ~70% of the everyday effort for test data lies in finding and managing the required data and ~30% in creating what is not represented in production. Additionally, every team is burdened with finding and creating data on its own, which results in many different, inefficient solutions for the same or similar tasks. This usually happens under the radar of management, which could otherwise enable the introduction of overarching and scalable solutions for test data creation and accessibility.
You are now aware of the different forms of test data and have gained some insight into what has to be considered with each. It should now be clearer that there is no black-and-white answer to the question of the best solution. The use cases, needs and structure of your organization have to be considered carefully before making a decision, especially when calculating the actual cost of each method.
Consultant: Patrick Stalder